Machine learning open source tools

2010-05-12  Source: original to this site  Category: Internet  Views: 247

Most of these tools are open source, released under the GPL, Apache, or other open-source licenses. Please read each tool's license statement before using it.

I. Information Retrieval
1. Lemur / Indri
The Lemur Toolkit for Language Modeling and Information Retrieval
http://www.lemurproject.org/
Indri: the newest search engine from the Lemur project

2. Lucene / Nutch
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Lucene is a top-level Apache open source project, released under the Apache License 2.0 and written entirely in Java, with ports to Perl, C/C++, .NET, and other languages.
http://lucene.apache.org/
http://www.nutch.org/
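
As a rough illustration of what a full-text search library does internally, the sketch below builds a tiny inverted index in Python. It is not Lucene's API: the function names `build_index` and `search` and the sample documents are hypothetical, and a real engine adds text analysis, relevance scoring, and on-disk storage.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents, for illustration only.
docs = {
    1: "Lucene is a text search engine library",
    2: "Nutch builds on Lucene",
    3: "Wget retrieves files over HTTP",
}
index = build_index(docs)
hits = search(index, "lucene search")
```

Here `hits` holds the ids of the documents that contain both query terms.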

3. WGet
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive command-line tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
http://www.gnu.org/software/wget/wget.html

II. Natural Language Processing
1. EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
Includes four tools, GIZA among them

2. GIZA++ (Statistical Machine Translation)
http://www.fjoch.com/GIZA++.html
GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT) which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och.
Franz Josef Och has worked at Aachen University in Germany, at ISI (the Information Sciences Institute, University of Southern California), and at Google. A Windows port of GIZA++ is now available, and it supports IBM Models 1-5 well.
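
To give a feel for what the simplest of the IBM models estimates, here is a minimal Python sketch of IBM Model 1 training with EM. It is not GIZA++ code: the function names and the three-sentence toy corpus are made up for illustration, and the NULL source word and every refinement of Models 2-5 are left out.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=20):
    """Estimate word translation probabilities t[(e, f)] = t(e|f) with EM.

    pairs is a list of (foreign_words, english_words) sentence pairs.
    """
    e_vocab = {e for _, es in pairs for e in es}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (e, f) link
        total = defaultdict(float)  # expected count of each f
        # E-step: distribute each English word over the foreign words
        # of the same sentence pair, in proportion to the current t.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalise the expected counts per foreign word.
        t = defaultdict(float, {ef: c / total[ef[1]] for ef, c in count.items()})
    return t

def best_translation(t, f):
    """Return the English word with the highest t(e|f) for foreign word f."""
    candidates = {e: p for (e, f2), p in t.items() if f2 == f}
    return max(candidates, key=candidates.get)

# Hypothetical toy corpus (German -> English).
pairs = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = train_ibm_model1(pairs)
```

After training, `best_translation(t, "haus")` picks out "house", even though "haus" co-occurs with "the" just as often, because EM assigns "the" to "das" instead.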

3. PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
a beam search decoder for phrase-based statistical machine translation models

4. OpenNLP:
http://opennlp.sourceforge.net/
Includes more than 20 tools, Maxent among them

By the way, the SMT community likes to name its tools after things Egyptian, such as GIZA, PHARAOH, and Cairo. Och worked on GIZA++ while at ISI, and PHARAOH was developed by Philipp Koehn, also of ISI, so the relationships are quite intertwined.

5. MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
The binary can be downloaded free of charge after filling in a form.
http://www.cs.ualberta.ca/~lindek/minipar.htm
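
The precision and recall figures quoted above are measured over dependency relationships. A minimal sketch of how such numbers are computed follows. The triples and relation labels are invented for illustration and are not MINIPAR's actual output format.

```python
def precision_recall(predicted, gold):
    """Compare two sets of (head, dependent, relation) triples."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold-standard and parser output for one sentence.
gold = {
    ("saw", "John", "subj"),
    ("saw", "dog", "obj"),
    ("dog", "the", "det"),
    ("dog", "big", "mod"),
    ("saw", "yesterday", "mod"),
}
predicted = {
    ("saw", "John", "subj"),
    ("saw", "dog", "obj"),
    ("dog", "the", "det"),
    ("saw", "big", "mod"),   # wrong head, so it counts against precision
}
precision, recall = precision_recall(predicted, gold)
```

With 3 of the 4 predicted triples correct and 3 of the 5 gold triples found, precision is 3/4 and recall is 3/5.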

6. WordNet
http://wordnet.princeton.edu/
WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller (Principal Investigator).
The latest version of WordNet is 2.1 (for Windows and Unix-like OSes), distributed as binaries, source, and documentation.
The online version of WordNet is http://wordnet.princeton.edu/perl/webwn
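
The idea of synonym sets linked by relations can be pictured with a small data structure. The Python sketch below is only a toy: the synset identifiers, the entries, and the helper names are hypothetical and are not WordNet's file format or API.

```python
# Each synset groups words that share one lexical concept and points to
# a more general synset through a "hypernym" relation.
synsets = {
    "dog.n.01": {"words": ["dog", "domestic dog"], "hypernym": "canine.n.01"},
    "canine.n.01": {"words": ["canine", "canid"], "hypernym": "carnivore.n.01"},
    "carnivore.n.01": {"words": ["carnivore"], "hypernym": None},
}

def synonyms(word):
    """Return all other words that share a synset with the given word."""
    found = set()
    for entry in synsets.values():
        if word in entry["words"]:
            found.update(w for w in entry["words"] if w != word)
    return found

def hypernym_chain(synset_id):
    """Follow hypernym links from a synset up to the most general one."""
    chain = []
    current = synsets[synset_id]["hypernym"]
    while current is not None:
        chain.append(current)
        current = synsets[current]["hypernym"]
    return chain

dog_synonyms = synonyms("dog")
dog_chain = hypernym_chain("dog.n.01")
```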

7. HowNet
http://www.keenage.com/
HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts as connoting in lexicons of the Chinese and their English equivalents.
Developed by Zhendong Dong and Qiang Dong of the Chinese Academy of Sciences (CAS); it is a resource similar to WordNet.

8. Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html

The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.

9. SRI Language Modeling Toolkit
http://www.speech.sri.com/projects/srilm/
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995.
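
For readers new to statistical language models, the following Python sketch trains a bigram model with add-one smoothing and scores sentences with it. It illustrates the general idea only: it does not use the interface of SRILM or of the CMU-Cambridge toolkit, and the function names and training sentences are made up.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over sentences padded with <s> and </s>."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])           # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    # Words that can be predicted: all training words plus the end marker.
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, vocab

def prob(model, prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    histories, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))

def log_prob(model, sentence):
    """Natural-log probability of a whole sentence under the model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(prob(model, a, b)) for a, b in zip(tokens, tokens[1:]))

# Hypothetical training data.
model = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
seen = log_prob(model, "the cat sat")
scrambled = log_prob(model, "sat cat the")
```

The word order seen in training scores higher than the scrambled one, which is the property speech recognisers and taggers rely on.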

10. ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
The ISI ReWrite Decoder Release 1.0.0a by Daniel Marcu and Ulrich Germann. It is a program that translates from one natural language into another using statistical machine translation.

11. GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
A Java Library for Text Engineering

III. Machine Learning
1. YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
Written by Franz Josef Och. In addition, the OpenNLP project includes a Java MaxEnt tool that estimates its parameters using GIS (Generalized Iterative Scaling); Chang Le of Northeastern University (currently studying in the UK) ported it to C++.

2. LibSVM
Developed by Chih-Jen Lin of National Taiwan University (NTU); available for C++, Java, Perl, C#, and other languages.
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification.
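
LIBSVM's command-line tools read training data in a sparse text format, one example per line, written as `<label> <index>:<value> ...`. The Python sketch below parses one such line; the helper name `parse_libsvm_line` and the sample line are hypothetical, and this is not part of LIBSVM itself.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features

# Features that are not listed are treated as zero by convention.
label, features = parse_libsvm_line("+1 1:0.5 3:1.25 7:-2")
```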

3. SVM Light
Developed by Thorsten Joachims (now at Cornell University) while he was at Dortmund, SVM Light is the best-known SVM package after LibSVM. It is open source, written in C, and can also be used for ranking problems.
http://svmlight.joachims.org/

4. CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
A software package for clustering low- and high-dimensional datasets
The package is distributed only as executables and a library; the source code is not available for download.

5. CRF++
http://chasen.org/~taku/software/CRF++/
Yet Another CRF toolkit for segmenting/labelling sequential data
CRFs (Conditional Random Fields) grew out of HMMs and MEMMs and are widely used in IE, IR, and NLP.

6. SVM Struct
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
Like SVM Light, developed by Thorsten Joachims of Cornell.
SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping
h: X -> Y
using labeled training examples (x1, y1), ..., (xn, yn).
Unlike regular SVMs, however, which consider only univariate predictions like in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and Markov models for part-of-speech tagging.
SVMstruct can be thought of as an API for implementing different kinds of complex prediction algorithms. Currently, we have implemented the following learning tasks:
SVMmulticlass: Multi-class classification. Learns to predict one of k mutually exclusive classes. This is probably the simplest possible instance of SVMstruct and serves as a tutorial example of how to use the programming interface.
SVMcfg: Learns a weighted context free grammar from examples. Training examples (e.g. for natural language parsing) specify the sentence along with the correct parse tree. The goal is to predict the parse tree of new sentences.
SVMalign: Learning to align sequences. Given examples of how sequence pairs align, the goal is to learn the substitution matrix as well as the insertion and deletion costs of operations so that one can predict alignments of new sequences.
SVMhmm: Learns a Markov model from examples. Training examples (e.g. for part-of-speech tagging) specify the sequence of words along with the correct assignment of tags (i.e. states). The goal is to predict the tag sequences for new sentences.
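
Predicting a tag sequence from a Markov model is usually done with Viterbi decoding. The Python sketch below shows only that decoding step, using hand-set probabilities; it does not show how SVMhmm learns its model, and the tag set, the probability tables, and the function name are all made up for illustration.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for words under an HMM."""
    # best[i][s] = (probability of the best path ending in s at word i,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(word, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Hypothetical toy model for part-of-speech tagging.
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emit_p = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.6, "cat": 0.3, "barks": 0.1},
    "VERB": {"barks": 0.7, "runs": 0.3},
}
tags = viterbi("the dog barks".split(), states, start_p, trans_p, emit_p)
```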

IV. Misc:
1. Notepad++: an open source editor that supports C#, Perl, CSS, and many other languages; its keyword and function features are comparable to those of the latest UltraEdit and Visual Studio .NET
http://notepad-plus.sourceforge.net

2. WinMerge: a text comparison tool for finding the differences between two versions of a program
http://winmerge.sourceforge.net/

3. OpenPerlIDE: an open source Perl editor with a built-in compiler and line-by-line debugging
http://open-perl-ide.sourceforge.net/
P.S. The best editor I have seen so far is still VS.NET: the +/- marks in front of each function support expand/collapse; it supports block copy/cut/paste; Ctrl+C / Ctrl+X / Ctrl+V operate on a whole line; Ctrl+K, Ctrl+C and Ctrl+K, Ctrl+U comment and uncomment multiple lines; and there is more... Visual Studio .NET is really cool :D

4. Berkeley DB
http://www.sleepycat.com/
Berkeley DB is not a relational database; it is what is called an embedded database. In client/server terms, its client and server share a single address space. Because the database originally grew out of a file system, it behaves more like a typical key-value database. The database file can be persisted to disk, so it is not limited by the size of memory. BDB also has a sub-edition, Berkeley DB XML, which is an XML database that appears to store its data as XML documents. BDB has been built into products by Microsoft, Google, HP, Ford, Motorola, and others.
Berkeley DB (libdb) is a programmatic toolkit that provides embedded database support for both traditional and client/server applications. It includes B+tree, queue, extended linear hashing, fixed, and variable-length record access methods, transactions, locking, logging, shared memory caching, database recovery, and replication for highly available systems. DB supports C, C++, Java, PHP, and Perl APIs.
It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed into a string of bytes, so what you do with it is completely up to you. The only operations available are "store this value under this key", "check if this key exists" and "retrieve the value for this key", so conceptually it's pretty simple - the complicated stuff all happens under the hood.
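
The three operations described above can be tried out with any on-disk key-value store. The sketch below uses Python's standard-library `dbm.dumb` module purely as a stand-in; it is not Berkeley DB's API, and the file name and the key are hypothetical.

```python
import dbm.dumb
import os
import tempfile

# A throwaway location for the database files.
path = os.path.join(tempfile.mkdtemp(), "example_db")

with dbm.dumb.open(path, "c") as db:      # "c": create the database if missing
    db[b"user:42"] = b"alice"             # store this value under this key
    exists = b"user:42" in db             # check if this key exists
    value = db[b"user:42"]                # retrieve the value for this key
    missing = b"user:99" in db

# Reopen read-only to confirm the data was persisted to disk.
with dbm.dumb.open(path, "r") as db:
    persisted = db[b"user:42"]
```

Keys and values are plain byte strings, which matches the "anything that fits in a string of bytes" model described above.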
Case studies:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in several products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.

5. R
http://www.r-project.org/
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
R is statistical software similar to MATLAB; both are used in scientific computing.

Reposted from: http://kapoc.blogdriver.com/kapoc/1268927.html