Using PostgreSQL + bamboo to build full-text search that is N times more convenient than Lucene

2010-11-12  Source: original to this site  Category: Database

Packages used:

cmake-2.6.4.tar.gz (used to build nlpbamboo)

CRF++-0.53.tar.gz (also needed to build nlpbamboo)

nlpbamboo-1.1.1.tar.bz2 (used for word segmentation)

postgresql-8.3.3.tar.gz (used for indexing)

Installing pgsql

tar -zxvf postgresql-8.3.3.tar.gz

cd postgresql-8.3.3

./configure --prefix=/opt/pgsql

make
make install

useradd postgre

chown -R postgre.postgre /opt/pgsql
su - postgre
vi ~postgre/.bash_profile
Add:
export PATH
PGLIB=/opt/pgsql/lib
PGDATA=/data/PGSearch
PATH=$PATH:/opt/pgsql/bin
MANPATH=$MANPATH:/opt/pgsql/man
export PGLIB PGDATA PATH MANPATH

# mkdir -p /data/PGSearch

# chown -R postgre.postgre /data/PGSearch

# chown -R postgre.postgre /opt/pgsql

# sudo -u postgre /opt/pgsql/bin/initdb --locale=zh_CN.UTF-8 --encoding=utf8 -D /data/PGSearch

Start the server (-i allows network access):

# sudo -u postgre /opt/pgsql/bin/postmaster -i -D /data/PGSearch &

# sudo -u postgre /opt/pgsql/bin/createdb kxgroup

Edit pg_hba.conf and add a line like the following to grant access to the client machine:

# vim /data/PGSearch/pg_hba.conf

host all all 10.2.19.178 255.255.255.0 trust

Then restart the server so the change takes effect:

# su - postgre

$ pg_ctl stop

$ postmaster -i -D /data/PGSearch &

Installing Chinese word segmentation (CMake, CRF++, bamboo)

CMake is used to build bamboo, and CRF++ is a dependency of bamboo.

tar -zxvf cmake-2.6.4.tar.gz

cd cmake-2.6.4
./configure
gmake
make install

tar -zxvf CRF++-0.53.tar.gz
cd CRF++-0.53
./configure
make
make install

tar -jxvf nlpbamboo-1.1.1.tar.bz2
cd nlpbamboo-1.1.1
mkdir build
cd build/
cmake .. -DCMAKE_BUILD_TYPE=release
make all
make install

cp index.tar.bz2 /opt/bamboo/
cd /opt/bamboo/
tar -jxvf index.tar.bz2

# /opt/bamboo/bin/bamboo

If you see:

ERROR: libcrfpp.so.0: cannot open shared object file: No such file or directory

run:

ln -s /usr/local/lib/libcrfpp.so.* /usr/lib/
ldconfig

Adding the Chinese word segmentation extension to pgsql

Also add the following to /root/.bash_profile:

# vim /root/.bash_profile

PGLIB=/opt/pgsql/lib
PGDATA=/data/PGSearch
PATH=$PATH:/opt/pgsql/bin
MANPATH=$MANPATH:/opt/pgsql/man
export PGLIB PGDATA PATH MANPATH

# source ~/.bash_profile

cd /opt/bamboo/exts/postgres/chinese_parser/
make
make install

su - postgre
cd /opt/pgsql/share/contrib/
touch /opt/pgsql/share/tsearch_data/chinese_utf8.stop
psql kxgroup

Inside psql, import the script:

\i chinese_parser.sql

Then run the following SQL; if the passage comes back segmented into words, the installation works:

SELECT to_tsvector('chinesecfg', 'the results of the implementation of bamboo in the command line to know');
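
If the query above fails, it can help to confirm that the configuration was registered and to look at how the parser splits a sentence. This is a small sketch, assuming chinese_parser.sql created a text search configuration named chinesecfg (the name used above). pg_ts_config and ts_debug are standard in PostgreSQL 8.3 and later; the sample sentence is made up for illustration.

```sql
-- List the installed text search configurations;
-- chinesecfg should appear in the result.
SELECT cfgname FROM pg_ts_config;

-- Show, token by token, how the configuration processes a sentence.
-- The sentence is a hypothetical sample; replace it with your own text.
SELECT * FROM ts_debug('chinesecfg', '我爱北京天安门');
```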

That covers installation. The next part covers indexing and querying TEXT fields, which completes the construction of the search engine.

Part 1: The Basics

Let's start from one SQL statement:

select * from tabname where field_name @@ 'aa | bb' order by ts_rank(field_name, 'aa | bb') desc;

Literally, this SQL means: select from the table tabname the rows whose field_name matches the words aa or bb, and sort them by match rank, best first. (field_name is assumed to be a tsvector column here; in 8.3 the ranking function is ts_rank, while rank is the older tsearch2 name.)

To understand the statement above, you need to learn four concepts: tsvector, tsquery, @@, and gin.

1. tsvector:

PostgreSQL 8.3 ships with built-in full-text search; earlier versions needed tsearch2 installed and configured. It provides two data types (tsvector and tsquery) and searches a set of natural-language documents to locate the best matches for a query. tsvector is one of these types.

A tsvector value is a sorted list of distinct lexemes, that is, words normalized so that different forms of the same word become a single entry. On input, tsvector automatically removes duplicate entries and stores the rest in a fixed order. For example,

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'

Casting a string to tsvector splits it on white space; the resulting words are then deduplicated and stored in sorted order (word length is also taken into account).

For full-text search in English and Chinese we rely on SQL like the following:

SELECT to_tsvector('english', 'The Fat Rats');
   to_tsvector
-----------------
 'fat':2 'rat':3

The to_tsvector function normalizes text into a tsvector; its first argument specifies the text search configuration, and therefore the word segmenter, to use.

2. tsquery:

As the name implies, tsquery is about queries. A tsquery stores the lexemes to search for, and they can be combined with the boolean operators & (AND), | (OR), and ! (NOT). Parentheses () can be used to force grouping.

A tsquery can also use weights when searching: each lexeme can carry one or more weight labels, and it then matches only tsvector lexemes that carry one of those weights. Like tsvector, tsquery has a corresponding function, to_tsquery.
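
As a brief illustration of operators and weight labels, here is a minimal sketch using only built-in functions (to_tsquery, to_tsvector, setweight). The sample words are arbitrary. The first weighted match should return true and the second false, because every lexeme in the document is labeled A.

```sql
-- Boolean operators and parentheses inside a tsquery.
SELECT to_tsquery('english', 'fat & (rat | cat) & !dog');

-- Label every lexeme of the document with weight A, then search for 'rat'
-- restricted to weight A (should match) and to weight B (should not match).
SELECT setweight(to_tsvector('english', 'Fat Rats'), 'A')
           @@ to_tsquery('english', 'rat:A') AS matches_a,
       setweight(to_tsvector('english', 'Fat Rats'), 'A')
           @@ to_tsquery('english', 'rat:B') AS matches_b;
```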

3. @@:

Full-text matching in PostgreSQL uses the @@ operator, which returns true if a tsvector (document) matches a tsquery (query).

Look at a simple example:

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
 ?column?
----------
 t
In practice, for example when working with indexes, we use the corresponding functions instead:
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
 ?column?
----------
 t
The @@ operator also accepts text in place of tsvector and tsquery. The available forms are:

tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
We have already used the first two. As for the last two,
text @@ tsquery is equivalent to to_tsvector(x) @@ y.
text @@ text is equivalent to to_tsvector(x) @@ plainto_tsquery(y). plainto_tsquery is covered later.

4. gin:

gin is the name of an index type, the one used for full-text indexes.

We can create a gin index to speed up searches. For example,

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));

There are several ways to create the index. An index can even cover the concatenation of two columns:
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
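
Another common way to build the index is to store the tsvector in its own column and index that column, so that queries do not recompute to_tsvector for every row. The sketch below follows the pattern in the PostgreSQL documentation. It assumes a table pgweb with text columns title and body, as in the examples above. The column, index and trigger names are illustrative.

```sql
-- Add a column that holds the precomputed tsvector.
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;

-- Fill it for existing rows; coalesce guards against NULL columns.
UPDATE pgweb SET textsearchable_index_col =
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''));

-- Index the stored column directly.
CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);

-- Keep the column current on INSERT and UPDATE with the built-in trigger function.
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON pgweb
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(textsearchable_index_col, 'pg_catalog.english', title, body);
```

The trade-off is extra storage and the need to keep the column in sync, in exchange for faster searches and ranking.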

Part 2: Going Further

With the basics covered, it is time for practice. To implement full-text search, we need to turn each document into a tsvector and express the user's query as a tsquery, and the query should return results ordered by relevance.

Look at a to_tsquery statement:

SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB

As you can see, when to_tsquery processes query text, the words in it must be joined by the logical operators & (AND), | (OR) and ! (NOT); parentheses may be used as well.

Running the following SQL raises an error, because the two words are not joined by an operator:

SELECT to_tsquery('english', 'Fat Rats');

The plainto_tsquery function turns plain text into a normalized tsquery. For the example above, plainto_tsquery automatically inserts the & operator between the words.
SELECT plainto_tsquery('english', 'Fat Rats');

 plainto_tsquery
-----------------
 'fat' & 'rat'
But plainto_tsquery does not recognize logical operators or weight labels in its input:
SELECT plainto_tsquery('english', 'The Fat & Rats:C');
   plainto_tsquery
---------------------
 'fat' & 'rat' & 'c'

Part 3: The Finale

After all of the above, everything comes down to one SQL statement, which is what this article is mainly about. Combined with the extension installed in the first part, the following SQL searches a field using word segmentation and also sorts the results by relevance:

select * from tabname where to_tsvector('chinesecfg', textname) @@ plainto_tsquery('search Diansha') order by ts_rank(to_tsvector('chinesecfg', textname), plainto_tsquery('search Diansha')) desc limit 10;

The create table and create index statements that come before this query are not written out here. Teaching how to fish is the point.
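
For readers who want a starting point, the table and index might look like the sketch below. Only tabname, textname and chinesecfg come from the article; the id column and the index name are hypothetical. The query also passes the configuration to plainto_tsquery explicitly, which is an assumption about what is intended rather than the author's exact statement.

```sql
-- Hypothetical table; only tabname and textname appear in the article.
CREATE TABLE tabname (
    id       serial PRIMARY KEY,
    textname text
);

-- Expression index. It uses the same two-argument to_tsvector call as the
-- WHERE clause below, so that the planner can match the expression.
CREATE INDEX tabname_textname_idx ON tabname
    USING gin (to_tsvector('chinesecfg', textname));

-- The search itself, best matches first.
SELECT *
FROM tabname
WHERE to_tsvector('chinesecfg', textname)
      @@ plainto_tsquery('chinesecfg', 'search Diansha')
ORDER BY ts_rank(to_tsvector('chinesecfg', textname),
                 plainto_tsquery('chinesecfg', 'search Diansha')) DESC
LIMIT 10;
```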
