Lucene 3.0 Principles and Code Analysis

2010-04-09 | Source: original | Category: Tech | Views: 414

Lucene 3.0 Principles and Code Analysis, by forfuture1978 (http://forfuture1978.javaeye.com)
This series of articles explains in detail the basic principles of the latest version of Lucene, together with an analysis of its code.
http://www.javaeye.com - JavaEye, building the best software development community. This book was generated automatically by JavaEye's DIY feature on 2010-02-22.
Contents
1. Lucene Study Summaries
1.1 Lucene Study Summary 1: The Basic Principles of Full-Text Search
1.2 Lucene Study Summary 2: The Overall Structure of Lucene
1.3 Lucene Study Summary 3: The Lucene Index File Format (1)
1.4 Lucene Study Summary 3: The Lucene Index File Format (2)
1.5 Lucene Study Summary 3: The Lucene Index File Format (3)
1.6 Lucene Study Summary 4: Analysis of the Lucene Indexing Process (1)
1.7 Lucene Study Summary 4: Analysis of the Lucene Indexing Process (2)
1.8 Lucene Study Summary 4: Analysis of the Lucene Indexing Process (3)
1.9 Lucene Study Summary 4: Analysis of the Lucene Indexing Process (4)
2. Questions About Lucene
2.1 Questions About Lucene (1): Why can a search for "China AND Republic" find documents that a search for "the Republic of China" cannot?
2.2 Questions About Lucene (2): Stemming and Lemmatization
2.3 Questions About Lucene (3): Lucene's Vector Space Model and Scoring Mechanism
2.4 Questions About Lucene (4): Four Ways to Influence Lucene's Scoring of a Document
1.1 Lucene Study Summary 1: The Basic Principles of Full-Text Search
Posted: 2009-12-11
This article is also available on CSDN at http://blog.csdn.net/forfuture1978/archive/2009/10/22/4711308.aspx
1. Overview
According to the definition at http://lucene.apache.org/java/docs/index.html:
Lucene is an efficient, full-text search library based on Java.
Therefore, before digging into Lucene itself, it is worth spending some effort understanding full-text search.
So what is full-text search? Let us start with the data in our lives.
Data in our lives falls broadly into two kinds: structured data and unstructured data.
• Structured data: data with a fixed format or limited length, such as databases and metadata.
• Unstructured data: data with no fixed format or fixed length, such as emails and Word documents.
Some places also mention a third kind, semi-structured data, such as XML and HTML, which can be handled as structured data when necessary, or have its plain text extracted and be handled as unstructured data.
Unstructured data is also called full-text data.
Following this classification of data, search also divides into two kinds:
• Search over structured data: for example, searching a database with SQL statements, or searching metadata, such as using Windows Search to search by file name, type, or modification time.
• Search over unstructured data: for example, using Windows Search to also search file contents, using the grep command under Linux, or using Google or Baidu to search a huge amount of content data.
For unstructured data, that is, full-text data, there are two main search approaches:
One is sequential scanning (Serial Scanning): to find files whose contents contain a given string, look at the documents one by one, reading each from beginning to end; if a document contains the string, it is one of the documents we are looking for; then move on to the next file until every file has been scanned. Windows Search can also search file contents this way, but it is quite slow. If you have an 80 GB hard disk and want to find every document on it containing a certain string, it could easily take hours. The grep command under Linux works this way too. This method may seem primitive, but for a small amount of data it is the most direct and convenient. For a large number of files, however, it is very slow.
Someone may say: sequential scanning over unstructured data is very slow, while searching structured data is relatively fast (because structured data has a certain structure that search algorithms can exploit); so why not find a way to give unstructured data a certain structure?
This idea is very natural, and it forms the basic idea of full-text search: extract part of the information from the unstructured data, reorganize it so that it has a certain structure, and then search that now-structured data, thereby achieving relatively fast search.
This part of the information, extracted from unstructured data and then reorganized, is what we call the index.
This statement is rather abstract; an example makes it easier to understand. Consider a dictionary: the pinyin table and the radical index of Chinese characters are effectively the dictionary's index, while the explanation of each word is unstructured. If the dictionary had no pinyin table and no radical index, the only way to find a word in the vast volume would be to scan it sequentially.
But some information about each word can be extracted and structured, such as its pronunciation, which is relatively structured: it divides into initials and finals, of which there are only a limited, enumerable number. So the pronunciations are laid out in a certain order, and each pronunciation points to the page where that word is explained in detail. We search the structured pronunciations to find the one we want, then follow the page number it points to, and there find our unstructured data, namely the explanation of the word.
The process of building an index first and then searching that index is called full-text search (Full-text Search).
The picture below, taken from "Lucene in Action", describes not only Lucene's search process but the general process of full-text search.
Full-text search generally involves two processes: index creation (Indexing) and index search (Search).
• Index creation: the process of extracting information from all the structured and unstructured data in the real world and creating an index from it.
• Index search: the process of receiving a user's query request, searching the created index, and returning the results.
This raises three important questions about full-text search:
1. What exactly is stored in the index? (Index)
2. How is the index created? (Indexing)
3. How is the index searched? (Search)
Let us examine each question in turn.
2. What exactly is stored in the index?
What exactly does the index need to keep?
First, let us look at why sequential scanning is slow:
the reason, in fact, is that the information we want to search for is inconsistent with the form in which the information is stored in the unstructured data.
The information stored in unstructured data tells us which strings each file contains; that is, given a file, finding its strings is relatively easy: a mapping from file to string. The information we want to search for is which files contain a given string; that is, given a string, we want the files: a mapping from string to file. The two are opposite. So if the index could store the mapping from string to file, search speed would improve greatly.
Since the string-to-file mapping is the reverse of the file-to-string mapping, an index that stores this information is called an inverted index.
The information stored in an inverted index generally looks as follows:
Suppose my document collection contains 100 documents. For convenience of explanation, we number them from 1 to 100 and obtain the structure shown below.
On the left is a series of stored strings, called the dictionary.
Each string points to a linked list of the documents (Document) containing it; this list of documents is called a posting list (Posting List).
With the index, the stored information is now consistent with the information we want to search for, which can greatly speed up searching.
For example, to find the documents that contain both the string "lucene" and the string "solr", we only need the following steps:
1. Fetch the posting list of documents containing the string "lucene".
2. Fetch the posting list of documents containing the string "solr".
3. Merge the two lists to find the documents that contain both "lucene" and "solr" (a sketch of this merge appears below).
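Since each posting list is kept sorted by document ID, step 3 can be done with a single linear merge of the two lists. Below is a minimal Java sketch of the idea, assuming each posting list is simply a sorted array of document IDs; it illustrates the principle only, not Lucene's actual implementation.

import java.util.ArrayList;
import java.util.List;

class PostingMergeDemo {
    // Intersect two posting lists (sorted arrays of document IDs).
    static List<Integer> intersect(int[] lucenePostings, int[] solrPostings) {
        List<Integer> result = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < lucenePostings.length && j < solrPostings.length) {
            if (lucenePostings[i] == solrPostings[j]) {
                result.add(lucenePostings[i]); // this document contains both terms
                i++;
                j++;
            } else if (lucenePostings[i] < solrPostings[j]) {
                i++; // advance the list with the smaller current ID
            } else {
                j++;
            }
        }
        return result;
    }
}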
Seeing this, someone might say: full-text search does speed up searching, but once the indexing process is added in, the total is not necessarily much faster than sequential scanning. Indeed, including the indexing process, full-text retrieval is not necessarily faster than sequential scanning, especially when the amount of data is small; and creating an index over a very large amount of data is a slow process.
There is a difference between the two, however: sequential scanning must scan everything every time, whereas the index only needs to be created once and then serves forever. Each subsequent search skips the index creation process and simply searches the already-built index.
This is one of the advantages of full-text search over sequential scanning: index once, use many times.
3. How to create an index
The index creation process of full-text search generally has the following steps:
Step 1: Prepare a number of original documents to be indexed (Document).
To make the index creation process easier to explain, two files are used as the example here:
File 1: Students should be allowed to go out with their friends, but not allowed to drink beer.
File 2: My friend Jerry went to school to see his students but found them drunk which is not allowed.
Step 2: Pass the original documents to the tokenizer component (Tokenizer).
The tokenizer component (Tokenizer) does the following things (this process is called Tokenize):
1. Split the document into individual words.
2. Remove punctuation.
3. Remove stop words (Stop word).
Stop words (Stop word) are the most common words in a language; since they have no special meaning, in most cases they cannot serve as search keywords, so they are removed when the index is created, which reduces the size of the index.
Common English stop words (Stop word) include "the", "a", "this", and so on.
For each language, the tokenizer component (Tokenizer) has a stop word (stop word) collection.
The results produced by the tokenizer (Tokenizer) are called tokens (Token).
In our example, we get the following tokens (Token):
"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer",
"My", "friend", "Jerry", "went", "school", "see", "his", "students",
"found", "them", "drunk", "allowed".
Step 3: Pass the tokens (Token) to the linguistic processing component (Linguistic Processor).
The linguistic processing component (Linguistic Processor) mainly performs language-specific processing on the tokens (Token).
For English, the linguistic processing component (Linguistic Processor) generally does the following:
1. Convert to lowercase (Lowercase).
2. Reduce words to their root form, such as "cars" to "car". This operation is called stemming.
3. Transform words into their root form, such as "drove" to "drive". This operation is called lemmatization.
Similarities and differences between stemming and lemmatization:
• What they have in common: both stemming and lemmatization reduce a word to a root form.
• Their approaches differ:
◦ Stemming uses a "reduction" approach: "cars" to "car", "driving" to "drive".
◦ Lemmatization uses a "transformation" approach: "drove" to "drive", "driving" to "drive".
• Their algorithms differ:
◦ Stemming mainly uses fixed algorithms to perform the reduction, such as removing "s", removing "ing" and adding "e", turning "ational" into "ate", and turning "tional" into "tion".
◦ Lemmatization mainly performs the transformation by keeping a dictionary, which maps, for example, "driving" to "drive", "drove" to "drive", and "am, is, are" to "be"; to make the transformation, one simply looks the word up in the dictionary.
• Stemming and lemmatization are not mutually exclusive; they overlap, and for some words both methods achieve the same transformation.
The output of the linguistic processing component (Linguistic Processor) is called a term (Term).
In our example, after linguistic processing we get the following terms (Term):
"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my",
"friend", "jerry", "go", "school", "see", "his", "student", "find", "them",
"drink", "allow".
It is precisely because of the linguistic processing step that a search for "drove" can also find "drive". A sketch of this analysis pipeline in Lucene follows below.
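As an illustration, tokenization and lowercasing can be observed directly with Lucene 3.0's analysis API. The snippet below is a minimal sketch using StandardAnalyzer, which tokenizes, lowercases, and removes stop words but does not stem (a stemming filter such as PorterStemFilter would have to be chained in separately); the field name "contents" is just an example.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzeDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream ts = analyzer.tokenStream("contents",
            new StringReader("Students should be allowed to go out with their friends"));
        TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {           // one token per iteration
            System.out.println(termAtt.term()); // e.g. "students", "allowed", "go", ...
        }
        ts.close();
    }
}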
Step 4: Pass the terms (Term) to the index component (Indexer).
The index component (Indexer) mainly does the following things:
1. Use the terms (Term) to create a dictionary.
In our example, the dictionary is as follows:
Term Document ID
student 1
allow 1
go 1
their 1
friend 1
allow 1
drink 1
beer 1
my 2
friend 2
jerry 2
go 2
school 2
see 2
his 2
student 2
find 2
them 2
drink 2
allow 2
2. Sort the dictionary in alphabetical order.
Term Document ID
allow 1
allow 1
allow 2
beer 1
drink 1
drink 2
find 2
friend 1
friend 2
go 1
go 2
his 2
jerry 2
my 2
school 2
see 2
student 1
student 2
their 1
them 2
3. Merge identical terms (Term) into document posting lists (Posting List).
In this table (shown in the figure), there are several definitions:
• Document Frequency is the document frequency, the total number of documents that contain this term (Term).
• Frequency is the term frequency, the number of times this term (Term) appears in a given document.
So for the term (Term) "allow": a total of two documents contain this term (Term), so the posting list behind the term (Term) has two entries. The first means the first document containing "allow", namely document 1, in which "allow" appears twice; the second means the second document containing "allow", namely document 2, in which "allow" appears once.
At this point the index has been created, and through it we can quickly find the documents we want.
In the process we discover, perhaps to our surprise, that searches for "drive", "driving", "drove", and "driven" can all find the document. That is because in our index "driving", "drove", and "driven" all become "drive" through linguistic processing; at search time, if you type "driving", the query likewise goes through steps one to three here and becomes a query for "drive", so the desired documents can be found.
4. How to search the index?
At this point it seems we can declare: "We have found the documents we want."
But that is not the end of the matter; finding documents is only one aspect of full-text search. Isn't it? If only one document, or just ten documents, contain our query string, we have indeed found them. But what if the results number in the thousands, or even tens of thousands? Which document do you want most?
Open Google, say. Suppose you want to find a job at Microsoft, so you type "Microsoft job", and you get 22,600,000 results back. What a huge number; suddenly the problem is not failing to find anything, but finding far too much. Among so many results, how do the most relevant ones come first?
Of course Google does this very well: you see job hunting at Microsoft right at the top. Imagine if the first few results were all "Microsoft does a good job at software industry..."; what a terrible thing that would be.
How can we, like Google, find the results most relevant to the query among tens of thousands of search results?
How do we determine the relevance between the documents found and the query?
This returns us to our third question: how to search the index?
Searching consists of the following steps:
Step 1: The user enters a query.
A query, like our everyday language, has a certain syntax.
Different queries have different syntaxes, just as SQL statements have a certain syntax.
The query syntax varies with the full-text retrieval system. The most basic elements are AND, OR, NOT, and the like.
For example, the user enters the statement: lucene AND learned NOT hadoop.
This means the user wants to find documents that contain both lucene and learned but do not contain hadoop.
Step 2: The query goes through lexical analysis, syntactic analysis, and linguistic processing.
Since the query has syntax, it must go through lexical analysis, syntactic analysis, and linguistic processing.
1. Lexical analysis identifies words and keywords.
In the example above, lexical analysis yields the words lucene, learned, hadoop and the keywords AND, NOT.
If a keyword is misspelled during input, no error is raised; in lucene AMD learned, where AND is misspelled as AMD, AMD simply participates in the query as an ordinary word.
2. Syntactic analysis forms a syntax tree according to the query's grammar rules.
If the query is found to violate the grammar rules, an error is raised. For example, lucene NOT AND learned is an error.
For the example above, lucene AND learned NOT hadoop forms a syntax tree as shown in the figure.
3. Linguistic processing is almost the same as the linguistic processing done during indexing.
For example, learned becomes learn.
After the second step, we obtain a syntax tree that has gone through linguistic processing.
Step 3: Search the index for documents that satisfy the syntax tree.
This step has several smaller steps:
1. First, in the inverted index, find the posting lists of documents containing lucene, learn, and hadoop respectively.
2. Second, merge the lists for lucene and learn to obtain the list of documents containing both lucene and learn.
3. Then take the difference between this list and the hadoop list, removing the documents that contain hadoop, to obtain the list of documents that contain both lucene and learn but do not contain hadoop.
4. This list of documents is the documents we are looking for. (In Lucene, this kind of combination can be expressed directly as a Boolean query; a sketch follows below.)
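For reference, the same AND/NOT combination can be built programmatically with Lucene 3.0's query API, as in the sketch below; the field name "contents" is an assumption for illustration.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// lucene AND learn NOT hadoop, expressed as a Boolean query
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("contents", "lucene")), BooleanClause.Occur.MUST);     // AND
query.add(new TermQuery(new Term("contents", "learn")),  BooleanClause.Occur.MUST);     // AND
query.add(new TermQuery(new Term("contents", "hadoop")), BooleanClause.Occur.MUST_NOT); // NOT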
Step 4: Sort the results according to the relevance between the found documents and the query.
Although the previous step gave us the documents we want, the results should also be sorted by their relevance to the query, the more relevant listed first.
How do we compute the relevance between a document and the query?
We treat the query as a short document and score (scoring) the relevance between documents; documents with good relevance get high scores and should be listed first.
So how do we rate the relationship between documents?
This is not an easy thing; let us first look at how to judge relationships between people.
First, a person has many elements: character, beliefs, hobbies, clothing, height, weight, and so on.
Second, for relationships between people, different elements have different importance: character, beliefs, and hobbies may matter more, while clothing, height, and weight may not matter as much. So people with the same or similar character, beliefs, and hobbies can easily become good friends, while people with different clothing, height, or weight can still be good friends.
Thus, to judge the relationship between two people, first find out which elements matter most to a relationship, such as character, beliefs, and hobbies. Second, compare the two people on those elements: one is optimistic and the other outgoing in character; one believes in Buddhism and the other in God; one loves playing basketball and the other loves playing football. We find that both are positive in character, devout in belief, and fond of sports in their hobbies, so the relationship between the two should be very good.
Now let us look at relationships between companies.
First, a company has many members: a general manager, managers, a chief technology officer, ordinary staff, security guards, and so on.
Second, for relationships between companies, different people have different importance: the general manager, the managers, and the chief technology officer may matter more, while ordinary staff and security guards matter less. So if the general managers, managers, and chief technology officers of two companies get along well, the two companies are likely to have a good relationship; but even if an ordinary employee of one company has a lasting grudge against an ordinary employee of the other, it can hardly affect the relationship between the two companies.
Thus, to judge the relationship between companies, first find out which people matter most to the relationship between companies, such as the general manager, the managers, and the chief technology officer. Second, judge the relationships between those people: the two general managers were once classmates, the managers come from the same hometown, and the chief technology officers were once venture partners. We find that between the two companies, the general managers, managers, and chief technology officers all have good relations, so the two companies should have a good relationship.
Having analyzed these two kinds of relationships, let us now look at how to judge the relationship between documents.
First, a document is composed of many terms (Term), such as search, lucene, full-text, this, a, what, and so on.
Second, for relationships between documents, different terms have different importance: for this document, search, Lucene, and full-text are relatively important, while this, a, and what may be relatively unimportant. So if two documents both contain search, Lucene, and fulltext, those two documents are quite relevant to each other; whereas if one document contains this, a, and what and another does not, that does not affect the relevance between the two documents.
Thus, to judge the relationship between documents, first find out which terms (Term) matter most to the relationship between documents, such as search, Lucene, and fulltext; then examine the relationships between those terms (Term).
Finding the importance of a term (Term) to a document is a process called computing the term weight (Term weight).
Computing the term weight (term weight) involves two parameters: the first is the term (Term), the second is the document (Document).
The term weight (Term weight) expresses the importance of this term (Term) in this document: the more important the term (Term), the greater its weight (Term weight), and thus the larger the role it plays when computing the relevance between documents.
Judging the relationships between terms (Term) to obtain the relevance between documents applies an algorithm known as the Vector Space Model (Vector Space Model).
Let us analyze these two processes carefully:
1. The process of computing the term weight (Term weight).
Two main factors affect the importance of a term (Term) in a document:
• Term Frequency (tf): the number of times this Term appears in this document. The larger tf is, the more important the term.
• Document Frequency (df): the number of documents that contain this Term. The larger df is, the less important the term.
Is that easy to understand? The more often a term (Term) appears in a document, the more important this term (Term) is to the document; for example, if the word "search" appears many times in a document, the document is probably mainly about that topic. But in a collection of English documents, does appearing in more documents make a term more important? No; this is adjusted by the second factor, which says that the more documents contain a term (Term), the more generic this term (Term) is, too generic to distinguish these documents, and hence the lower its importance.
It is like the technologies we programmers learn: for a programmer personally, the more deeply a technology is mastered the better (deeper mastery means more time spent, i.e., a larger tf), and the more competitive one is when job hunting. But across all programmers, the fewer people who know a technology the better (few people knowing it means a small df), which also makes one more competitive when job hunting. The value of a person lies in being irreplaceable; that is the idea here.
With the reasoning understood, let us look at the formula:

w(t, d) = tf(t, d) × log( n / df(t) )

where w(t, d) is the weight of term t in document d, tf(t, d) is the frequency of t in d, n is the total number of documents, and df(t) is the number of documents containing t.
This is only one typical, simple implementation of term weighting. Implementers of full-text retrieval systems each have their own implementations, and Lucene's differs slightly from this.
2. The process of judging relationships between Terms to obtain document relevance, namely the Vector Space Model algorithm (VSM).
We regard a document as a series of terms (Term), each term (Term) having a weight (Term weight); different terms (Term) influence the document's relevance score according to their weights in the document.
So we take the weights (term weight) of all the terms (term) in a document and treat them as a vector:
Document = (term1, term2, ..., term N)
Document Vector = (weight1, weight2, ..., weight N)
Similarly, we regard the query as a simple document and also express it as a vector:
Query = (term1, term2, ..., term N)
Query Vector = (weight1, weight2, ..., weight N)
We put all the retrieved document vectors and the query vector into one N-dimensional space, where every term (term) is one dimension, as shown in the figure.
We hold that the smaller the angle between two vectors, the greater the relevance.
So we compute the cosine of the angle as the relevance score: the smaller the angle, the larger the cosine value, the higher the score, and the greater the relevance.
Someone may ask: a query is generally very short and contains very few terms (Term), so the query vector's dimensionality is small; a document, on the other hand, is very long, contains many terms (Term), and has a high-dimensional vector. How can both have dimensionality N in your figure?
Here, since we want to put them into the same vector space, the dimensionality must naturally be the same; when it differs, take the union of the two term sets, and if a vector lacks some term (Term), let its weight (Term Weight) be 0.
The relevance scoring formula is as follows:

score(q, d) = cos θ = (Vq · Vd) / (|Vq| × |Vd|) = Σ (w(q,i) × w(d,i)) / ( sqrt(Σ w(q,i)²) × sqrt(Σ w(d,i)²) )
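As a minimal sketch of this computation (plain Java, not Lucene code), assuming the query and document have already been mapped into the same N-dimensional weight arrays, with 0 for missing terms:

// Cosine similarity between a query vector and a document vector.
static double cosineScore(double[] queryWeights, double[] docWeights) {
    double dot = 0, queryNorm = 0, docNorm = 0;
    for (int i = 0; i < queryWeights.length; i++) {
        dot += queryWeights[i] * docWeights[i];
        queryNorm += queryWeights[i] * queryWeights[i];
        docNorm += docWeights[i] * docWeights[i];
    }
    if (queryNorm == 0 || docNorm == 0) return 0; // no overlap, no relevance
    return dot / (Math.sqrt(queryNorm) * Math.sqrt(docNorm));
}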
For example, suppose the query contains 11 terms and three documents are retrieved by the search. Their respective term weights (Term weight) are given in the following table:

      t1     t2     t3     t4     t5     t6     t7     t8     t9     t10    t11
D1    0      0      .477   0      .477   .176   0      0      0      .176   0
D2    0      .176   0      .477   0      0      0      0      .954   0      .176
D3    0      .176   0      0      0      .176   0      0      0      .176   .176
Q     0      0      0      0      0      .176   0      0      .477   0      .176
Computed this way, document 2 has the highest relevance to the query and is returned first, followed by document 1, and finally document 3.
At this point, we can find the documents we want most.
Having said so much, we have in fact not yet touched Lucene itself; this is merely the basic theory of information retrieval (Information retrieval). But once we look at Lucene, we will find that Lucene is an excellent practical realization of this basic theory. So in the later articles analyzing Lucene, we will often see the theory above applied in Lucene.
Before entering Lucene, here is a summary of the index creation and search processes described above, as shown in the figure.
This picture comes from the article "Open Source Full-Text Search Engine Lucene" at http://www.lucene.com.cn/about.htm.
1. The indexing process:
1) There are a series of files to be indexed.
2) The files to be indexed go through syntactic analysis and linguistic processing to form a series of terms (Term).
3) Index creation forms a dictionary and an inverted index table.
4) The index is written to the hard disk through index storage.
2. The search process:
a) The user enters a query.
b) The query goes through syntactic analysis and linguistic analysis to yield a series of terms (Term).
c) A query tree is obtained through syntactic analysis.
d) The index is read from storage into memory.
e) The query tree is used to search the index: the posting list of each term (Term) is fetched, and intersections and differences are taken over the posting lists to obtain the result documents.
f) The found documents are sorted by their relevance to the query.
g) The query results are returned to the user.
Now we can enter the world of Lucene.
1.2 Lucene Study Summary 2: The Overall Structure of Lucene
Posted: 2009-12-11
This article is also available on CSDN at http://blog.csdn.net/forfuture1978/archive/2009/10/30/4745802.aspx
In general, Lucene:
• is an efficient, scalable, full-text search library;
• is implemented entirely in Java, with no configuration required;
• supports only the indexing (Indexing) and search (Search) of plain text;
• is not responsible for parsing other file formats into plain text, nor for crawling files from the network.
In "Lucene in Action", the Lucene architecture and process are illustrated as below,
showing that Lucene consists of the two processes of indexing and searching, and touches on three elements: index creation, the index itself, and search.
Let us look at the various components of Lucene in a bit more detail:
• A document to be indexed is represented by a Document object.
• IndexWriter, through its addDocument function, adds documents to the index, implementing the index creation process.
• The Lucene index is an application of the inverted index.
• When the user makes a request, a Query object represents the user's query.
• IndexSearcher, through its search function, searches the Lucene Index.
• IndexSearcher computes the term weights and scores and returns the results to the user.
• The document collection returned to the user is represented by TopDocsCollector.
So how are these components used?
Let us look in detail at the Lucene API calls that implement the indexing and search processes.
• The indexing process is as follows:
◦ Create an IndexWriter to write the index files. It takes several parameters: INDEX_DIR is the location where the index files are stored, and the Analyzer is used for lexical analysis and linguistic processing of the documents.
◦ Create a Document to represent the document we want to index.
◦ Add different Fields to the document. We know a document carries many kinds of information, such as title, author, modification time, and contents; different types of information are represented by different Fields. In this example there are two types of information in the index: one is the file path, the other the file contents, where a FileReader over SRC_FILE represents the source file to be indexed.
◦ Call IndexWriter's addDocument function to write the index into the index folder. (A code sketch of these calls follows below.)
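A minimal sketch of those indexing calls against the Lucene 3.0 API; the INDEX_DIR and SRC_FILE paths are placeholders standing in for whatever values the original example used:

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexerDemo {
    public static void main(String[] args) throws Exception {
        File INDEX_DIR = new File("index");         // where the index files are written
        File SRC_FILE = new File("docs/file1.txt"); // the source file to index

        IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
            new StandardAnalyzer(Version.LUCENE_30),
            true, IndexWriter.MaxFieldLength.LIMITED);
        Document doc = new Document();
        doc.add(new Field("path", SRC_FILE.getPath(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", new FileReader(SRC_FILE))); // tokenized, not stored
        writer.addDocument(doc);
        writer.close();
    }
}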
• The search process is as follows:
◦ IndexReader reads the index information from disk into memory; INDEX_DIR is the location where the index files are stored.
◦ Create an IndexSearcher, ready to search.
◦ Create an Analyzer, used for lexical analysis and linguistic processing of the query.
◦ Create a QueryParser, used to parse the query.
◦ QueryParser calls the parser to perform syntactic analysis, forming a query syntax tree that is placed into a Query object.
◦ IndexSearcher calls the search function to search with the Query syntax tree, obtaining the results in a TopScoreDocCollector. (A code sketch of these calls follows below.)
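And a matching sketch of the search calls, again against the Lucene 3.0 API, with "contents" as the assumed field name:

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearcherDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("index")), true); // read-only
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
        Query query = parser.parse("lucene AND learned NOT hadoop");
        TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
        searcher.search(query, collector);
        for (ScoreDoc hit : collector.topDocs().scoreDocs) {
            Document d = searcher.doc(hit.doc);
            System.out.println(d.get("path") + "  score=" + hit.score);
        }
        searcher.close();
        reader.close();
    }
}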
The above is a simple invocation of the Lucene API.
However, when you look into Lucene's source code, you find that the relationships among Lucene's packages are rather complicated.
But from the following diagram we can easily see that all of Lucene's source modules serve as an implementation of the general indexing and search processes.
This diagram shows the Lucene package structure corresponding to the full-text search process described above. (Cf. the article "Open Source Full-Text Search Engine Lucene" at http://www.lucene.com.cn/about.htm.)
• Lucene's analysis module is mainly responsible for lexical analysis and linguistic processing, forming Terms.
• Lucene's index module is mainly responsible for index creation; IndexWriter lives there.
• Lucene's store module is mainly responsible for reading and writing the index.
• Lucene's QueryParser is mainly responsible for syntactic analysis of queries.
• Lucene's search module is mainly responsible for searching the index.
• Lucene's similarity module is mainly responsible for implementing relevance scoring.
With the overall structure of Lucene understood, we can begin our tour of Lucene's source code.
1.3 Lucene Study Summary 3: The Lucene Index File Format (1)
Posted: 2009-12-11
This article is also available on CSDN at http://blog.csdn.net/forfuture1978/archive/2009/12/10/4981893.aspx
What the Lucene index stores and how it stores it, that is, the Lucene index file format, is one of the keys to reading Lucene's source code.
When we truly enter the Lucene source code, we will find:
• Lucene's indexing process is the process of writing an inverted index into this file format, following the basic indexing process of full-text search.
• Lucene's search process is the process of reading the indexed information back out of this file format and then computing each document's score (score).
This article is an interpretation of Apache Lucene - Index File Formats (http://lucene.apache.org/java/2_9_0/fileformats.html).
1. Basic concepts
The picture below shows an example of an index generated by Lucene:
Lucene's index structure is hierarchical, mainly divided into the following levels:
• Index (Index):
◦ In Lucene, an index is placed in one folder.
◦ As in the picture above, all the files in the same folder constitute one Lucene index.
• Segment (Segment):
◦ An index can contain multiple segments; segments are independent of each other. Adding a new document can produce a new segment, and different segments can be merged.
◦ As in the picture above, files with the same prefix belong to the same segment; the figure shows two segments, "_0" and "_1".
◦ segments.gen and segments_5 are segment metadata files; they store attribute information about the segments.
• Document (Document):
◦ The document is the basic unit from which we build the index; different documents are saved in different segments, and one segment can contain multiple documents.
◦ Newly added documents are kept in a single newly generated segment; as segments are merged, different documents are merged into one segment.
• Field (Field):
◦ A document contains different types of information, which can be indexed separately, such as title, time, body, and author; these can be stored in different fields.
◦ Different fields can be indexed in different ways, which we will explain in detail when analyzing how fields are actually stored.
• Term (Term):
◦ The term is the smallest unit of the index; it is a string resulting from lexical analysis and linguistic processing.
The Lucene index stores both forward information and reverse information.
Forward information means:
• The index stores, level by level, the chain of containment relations from the index down to the term: Index (Index) -> Segment (Segment) -> Document (Document) -> Field (Field) -> Term (Term).
• That is: the index contains which segments, each segment contains which documents, each document contains which fields, and each field contains which terms.
• Since this is a hierarchical structure, each level stores the information of its own level plus the metadata, i.e., attribute information, of the next level. It is like a book on China's geography: it should first introduce the overview of China's geography and how many provinces China contains; each province then introduces its own basic situation and how many cities it contains; each city introduces its basic situation and how many counties it contains; and each county describes its own specific situation.
• As shown in the picture above, the files containing forward information are:
◦ segments_N stores how many segments this index contains and how many documents each segment contains.
◦ XXX.fnm stores how many fields this segment contains, along with each field's name and indexing mode.
◦ XXX.fdx and XXX.fdt store all the documents this segment contains, how many fields each document contains, and what information each field stores.
◦ XXX.tvx, XXX.tvd, and XXX.tvf store how many documents this segment contains, how many fields each document contains, how many terms each field contains, and each term's string, position, and other information.
Reverse information means:
• The mapping from dictionary to posting lists is stored: Term (Term) -> Document (Document).
• As shown in the picture above, the files containing reverse information are:
◦ XXX.tis and XXX.tii store the term dictionary (Term Dictionary), that is, all the terms this segment contains, sorted in dictionary order.
◦ XXX.frq stores the posting lists, that is, the list of IDs of the documents containing each term.
◦ XXX.prx stores the positions at which each term appears within each document of its posting list.
Before understanding the detailed structure of the Lucene index, let us first look at the basic data types used in the Lucene index.
2. Basic types
The Lucene index files use the following basic types to store information:
• Byte: the most basic type, 8 bits (bit) long.
• UInt32: composed of 4 Bytes.
• UInt64: composed of 8 Bytes.
• VInt:
◦ A variable-length integer type. It may contain multiple Bytes; in each Byte, the lower 7 bits carry the value, and the highest bit indicates whether another Byte follows: 0 means none, 1 means one follows.
◦ Earlier Bytes hold the low-order bits of the value; later Bytes hold the high-order bits.
◦ For example, 130 in binary is 1000 0010, which needs 8 bits, so one Byte cannot hold it and two Bytes are needed. The first Byte holds the lower 7 bits, with its highest bit set to 1 to indicate that another Byte follows, giving (1) 0000010; the second Byte holds the 8th bit, with its highest bit set to 0 to indicate that no further Byte follows, giving (0) 0000001.
• Chars: Bytes storing characters in UTF-8 encoding.
• String: a string starts with a VInt giving the number of characters the string contains, followed by the UTF-8 encoded character sequence of Chars. (A code sketch of VInt encoding follows below.)
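The VInt rule above is simple to express in code. The following is a minimal Java sketch of the encoding and decoding just described; the helper methods are illustrative, not Lucene's actual classes, although Lucene's IndexOutput/IndexInput implement essentially this logic:

import java.io.*;

public class VIntDemo {
    // Write i as a VInt: 7 value bits per byte, high bit = "another byte follows".
    static void writeVInt(OutputStream out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {        // more than 7 significant bits remain
            out.write((i & 0x7F) | 0x80); // emit low 7 bits with continuation flag
            i >>>= 7;
        }
        out.write(i);                     // final byte, high bit 0
    }

    // Read a VInt back, reassembling 7 bits at a time, low-order bytes first.
    static int readVInt(InputStream in) throws IOException {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeVInt(buf, 130); // stored as the two bytes (1)0000010, (0)0000001
        System.out.println(readVInt(new ByteArrayInputStream(buf.toByteArray()))); // 130
    }
}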
3. Basic rules
To make the stored information occupy less space and be read faster, Lucene applies some special techniques. When reading the Lucene file format, however, these techniques can easily confuse us, so it is worth first extracting these special techniques and introducing them as rules.
In what follows, these rules are casually given a few names, purely to make them easy to refer to when they are applied later; do not read too much into the names themselves.
1. The prefix-suffix rule (Prefix + Suffix)
In the reverse part of the Lucene index, the term dictionary (Term Dictionary) must be stored, and all the terms (Term) in the dictionary are arranged in dictionary order. The dictionary, however, contains almost every word in the documents, and some terms are very long, which would make the index file very large. The so-called prefix-suffix rule is: when a term shares a common prefix with the preceding term, the later term stores only the length of the common prefix (the offset) and the string beyond the prefix (the suffix).
For example, suppose the following terms are to be stored: term, termagancy, termagant, terminal.
Stored the ordinary way, the space required is as follows:
[VInt = 4] [t] [e] [r] [m], [VInt = 10] [t] [e] [r] [m] [a] [g] [a] [n] [c] [y], [VInt = 9] [t] [e] [r] [m] [a] [g] [a] [n] [t], [VInt = 8] [t] [e] [r] [m] [i] [n] [a] [l]
for a total of 35 Bytes.
Applying the prefix-suffix rule, the space required is as follows:
[VInt = 4] [t] [e] [r] [m], [VInt = 4 (offset)] [VInt = 6] [a] [g] [a] [n] [c] [y], [VInt = 8 (offset)] [VInt = 1] [t], [VInt = 4 (offset)] [VInt = 4] [i] [n] [a] [l]
for a total of 22 Bytes.
The storage space is much smaller, and since the dictionary is stored in sorted order, the rate of prefix overlap rises greatly.
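A minimal sketch of writing one term under this rule, reusing the writeVInt helper sketched earlier; this is illustrative only, as Lucene's actual dictionary writer is more involved:

// Write `term`, given the previously written term, as prefix length + suffix.
static void writeTerm(OutputStream out, String prev, String term) throws IOException {
    int prefix = 0;
    int max = Math.min(prev.length(), term.length());
    while (prefix < max && prev.charAt(prefix) == term.charAt(prefix)) {
        prefix++;                        // length of the shared prefix
    }
    String suffix = term.substring(prefix);
    writeVInt(out, prefix);              // the offset
    writeVInt(out, suffix.length());     // suffix length
    out.write(suffix.getBytes("UTF-8")); // suffix characters
}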
2. The delta rule (Delta)
In the reverse part of the Lucene index, a great deal of integer information must be stored, such as document ID numbers and the positions of terms (Term) within documents.
From the description above, we know that integers are stored in the VInt format: as the value grows, the number of Bytes it occupies gradually increases. The so-called delta rule (Delta) is: when storing two integers, the later integer stores only its difference from the preceding integer.
For example, suppose the following integers are to be stored: 16386, 16387, 16388, 16389.
Stored the ordinary way, the space required is as follows:
[(1) 000, 0010] [(1) 000, 0000] [(0) 000, 0001], [(1) 000, 0011] [(1) 000, 0000] [(0) 000, 0001], [(1) 000, 0100] [(1) 000, 0000] [(0) 000, 0001], [(1) 000, 0101] [(1) 000, 0000] [(0) 000, 0001]
for a total of 12 Bytes.
Applying the delta rule, the space required is as follows:
[(1) 000, 0010] [(1) 000, 0000] [(0) 000, 0001], [(0) 000, 0001], [(0) 000, 0001], [(0) 000, 0001]
for a total of 6 Bytes.
The storage space is much smaller, and since document IDs, like the positions of terms within documents, are arranged in increasing order and grow gradually, this rule suits them well.
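In code, delta encoding is a one-line transformation applied before VInt writing; here is a small illustrative sketch, again using the writeVInt helper from above:

// Delta-encode a sorted list of document IDs: store gaps, not absolute values.
static void writeDocIDs(OutputStream out, int[] sortedDocIDs) throws IOException {
    int prev = 0;
    for (int docID : sortedDocIDs) {
        writeVInt(out, docID - prev); // small gaps take fewer VInt bytes
        prev = docID;
    }
}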
3. The optional-follow rule (A, B?)
In the Lucene index structure there are situations where a value A may or may not be followed by a value B, requiring a flag to indicate whether B follows.
Ordinarily, one would place a Byte after A: 0 meaning no B follows, 1 meaning B follows (or the reverse).
But that wastes a whole Byte of space when one Bit is really enough.
Lucene therefore does the following: the value A is shifted left by one bit, vacating the lowest bit, which serves as the flag for whether B follows. In this case, therefore, A/2 is the true original value of A. (A sketch follows below.)
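A minimal generic sketch of this bit trick; it is illustrative only (in Lucene's .frq file, for instance, the flag actually encodes whether Freq equals 1 and can be omitted, but the mechanism is the same):

// Pack A and a "B follows" flag into one VInt, then optionally write B.
static void writeAB(OutputStream out, int a, boolean hasB, int b) throws IOException {
    writeVInt(out, (a << 1) | (hasB ? 1 : 0)); // the low bit is the flag
    if (hasB) {
        writeVInt(out, b);
    }
}
// When reading: flag = value & 1; the true A = value >>> 1 (i.e., A/2).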
If you read the Apache Lucene - Index File Formats article, you will find many applications of this rule:
• in the .frq file: DocDelta[, Freq?], DocSkip, PayloadLength?
• in the .prx file: PositionDelta, Payload? (though not exclusively; see the analysis below)
Of course, there are also some values marked with ? that do not follow this rule:
• SkipChildLevelPointer? in the .frq file: this belongs to the multi-level skip table, a pointer to the next level of the table; at the lowest level this value does not exist, and no flag is needed.
• Positions? and Offsets? in the .tvf file.
◦ In such cases, whether the ?-marked value exists does not depend on the lowest bit of the preceding value;
◦ rather, it depends on some configuration of Lucene, which is of course also stored in the Lucene index files.
◦ For example, whether Position and Offset are stored depends on the per-field configuration in the .fnm file (TermVector.WITH_POSITIONS and TermVector.WITH_OFFSETS).
Why do both cases exist? It is actually understandable:
• For values that follow the optional-follow rule, whether B exists may differ for every single A; since this occurs so often, encoding the flag in one Bit rather than one Byte saves space eightfold and is well worth it.
• Values that do not follow the rule exist or not according to a configuration that holds for an entire field (Field), or even the entire index, rather than differing for each individual occurrence, so they can be governed by a single shared flag.
The article's description of the following format is puzzling:
Positions -> <PositionDelta, Payload?> ^Freq
Payload -> <PayloadLength?, PayloadData>
PositionDelta and Payload fit the optional-follow rule, but then how is the existence of PayloadLength determined?
In fact, the relation of Payload to PositionDelta does not follow the optional-follow rule; whether Payload exists is determined by the per-field Payload configuration in the .fnm file (FieldOption.STORES_PAYLOADS).
When Payload does not exist, PayloadLength itself neither follows nor needs the optional-follow rule.
When Payload does exist, the format effectively becomes: Positions -> <PositionDelta, PayloadLength?, PayloadData> ^Freq,
and then it is PositionDelta and PayloadLength together that fit the optional-follow rule.
4. The skip list rule (Skip list)
To improve search performance, Lucene uses the skip list data structure in many places.
A skip list (Skip List), as shown in the figure, is a data structure with the following basic features:
• The elements are ordered: in Lucene, either in dictionary order or in increasing numeric order.
• Jumps have an interval (Interval), that is, the number of elements per jump; the interval is configured in advance. The skip list in the figure has an interval of 3.
• A skip list has levels (level); each level consists of every interval-th element of the level below. The skip list in the figure has 2 levels.
Note that skip lists are described in many books on data structures and algorithms with similar principles but slightly different definitions:
• The definition of the interval (Interval): in the figure, some take the interval to be 2, the number of elements between two upper-level elements, excluding both upper-level elements; some take it to be 3, the distance between the two upper-level elements, including the following upper-level element but not the preceding one; some take it to be 4, the elements between the two upper-level elements plus both the preceding and the following upper-level element. Lucene takes the latter definition.
• The definition of levels (Level): in the figure, some hold that the original linked list should be counted, starting from 1, giving 3 levels numbered 1, 2, 3; some hold that the original list should be counted, starting from 0, giving levels 0, 1, 2; some hold that the original list should not be counted, starting from 1, giving levels 1, 2; some hold that the original list should not be counted, starting from 0, giving levels 0, 1. Lucene takes the last of these definitions.
Compared with sequential search, the skip list greatly improves search speed. To find element 72, for example, a sequential search must visit 2, 3, 7, 12, 23, 37, 39, 44, 50, 72, ten elements in all. With the skip list, first visit 50 on level 1 and find that 72 is greater than 50 and that 50 has no next node on level 1; then visit 94 on level 2 and find that 94 is greater than 72; then visit 72 in the original list and find the element. Only three elements need to be visited in all. A small sketch of this search appears below.
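Purely to make the walkthrough concrete, here is a small linked-node sketch of that search in Java; it illustrates the textbook structure only, not Lucene's actual skip-list reader and writer classes:

// A node has a `next` link within its level and a `down` link to the level below.
class SkipNode {
    int value;
    SkipNode next, down;
    SkipNode(int value) { this.value = value; }
}

class SkipListDemo {
    // Start at the top-left node: move right while the next value does not
    // overshoot the target, otherwise drop down one level; at the bottom
    // level `down` is null and the search ends.
    static boolean contains(SkipNode topHead, int target) {
        SkipNode node = topHead;
        while (node != null) {
            if (node.value == target) return true;
            if (node.next != null && node.next.value <= target) {
                node = node.next; // jump forward on this level
            } else {
                node = node.down; // descend to the finer level
            }
        }
        return false;
    }
}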
However, Lucene's concrete implementation differs from the theory in certain ways, which will be detailed when the concrete formats are discussed.
1.4 Lucene Study Summary 3: The Lucene Index File Format (2)
Posted: 2009-12-12
This article is also available on CSDN at http://blog.csdn.net/forfuture1978/archive/2009/12/10/4976793.aspx
4. Concrete formats
As explained above, Lucene stores forward information going from Index to Segment to Document to Field and down to Term, as well as reverse information mapping from Term back to Document, plus some other Lucene-specific information. The following describes these three kinds of information one by one.
4.1. Forward information
Index -> Segments (segments.gen, segments_N) -> Field (fnm, fdx, fdt) -> Term (tvx, tvd, tvf)
The hierarchy above is not entirely accurate, because segments.gen and segments_N store the metadata (Metadata) of the segments (Segment), of which there is in fact one set per Index, while the real data of the segments is stored in the field (Field) and term (Term) files.
4.1.1. Segment metadata files (segments_N)
An index (Index) can have multiple segments_N files (how there come to be multiple segments_N will be illustrated with an example after the format description is complete), but when we want to open an index we must choose one of them to open. How is the segments_N chosen?
Lucene uses the following procedure:
• First, among all the segments_N files, select the one with the largest N. The basic logic is in SegmentInfos.getCurrentSegmentGeneration(File[] files); its basic idea is to select, among all files that begin with "segments" but are not segments.gen, the largest N as genA.
• Second, open segments.gen, which holds the current value of N. Its format is as follows: read the version number (Version), then read two Ns; if the two are equal, take the value as genB.
IndexInput genInput = directory.openInput(IndexFileNames.SEGMENTS_GEN); // "segments.gen"
int version = genInput.readInt(); // read the version number
if (version == FORMAT_LOCKLESS) { // if the version number is correct
    long gen0 = genInput.readLong(); // read the first N
    long gen1 = genInput.readLong(); // read the second N
    if (gen0 == gen1) { // if the two are equal, take it as genB
        genB = gen0;
    }
}
• Third, of genA and genB, select the larger as the current N, and open that segments_N file. The basic logic is as follows:
if (genA > genB)
    gen = genA;
else
    gen = genB;
The figure below shows the concrete format of segments_N:
• Format:
◦ The version number of the index file format.
◦ Because Lucene is under continuous development, different versions of Lucene have different index file formats, hence a version number.
◦ In Lucene 2.1 this value is -3; in Lucene 2.9 this value is -9.
◦ When an index generated by one version is read with another version's IndexReader, this value differs and an error results.
• Version:
◦ The version number of the index, recording how many times an IndexWriter has committed document modifications to the index.
◦ Its initial value is in most cases read from the index file; only when the index is first created is it given the current time, which yields a unique value.
◦ Its value is incremented along IndexWriter.commit -> IndexWriter.startCommit -> SegmentInfos.prepareCommit -> SegmentInfos.write -> writeLong(++version).
◦ The initial value is a timestamp because we do not care exactly how many times the index has been modified by an IndexWriter, but only which state is the latest. An IndexReader often compares its own version with the one in the index files to determine whether, since this IndexReader was opened, the index has been updated by an IndexWriter.
// This function is in DirectoryReader.
public boolean isCurrent() throws CorruptIndexException, IOException {
    return SegmentInfos.readCurrentVersion(directory) == segmentInfos.getVersion();
}
• NameCount:
◦ The segment name for the next newly generated segment (Segment).
◦ All index files belonging to the same segment take the segment name as their file name, typically _0.xxx, _0.yyy, _1.xxx, _1.yyy, ...
◦ The name of a newly generated segment is usually the largest existing segment name plus one.
◦ In the index shown, the NameCount read out is 2, indicating the new segment will be named _2.xxx, _2.yyy.
• SegCount:
◦ The number of segments (Segment).
◦ In the picture above, this value is 2.
• SegCount sets of segment metadata, each of which contains:
◦ SegName
▪ The segment name; all files belonging to this segment take the segment name as their file name.
▪ As shown above, the first segment is named "_0" and the second segment "_1".
◦ SegSize
▪ The number of documents contained in this segment.
▪ This count, however, also includes deleted documents, as long as the segment has not been optimized; before optimize, a Lucene segment contains all the documents ever indexed into it, and the deleted documents are recorded in the .del file. During search, the documents of the segment are read first, and the deleted ones are then filtered out using the .del flags.
▪ The code below produces the index shown in the figure: two documents are indexed to form segment _0, then one of them is deleted, forming _0_1.del; another two documents are indexed to form segment _1, and one of them is deleted, forming _1_1.del. So in both segments this value is 2.
IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
writer.setUseCompoundFile(false);
indexDocs(writer, docDir); // docDir contains only two documents
// Document one is: Students should be allowed to go out with their friends, but not allowed to drink beer.
// Document two is: My friend Jerry went to school to see his students but found them drunk which is not allowed.
writer.commit(); // commit the two documents, forming segment _0
writer.deleteDocuments(new Term("contents", "school")); // delete document two
writer.commit(); // commit the deletion, forming _0_1.del
indexDocs(writer, docDir); // index the two documents again; Lucene cannot tell they are the same documents, so they count as two new documents
writer.commit(); // commit the two documents, forming segment _1
writer.deleteDocuments(new Term("contents", "school")); // delete the second of the newly added two documents
writer.close(); // commit the deletion, forming _1_1.del
◦ DelGen
▪ The version number of the .del file. Before optimize, deleted documents are recorded in .del files.
▪ In Lucene 2.9, documents can be deleted in the following ways:
▪ IndexReader.deleteDocument(int docID): delete by document number through IndexReader.
▪ IndexReader.deleteDocuments(Term term): delete, through IndexReader, the documents containing the term (Term).
▪ IndexWriter.deleteDocuments(Term term): delete, through IndexWriter, the documents containing the term (Term).
▪ IndexWriter.deleteDocuments(Term[] terms): delete, through IndexWriter, the documents containing the terms (Term).
▪ IndexWriter.deleteDocuments(Query query): delete, through IndexWriter, the documents satisfying the query (Query).
▪ IndexWriter.deleteDocuments(Query[] queries): delete, through IndexWriter, the documents satisfying the queries (Query).
▪ In older versions of Lucene, deletion was done exclusively through IndexReader. Although in Lucene 2.9 it can be done through IndexWriter, the real implementation is still that the IndexWriter keeps a readerpool; when the IndexWriter commits the deletions to the index files, it is still the corresponding IndexReaders in the readerpool that perform the deletion, as the following call chain illustrates:
IndexWriter.applyDeletes()
-> DocumentsWriter.applyDeletes(SegmentInfos)
-> reader.deleteDocument(doc);

▪ Each time IndexWriter commits a delete operation to the index files, DelGen is incremented by one and a new .del file is generated:
IndexWriter.commit()
-> IndexWriter.applyDeletes()
-> IndexWriter$ReaderPool.release(SegmentReader)
-> SegmentReader(IndexReader).commit()
-> SegmentReader.doCommit(Map)
-> SegmentInfo.advanceDelGen()
-> if (delGen == NO) {
     delGen = YES;
   } else {
     delGen++;
   }
IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
writer.setUseCompoundFile(false);
indexDocs(writer, docDir); // index two documents, one containing "school", the other containing "beer"
writer.commit(); // commit the two documents to the index files, forming segment "_0"
writer.deleteDocuments(new Term("contents", "school")); // delete the document containing "school", i.e., one of the two
writer.commit(); // commit the deletion, forming "_0_1.del"
writer.deleteDocuments(new Term("contents", "beer")); // delete the document containing "beer", i.e., the other of the two
writer.commit(); // commit the deletion, forming "_0_2.del"
indexDocs(writer, docDir); // index the same two documents again; Lucene cannot tell, so they count as two new documents
writer.commit(); // commit the two documents to the index files, forming segment "_1"
writer.deleteDocuments(new Term("contents", "beer")); // delete the new document containing "beer": segment "_0" gets no new deletion, segment "_1" has one document deleted
writer.close(); // commit the deletion, forming "_1_1.del"
The index files formed are, as the original figure shows: segment _0 with deletion files _0_1.del and _0_2.del, and segment _1 with deletion file _1_1.del.
◦ DocStoreOffset
◦ DocStoreSegment
◦ DocStoreIsCompoundFile
▪ Stored fields (Stored Field) and term vectors (Term Vector) can be stored in two different ways: each segment (Segment) can store its own field and term vector information separately, or multiple segments can share field and term vector information, storing it together in one segment.
▪ If DocStoreOffset is -1, this segment stores its own field and term vector information separately. If the segment is named XXX, it has its own XXX.fdt, XXX.fdx, XXX.tvf, XXX.tvd and XXX.tvx files. DocStoreSegment and DocStoreIsCompoundFile are not saved in this case.
▪ If DocStoreOffset is not -1, DocStoreSegment holds the name of the shared segment, say YYY, and DocStoreOffset is the offset of this segment's field and term vector information within the shared segment. This segment then has no XXX.fdt, XXX.fdx, XXX.tvf, XXX.tvd or XXX.tvx files of its own; its information is stored in the shared segment's YYY.fdt, YYY.fdx, YYY.tvf, YYY.tvd and YYY.tvx files.
▪ DocumentsWriter has two member variables: String segment, the segment in which the current index information is stored, and String docStoreSegment, the segment in which the field and term vector information is stored. Whether the two are the same determines whether field and term vector information is stored in this segment itself or shared with other segments.
▪ The second parameter, flushDocStores, of IndexWriter.flush(boolean triggerMerge, boolean flushDocStores, boolean flushDeletes) controls whether storage is separate or shared. What it ultimately affects is DocumentsWriter.closeDocStore(). As long as flushDocStores is false, closeDocStore() is not called, and the field and term vector information of documents subsequently added to the index shares one segment. Only when flushDocStores becomes true is closeDocStore() called, so that the field and term vector information of subsequently added documents is saved to a new segment and no longer shared. (One oddity of Lucene's implementation should be pointed out here: although the next batch of field and term vector information is saved to a new segment, the segment name used to identify it is the current one, because in initSegmentName, when docStoreSegment == null, it is set to the current segment, docStoreSegment = segment, rather than to the next new segment; this produces the naming phenomenon illustrated in the original figures.)
▪ Fortunately, shared storage of fields and term vectors is not used very often, and its implementation is somewhat quirky, possibly even defective, so we will not explain it further for now. A small sketch of the resolution logic follows.
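To make the DocStore* values concrete, here is a small sketch (illustrative only, not Lucene's actual source; the variable names are assumptions) of how a reader could decide which stored-field and term vector files belong to a segment:

// Sketch: resolving a segment's stored-field / term vector files from the
// DocStore* values described above.
String storeName = (docStoreOffset == -1) ? segmentName : docStoreSegment;
String fdtFile = storeName + ".fdt"; // stored fields data
String fdxFile = storeName + ".fdx"; // stored fields index
String tvxFile = storeName + ".tvx"; // term vector index
// When sharing (docStoreOffset != -1), this segment's documents begin at
// position docStoreOffset within the shared files.
int firstDocInStore = (docStoreOffset == -1) ? 0 : docStoreOffset;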
◦ HasSingleNormFile
▪ During search, the normalization factor (Normalization Factor) affects the final score of a document.
▪ Different documents have different importance, and different fields also have different importance. Thus each field of each document can have its own normalization factor.
▪ If HasSingleNormFile is 1, all the normalization factors are stored in a single .nrm file.
▪ If HasSingleNormFile is not 1, each field has its own normalization factor file .fN.
◦ NumField
▪ The number of fields.
◦ NormGen
▪ If each field has its own normalization factor file, this array holds the version (generation) number of each field's file, i.e., the N in .fN.
◦ IsCompoundFile
▪ Whether the segment is saved as a compound file, i.e., whether all the files of one segment are packed, in a certain format, into a single file, which reduces the number of files opened each time.
▪ Whether to use compound files is set through the interface IndexWriter.setUseCompoundFile(boolean).
▪ The original compares the index files of the two modes in figures:
Non-compound files:
Compound files:
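For illustration (a typical listing, assumed rather than taken from the original figures): in non-compound mode a segment _0 is spread over many files such as _0.fnm, _0.fdx, _0.fdt, _0.tii, _0.tis, _0.frq, _0.prx and _0.nrm, whereas in compound mode the same data is packed into a single _0.cfs file.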

◦ DeletionCount
▪ Records the number of deleted documents in this segment.
◦ HasProx
▪ If at least one segment has omitTf set to false, i.e., term frequency (term frequency) and position information need to be saved, then HasProx is 1, otherwise 0.
◦ Diagnostics
▪ Debugging information.
• User map data
◦ Saves a user-supplied map from string to string (Map<String, String>).
• CheckSum
◦ The checksum of this segments_N file.
The reading of this file format can be followed in SegmentInfos.read(Directory directory, String segmentFileName):
• int format = input.readInt();
• version = input.readLong(); // read version
• counter = input.readInt(); // read counter
• for (int i = input.readInt(); i > 0; i--) // read segmentInfos
◦ add(new SegmentInfo(directory, format, input));
▪ name = input.readString();
▪ docCount = input.readInt();
▪ delGen = input.readLong();
▪ docStoreOffset = input.readInt();
▪ docStoreSegment = input.readString();
▪ docStoreIsCompoundFile = (1 == input.readByte());
▪ hasSingleNormFile = (1 == input.readByte());
▪ int numNormGen = input.readInt();
▪ normGen = new long[numNormGen];
▪ for (int j = 0; j < numNormGen; j++)
▪ normGen[j] = input.readLong();
▪ isCompoundFile = input.readByte();
▪ delCount = input.readInt();
▪ hasProx = input.readByte() == 1;
▪ diagnostics = input.readStringStringMap();
• userData = input.readStringStringMap();
• final long checksumNow = input.getChecksum();
• final long checksumThen = input.readLong();
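Presumably the method then compares the two checksum values; a sketch of that final step (reconstructed from the surrounding logic, not quoted from Lucene's source):

if (checksumNow != checksumThen)
    throw new CorruptIndexException("checksum mismatch in segments file");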
4.1.2. Field (Field) metadata information (.fnm)
A segment (Segment) contains multiple fields, and each field has some metadata information saved in the .fnm file. The .fnm file format is as follows:
• FNMVersion
◦ The version number of the .fnm file; for Lucene 2.9 it is -2.
• FieldsCount
◦ The number of fields.
• An array of fields (Fields)
◦ FieldName: the name of the field, such as "title", "modified", "content" and so on.
◦ FieldBits: a series of flag bits describing the field (a sketch of how they combine follows this list):
▪ Lowest bit: 1 means this field is indexed, 0 means it is not indexed. So-called indexed means put into the inverted table (posting lists).
▪ Only indexed fields can be searched.
▪ Field.Index.NO means not indexed.
▪ Field.Index.ANALYZED means not only indexed but also analyzed (tokenized): for example, after indexing "hello world", searching for either "hello" or "world" finds the document.
▪ Field.Index.NOT_ANALYZED means indexed but not analyzed: after indexing "hello world", only searching for exactly "hello world" finds the document; searching for "hello" alone or "world" alone does not.
▪ A field can be indexed, and it can also be stored. A field that is only stored cannot be searched, but can be retrieved by document number. This is mostly used for fields that should not be findable by search themselves, but that should be returned to the user, by document number, when the document is found through other fields.
▪ Field.Store.YES means store this field; Field.Store.NO means do not store it.
▪ Second-lowest bit: 1 means term vectors are saved, 0 means they are not.
▪ Field.TermVector.YES means save term vectors.
▪ Field.TermVector.NO means do not save term vectors.
▪ Third-lowest bit: 1 means position information is saved in the term vectors.
▪ Field.TermVector.WITH_POSITIONS
▪ Fourth-lowest bit: 1 means offset information is saved in the term vectors.
▪ Field.TermVector.WITH_OFFSETS
▪ Fifth-lowest bit: 1 means normalization factors are not saved.
▪ Field.Index.ANALYZED_NO_NORMS
▪ Field.Index.NOT_ANALYZED_NO_NORMS
▪ Sixth-lowest bit: 1 means payloads are saved.
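As promised above, a small sketch of how these flag bits might be assembled when field metadata is written (the bit values follow the layout just described; this is an illustration, not Lucene's actual FieldInfos code, and output is an assumed IndexOutput for the .fnm file):

// Illustrative only: assembling the FieldBits byte from the flags above.
byte bits = 0;
if (isIndexed)                    bits |= 0x01; // lowest bit: indexed
if (storeTermVector)              bits |= 0x02; // term vectors saved
if (storePositionsWithTermVector) bits |= 0x04; // positions in term vectors
if (storeOffsetsWithTermVector)   bits |= 0x08; // offsets in term vectors
if (omitNorms)                    bits |= 0x10; // norms NOT saved
if (storePayloads)                bits |= 0x20; // payloads saved
output.writeByte(bits); // assumed .fnm output stream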
To understand the field metadata information, one also needs to understand the following:
• Position (Position) and offset (Offset)
◦ Position is counted in words (Terms), while offset is counted in letters or Chinese characters. For example, in "hello world", the term "world" is the second term by position, but its character offsets are 6 to 11.
• Indexed fields (Indexed) and stored fields (Stored)
◦ Why would a field be stored (store) without being indexed (Index)? Among all the information in a document, there is some that we may not want to be indexed and thus searchable, but that should be returned along with the rest when the document is found through other information.
◦ For example: as a graduate student, you finally write a dissertation and give it to your supervisor, who makes himself the first author and you the second, and who does not want others to find this paper when they search the paper system by your name. So in the paper system, the Indexed property of the second-author field is set to false. Then when other people search for your name, they never learn that you wrote this paper; only when they search for your supervisor's name do they find your article, with your name declared as second author in a corner. A sketch of such a document follows.
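A minimal sketch of this example (field names and values are assumptions for illustration): the first author is indexed and stored, while the second author is stored only, so it is returned with the document but can never be matched by a search.

Document paper = new Document();
paper.add(new Field("title", "My Dissertation", Field.Store.YES, Field.Index.ANALYZED));
// First author: indexed, so searching this name finds the paper.
paper.add(new Field("author1", "Supervisor Name", Field.Store.YES, Field.Index.ANALYZED));
// Second author: stored but NOT indexed -- returned with the document,
// but searching for this name will never find the paper.
paper.add(new Field("author2", "Student Name", Field.Store.YES, Field.Index.NO));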
• The use of payloads (Payload)
◦ We know that an index is saved in the form of an inverted table: for each word (Term), a posting list of the documents containing it is saved, and to speed up queries, the list is equipped with skip lists.
◦ Payload information is stored in the posting list, saved together with the document numbers, and is used to store some information related to each document. Of course, this information could also be stored in a stored field (Stored Field); functionally the two are basically the same. But when this information has to be consulted many times during a search, storing it in the posting list, helped by the skip lists, greatly improves search speed.
◦ Payloads are stored in the posting list, attached to each document entry, as the original figure shows.

◦ Payloads have the following main uses:
▪ Storing information carried by every document: for example, sometimes we want to assign each document a document number of our own, rather than use Lucene's own document numbers. We can declare a special field (Field) "_ID" and a special word (Term) "_ID", so that every document contains the word "_ID". In the posting list of the word "_ID", each document entry carries a payload in which we save our own document number. Whenever we obtain a Lucene document number, we can look up our own document number from this posting list.
// Declare a special field and a special word
public static final String ID_PAYLOAD_FIELD = "_ID";
public static final String ID_PAYLOAD_TERM = "_ID";
public static final Term ID_TERM = new Term(ID_PAYLOAD_TERM, ID_PAYLOAD_FIELD);
// Declare a special TokenStream that generates only one word (Term): the special word in the special field.
static cl
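The author's TokenStream class is cut off at this point. A minimal sketch of what such a single-term payload TokenStream could look like with the Lucene 2.9/3.0 attribute API (a reconstruction under those assumptions, not the author's original code):

static class SinglePayloadTokenStream extends TokenStream {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private byte[] idBytes;
    private boolean done = true;

    // Set our own document number before adding each document.
    void setId(int id) {
        idBytes = new byte[] {
            (byte) (id >>> 24), (byte) (id >>> 16), (byte) (id >>> 8), (byte) id };
        done = false;
    }

    @Override
    public boolean incrementToken() {
        if (done) return false; // emit exactly one token per document
        clearAttributes();
        termAtt.setTermBuffer(ID_PAYLOAD_TERM);      // the special word "_ID"
        payloadAtt.setPayload(new Payload(idBytes)); // carry our own document number
        done = true;
        return true;
    }
}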
