Factors that affect Lucene indexing speed, and tips for improving it

2010-07-15  Source: original to this site  Category: Internet

Keywords: Lucene, indexing speed, factors, tips. I read a foreign-language article online that describes techniques for improving Lucene indexing speed, and I am passing it on here.
First, let's look at the main factors that affect indexing speed:

MaxBufferedDocs
This parameter sets how many documents are buffered in the in-memory index. Once that number is reached, the in-memory index is written to disk as a new index segment file.
It is effectively a memory buffer: in general, the larger it is, the faster indexing runs.
MaxBufferedDocs is disabled by default, because Lucene uses another parameter (RAMBufferSizeMB) to control the index buffer.
The two parameters can in fact be used together. In that case, the buffer is written to disk as a new segment file as soon as either trigger condition is met.

RAMBufferSizeMB
This parameter limits the memory used to buffer indexed documents. When the index buffer reaches the limit, the buffered documents are written to disk. Here too, in general, the larger the value, the faster indexing runs.
This parameter is very useful when document sizes are not known in advance, and it helps avoid OutOfMemory errors.
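
The "whichever limit is hit first" behaviour of the two buffer limits can be illustrated with a small model. This is a sketch, not Lucene code: the class FlushModel and its members are hypothetical names, document sizes are supplied by the caller, and the model only counts how many segments a given pair of limits would produce.

```java
// Simplified model (not Lucene code) of the "whichever limit is hit first" flush rule.
final class FlushModel {
    private final int maxBufferedDocs;   // document-count limit; <= 0 means disabled
    private final long ramBufferBytes;   // memory limit in bytes; <= 0 means disabled
    private int bufferedDocs = 0;
    private long bufferedBytes = 0;
    private int segmentsWritten = 0;

    FlushModel(int maxBufferedDocs, long ramBufferBytes) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.ramBufferBytes = ramBufferBytes;
    }

    // Buffer one document of the given size, then flush if either limit is reached.
    void addDocument(long docBytes) {
        bufferedDocs++;
        bufferedBytes += docBytes;
        boolean docLimitHit = maxBufferedDocs > 0 && bufferedDocs >= maxBufferedDocs;
        boolean ramLimitHit = ramBufferBytes > 0 && bufferedBytes >= ramBufferBytes;
        if (docLimitHit || ramLimitHit) {
            flush();
        }
    }

    // Write the buffer out as one new segment and reset the counters.
    void flush() {
        if (bufferedDocs == 0) return;
        segmentsWritten++;
        bufferedDocs = 0;
        bufferedBytes = 0;
    }

    int segmentsWritten() { return segmentsWritten; }
}
```

In this model, 100 documents of 1,000 bytes each with a document limit of 10 produce 10 segments. Adding a 5,000-byte memory limit on top makes the buffer flush every 5 documents, giving 20 smaller segments.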

MergeFactor
This parameter controls the merging of sub-indexes (segments).
Roughly speaking, Lucene indexing proceeds like this: documents are first written to an in-memory index; when one of the limits is triggered, the in-memory index is written to disk as an independent sub-index, which Lucene calls a segment. In general these segments need to be merged into a single index, which is what optimize() does; otherwise search speed suffers and too many files may end up open.
MergeFactor controls how many segments may accumulate on disk before they are merged into a slightly larger index.
MergeFactor should not be set too high, especially when MaxBufferedDocs is small (which produces more segments); otherwise it leads to "too many open files" errors, or even to errors outside the virtual machine.

Note: by default, Lucene does not merge segments two at a time. It merges many segments at once into one larger index, so the larger MergeFactor is, the more memory is consumed and the faster indexing runs. In my experience, however, with a large value such as 300 the final merge is still very slow. For batch indexing, MergeFactor should be greater than 10.
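
The trade-off behind the merge factor can be made concrete with a toy model. The sketch below assumes a simple level-based scheme: whenever mergeFactor segments exist at one level, they are merged into a single segment at the next level. This is an assumption for illustration, not Lucene's actual merge policy, and MergeModel is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified level-based merge model (an assumption, not Lucene's actual merge policy).
final class MergeModel {
    // levels.get(i) = number of live segments at level i.
    private final List<Integer> levels = new ArrayList<>();
    private final int mergeFactor;
    private int merges = 0;

    MergeModel(int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        this.mergeFactor = mergeFactor;
    }

    // Record one flushed segment and cascade merges upward while a level is full.
    void flushSegment() {
        int level = 0;
        while (true) {
            if (levels.size() <= level) levels.add(0);
            int count = levels.get(level) + 1;
            if (count < mergeFactor) {
                levels.set(level, count);
                return;
            }
            // mergeFactor segments at this level: merge them into one at the next level.
            levels.set(level, 0);
            merges++;
            level++;
        }
    }

    // Number of segments currently on disk in the model.
    int liveSegments() {
        int total = 0;
        for (int c : levels) total += c;
        return total;
    }

    int merges() { return merges; }
}
```

In this model, 1,000 flushes with a merge factor of 10 perform 111 merges and leave 1 segment, while a merge factor of 100 performs only 10 merges but leaves 10 segments. Fewer merges help indexing; more live segments mean more open files and slower searches.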

Some tips for speeding up indexing:

• Make sure you are using the latest Lucene version.

• Try to use the local file system

Remote file systems generally slow indexing down. If the index must live on a remote server, try building it locally first and then copying it to the remote server.

• Use faster hardware, especially faster I/O devices

• Reuse a single IndexWriter instance for the whole indexing session

• Flush by memory usage instead of by document count
In Lucene 2.2 and earlier, you can call the ramSizeInBytes method after adding each document and call flush() when the index is using too much memory. This is particularly effective when indexing many small documents or documents of unpredictable size. You must first set maxBufferedDocs large enough to keep the writer from flushing based on document count. Be careful not to set it too high, however, or you will run into bug LUCENE-845. That bug is fixed in version 2.3.

In Lucene 2.3 and later, IndexWriter can call flush() automatically based on memory consumption. Use writer.setRAMBufferSizeMB() to set the buffer size. When flushing by memory usage, make sure MaxBufferedDocs is not also set somewhere else; otherwise it becomes unclear which condition triggers the flush, because the writer flushes on whichever limit is reached first.
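
The manual pattern described for Lucene 2.2 and earlier (check memory after every added document, flush when over budget) might look like the sketch below. RamTrackingWriter is a hypothetical stand-in interface whose method names mirror the ones mentioned in the text; FakeWriter and its 2-bytes-per-char estimate exist only to make the loop runnable and say nothing about how Lucene accounts for memory.

```java
import java.util.List;

// Hypothetical stand-in for the writer; method names mirror those mentioned in the text.
interface RamTrackingWriter {
    void addDocument(String doc);
    long ramSizeInBytes();
    void flush();
}

final class RamFlushLoop {
    // Add every document, flushing whenever buffered memory reaches the budget.
    static void indexAll(RamTrackingWriter writer, List<String> docs, long maxRamBytes) {
        for (String doc : docs) {
            writer.addDocument(doc);
            if (writer.ramSizeInBytes() >= maxRamBytes) {
                writer.flush();
            }
        }
        writer.flush(); // write out whatever is still buffered
    }
}

// Toy implementation used only to exercise the loop; it assumes 2 bytes per char.
final class FakeWriter implements RamTrackingWriter {
    private long buffered = 0;
    int flushes = 0;

    public void addDocument(String doc) { buffered += 2L * doc.length(); }
    public long ramSizeInBytes() { return buffered; }
    public void flush() {
        if (buffered > 0) {
            flushes++;
            buffered = 0;
        }
    }
}
```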

• Use as much memory as you can afford
Using more memory before each flush means Lucene writes larger segments to begin with, which in turn means fewer merges later. In the LUCENE-843 tests, about 48 MB turned out to be a good value. Your application may have a different optimum, and it also depends on the machine, so run your own tests and pick a value that balances the trade-offs.

• Turn off the compound file format
Call setUseCompoundFile(false) to turn off the compound file option. Building the compound file takes extra time (in the LUCENE-888 tests, roughly 7% to 33% more). Note, however, that this greatly increases the number of file handles used by searching and indexing. If the merge factor is also large, you may run out of file handles.

• Reuse Document and Field instances
Lucene 2.3 adds a new setValue method that lets you change a field's value. The advantage is that you can reuse a single Field instance throughout the indexing process, which greatly reduces the load on the garbage collector.

It is best to create a single Document instance and add the fields you need to it. Then reuse the Field instances added to that document: call setValue on each one to change its value, and add the Document to the index again.

Note: multiple fields in a document cannot share one Field instance, and a Field's value should not be changed until the document has been added to the index. This means that if you have three fields, you must create three Field instances and then reuse them for each subsequent Document.
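
The reuse pattern and the rule from the note can be sketched without Lucene. MiniField, ReuseDemo and addToIndex below are hypothetical stand-ins, not Lucene classes; the allocation counter is there only to show that the number of field objects stays constant however many rows are indexed.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in (hypothetical, not a Lucene class) for a field with a mutable value.
final class MiniField {
    static int created = 0;   // counts allocations, for illustration only
    final String name;
    private String value;

    MiniField(String name) { this.name = name; created++; }
    void setValue(String value) { this.value = value; }
    String value() { return value; }
}

final class ReuseDemo {
    // Index rows of {id, title, body}, reusing one document and one field object per field name.
    // Returns a snapshot of what was "indexed" for each row.
    static List<String> indexAll(List<String[]> rows) {
        MiniField id = new MiniField("id");
        MiniField title = new MiniField("title");
        MiniField body = new MiniField("body");
        List<MiniField> document = new ArrayList<>();
        document.add(id);
        document.add(title);
        document.add(body);

        List<String> indexed = new ArrayList<>();
        for (String[] row : rows) {
            // Set every value first; only then hand the document to the index.
            id.setValue(row[0]);
            title.setValue(row[1]);
            body.setValue(row[2]);
            indexed.add(addToIndex(document));
        }
        return indexed;
    }

    // Stand-in for adding the document to the index: it reads the values immediately.
    private static String addToIndex(List<MiniField> document) {
        StringBuilder sb = new StringBuilder();
        for (MiniField f : document) {
            sb.append(f.name).append('=').append(f.value()).append(';');
        }
        return sb.toString();
    }
}
```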

• Reuse a single Token instance in your analyzer
Sharing a single Token instance inside the analyzer also eases the pressure on the garbage collector.

• Use the char[] interface of Token instead of the String interface
In Lucene 2.3, a Token can represent its data as a char array. This avoids the cost of building strings and of the garbage collector reclaiming them. By reusing a single Token instance and using the char[] interface, you can avoid creating new objects altogether.
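
The combination of the last two tips (one shared token object, text kept in a char array) can be illustrated with a small whitespace tokenizer. ReusableToken and SimpleTokenizer are hypothetical names for this sketch, not Lucene classes, and the real Token API differs in its details.

```java
import java.util.Arrays;

// Token stand-in (hypothetical) that keeps its text in a growable char[] buffer.
final class ReusableToken {
    char[] buffer = new char[16];
    int length = 0;

    void append(char c) {
        if (length == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2); // grow only when needed
        }
        buffer[length++] = c;
    }
}

// Whitespace tokenizer sketch that hands out the same token object on every call.
final class SimpleTokenizer {
    private final char[] input;
    private int pos = 0;
    private final ReusableToken token = new ReusableToken(); // the single shared instance

    SimpleTokenizer(char[] input) { this.input = input; }

    // Returns the shared token filled with the next term, or null at end of input.
    // The caller must consume the token before calling next() again.
    ReusableToken next() {
        while (pos < input.length && Character.isWhitespace(input[pos])) pos++;
        if (pos == input.length) return null;
        token.length = 0;
        while (pos < input.length && !Character.isWhitespace(input[pos])) {
            token.append(input[pos++]);
        }
        return token;
    }
}
```

The cost of this design is that a returned token is only valid until the next call, so any consumer that needs to keep a term must copy it out first.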

• Set autoCommit to false
Lucene 2.3 includes substantial optimizations for documents with stored fields and term vectors, which save time when merging large indexes. To see the benefit of these optimizations, reuse a single IndexWriter instance with autoCommit set to false. Note that searchers will then not see any index updates until the IndexWriter is closed. If that matters to you, keep autoCommit set to true, or periodically close and reopen your writer.

• If you index many small text fields and have no special requirements, combine their contents into one large field and index only that field. (You can of course still store the individual fields.)

• Increase mergeFactor, but bigger is not always better
A larger merge factor defers the merging of segments. This can improve indexing speed, because merging is a very time-consuming part of indexing. However, it slows down searching, and you may run out of file handles if the merge factor is set too high. A value that is too large can even reduce indexing speed, because more segments are merged at once, which greatly increases the load on the disk.

• Turn off every feature you are not actually using
If you store fields but never use them in queries, do not store them. The same goes for term vectors. If you index a large number of fields, turning off the features those fields do not need will noticeably speed up indexing.

• Use a faster analyzer
Analyzing a document can take a long time. StandardAnalyzer, for example, is relatively slow, especially in versions before Lucene 2.3. Try a simpler, faster analyzer if it meets your needs.

• Speed up document construction
Documents usually come from external data sources (such as databases, the file system, or pages crawled from websites), and fetching from these is usually slow. Optimize the performance of that retrieval as far as you can.

• Do not optimize the index unless you really need to (only when you need faster searches)

• Share one IndexWriter across multiple threads
Modern hardware is built for high concurrency (multi-core CPUs, multi-channel memory architectures and so on), so adding documents from multiple threads can bring a considerable performance gain. Even on a very old machine, adding documents concurrently makes better use of I/O and CPU. Test with different numbers of threads to find the best value.
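
A minimal sketch of the threading pattern, using only the JDK: a fixed thread pool whose tasks all add documents to one shared writer. DocSink is a hypothetical stand-in for a thread-safe writer, and CountingSink only counts documents so that the sketch can run on its own; neither is a Lucene class.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a thread-safe writer shared by all indexing threads.
interface DocSink {
    void addDocument(String doc);
}

final class ParallelIndexer {
    // Submit one task per document to a fixed pool; every task uses the same sink.
    static void indexAll(DocSink sharedSink, List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.execute(() -> sharedSink.addDocument(doc));
        }
        pool.shutdown(); // accept no new tasks; queued ones still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }
}

// Toy sink that only counts documents, to exercise the sketch.
final class CountingSink implements DocSink {
    final AtomicInteger count = new AtomicInteger();

    public void addDocument(String doc) {
        count.incrementAndGet();
    }
}
```

Making the thread count a parameter keeps it easy to rerun the same workload with 1, 2, 4 or more threads and compare timings, which is the testing the tip recommends.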

• Index groups of documents on different machines, then merge
If you have a very large number of documents to index, you can split them into several groups, index each group on a different machine, and then use writer.addIndexesNoOptimize to merge the results into one final index.

• Profile your program
If none of the suggestions above helps, run your program under a profiler and find out which parts take the most time. The results are often surprising.