When Heritrix is run with the ELFHash algorithm for multi-threaded crawling, the crawl often stops after doing nothing more than resolving DNS several times. The explanations I found online were not clear. After two days of investigation I have basically found the cause, and I offer a simple solution here.
In a multi-threaded crawl, a seed URL first has to go through DNS resolution, and that takes some time. While the DNS lookup is still in progress, the seed URL has already been tried and retried until its retries are used up, so when DNS resolution finally finishes there is no new URL left to add to the queue. In single-threaded mode, when the crawler finds that a seed needs DNS resolution, it resolves DNS first and crawls the seed again once the result is available. With multiple threads this order is disrupted: the seed may be fetched before its DNS lookup has finished, the fetch fails, and by the time DNS resolution completes there is nothing left to crawl. The specific cause is probably related to the URL queues: with multiple threads, the DNS URL and its seed are not in the same queue, so the order in which they are crawled changes.

There is a second factor. The crawl is configured with retry-delay-seconds, but in my tests with multiple threads a failed URL was not retried after the configured delay; it was retried almost immediately. So after a failed fetch the seed does not wait, it simply keeps retrying. The seed can use up many retries while DNS resolution has still not produced a result, or produces one only after DNS has been resolved several times.

In short, multi-threaded crawling changes the crawl order and breaks the retry delay, which leads to a crawl that stops after only a few DNS resolutions. PS: when the problem occurs, the number of DNS resolutions is usually equal to the max-retries value. You can try it and see.
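The timing argument can be illustrated with a small model. This is only a sketch under simplifying assumptions, not Heritrix code: the class RetryRaceSketch, the method firstSuccessfulAttempt and all of its parameters are hypothetical names, time is simulated rather than measured, and a fetch is assumed to succeed as soon as the DNS result is available.

```java
public class RetryRaceSketch {

    /**
     * Simulated seed fetch loop (hypothetical model, not Heritrix code).
     *
     * @param dnsReadyAtMs  simulated time at which the DNS result becomes available
     * @param attemptCostMs simulated time one failed fetch attempt takes
     * @param retryDelayMs  wait between attempts (0 models "retried almost immediately")
     * @param maxAttempts   number of attempts allowed before the seed is given up
     * @return the 1-based attempt that succeeds, or -1 if every attempt is used up first
     */
    static int firstSuccessfulAttempt(long dnsReadyAtMs, long attemptCostMs,
                                      long retryDelayMs, int maxAttempts) {
        long now = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (now >= dnsReadyAtMs) {
                return attempt; // DNS result is in, so the fetch can proceed
            }
            now += attemptCostMs + retryDelayMs; // failed attempt, then wait
        }
        return -1; // attempts exhausted before DNS finished
    }

    public static void main(String[] args) {
        // No effective retry delay: 30 attempts pass within 300 ms, DNS needs 2000 ms.
        System.out.println(firstSuccessfulAttempt(2000, 10, 0, 30));    // prints -1
        // A delay that is actually honoured lets the DNS result arrive in time.
        System.out.println(firstSuccessfulAttempt(2000, 10, 1000, 30)); // prints 3
        // With no delay, only a much larger attempt budget outlasts the DNS lookup.
        System.out.println(firstSuccessfulAttempt(2000, 10, 0, 300));   // prints 201
    }
}
```

In this model the seed is lost whenever the total time spent on attempts is shorter than the DNS lookup, which matches the behaviour described above when the retry delay is effectively zero.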
The solution: I have only come up with a very simple approach, and I look forward to the experts offering a better one. In the thread, check whether the URL is a seed; if it is, let the thread sleep for a while so that DNS resolution can complete. The change goes in the run() method of the ToeThread class. Right after processCrawlUri(); add:
if (curi.isSeed())
The body of that if is a thread sleep. The sleep time can be set according to network latency. Making it a bit larger does no harm: there are very few seed URLs, so the effect on crawl efficiency is small.
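The idea can be sketched in isolation as follows. This is a minimal sketch, not the actual patch: SeedDelaySketch, the Uri interface, pauseIfSeed and the 200 ms value are all hypothetical stand-ins. Only the isSeed() check and the placement right after processCrawlUri() come from the description above.

```java
public class SeedDelaySketch {

    /** Hypothetical stand-in for the crawl URI object; only isSeed() is modeled. */
    interface Uri {
        boolean isSeed();
    }

    /** Assumed pause length; the text only says to size it to network latency. */
    static final long SEED_PAUSE_MS = 200;

    /**
     * Stand-in for the step placed right after processCrawlUri() in the worker thread.
     * Sleeps only for seeds, so ordinary URLs are not slowed down.
     *
     * @return true if the thread paused for the full duration, false otherwise
     */
    static boolean pauseIfSeed(Uri curi, long pauseMs) {
        if (!curi.isSeed()) {
            return false; // non-seed URLs continue immediately
        }
        try {
            Thread.sleep(pauseMs); // give the DNS lookup time to finish
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        Uri seed = () -> true;
        Uri ordinary = () -> false;
        System.out.println(pauseIfSeed(seed, SEED_PAUSE_MS));     // prints true
        System.out.println(pauseIfSeed(ordinary, SEED_PAUSE_MS)); // prints false
    }
}
```

Because only seeds take the sleep branch, the cost is paid a handful of times per crawl rather than once per URL.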
Another way is to set max-retries to a large value; in my case a value above 30 works.
There are surely other good approaches, and discussion is welcome.
Thanks to the Heritrix group (10447185) and to the group owner, guoyunsky, whose blog is the work of a true master.