Heritrix: why multi-threaded crawls stop after repeated DNS resolution, and how to fix it

2010-11-18  Source: original to this site  Category: Java  Views: 95

When Heritrix uses the ELFHash algorithm for a multi-threaded crawl, the crawl often stops after doing nothing more than resolving DNS several times. The explanations I found online were not clear. After two days of research I have largely found the cause, and I offer a simple solution here.

In a multi-threaded crawl, a seed URL must first go through DNS resolution, which takes some time. While that lookup is still in progress, the seed URL has already been tried and retried many times and has used up its retries, so by the time DNS resolution finishes there are no new URLs to add to the queue.

In single-threaded mode, DNS resolution happens while the seed URL is being processed, and the seed is crawled again once the result is available. With multiple threads this order is disrupted: a thread may try to fetch the seed before its DNS lookup has finished, the fetch fails, and nothing is left to fetch once DNS resolution does complete. The exact cause may be related to URL queue assignment: in multi-threaded mode the DNS URL and its seed are not in the same queue, so the order in which they are crawled changes.

In addition, a retry-delay-seconds value is configured for the crawl, but in my tests with multiple threads a failed URL was not retried after the configured delay; it was retried with almost no delay. So after a failure the seed URL does not wait but keeps being retried. The seed is retried many times, yet DNS resolution either has not produced a result or only finishes after several attempts.

In short, multi-threaded crawling changes the crawl order and breaks the retry delay, and that is why the crawl stops after only a few DNS resolutions. PS: when the problem occurs, the final number of DNS resolutions generally equals the max-retries value. You can try this yourself.
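The race described above can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions, not Heritrix code: the class `SeedRetryRace`, the method `firstSuccessfulAttempt`, and all timing values are hypothetical, and a seed attempt is assumed to succeed only if DNS has already resolved when the attempt starts.

```java
public class SeedRetryRace {

    /**
     * Simplified model: the seed is attempted at time 0 and re-attempted after
     * every failure. An attempt succeeds only if DNS has resolved by then.
     * Returns the 1-based number of the first successful attempt, or -1 if
     * all attempts are used up before DNS resolution completes.
     */
    static int firstSuccessfulAttempt(long dnsReadyAtMs, int maxAttempts,
                                      long retryDelayMs, long attemptCostMs) {
        long now = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (now >= dnsReadyAtMs) {
                return attempt; // DNS result is available, the fetch can proceed
            }
            // The attempt fails; the next one starts after its cost plus the delay.
            now += attemptCostMs + retryDelayMs;
        }
        return -1; // retries exhausted while DNS was still resolving
    }

    public static void main(String[] args) {
        // DNS takes 500 ms; 30 attempts; retries fire with no delay (1 ms per attempt).
        System.out.println(firstSuccessfulAttempt(500, 30, 0, 1));
        // Same crawl, but every retry really waits 100 ms.
        System.out.println(firstSuccessfulAttempt(500, 30, 100, 1));
    }
}
```

In this model the first call prints -1 (all 30 attempts are spent within 30 ms, long before DNS is ready), while the second prints 6, which matches the observation that a working retry delay would give DNS resolution time to finish.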

The solution: I have only come up with a very simple approach, and I hope more experienced people can offer a better one. In the thread, check whether the URL is a seed; if it is, let the thread sleep for a while so that DNS resolution can complete. The change goes in the run() method of the ToeThread class. After the call to processCrawlUri(); add:

    if (curi.isSeed()) {
        Thread.sleep(1000); // give the seed's DNS lookup time to finish
    }

The sleep time can be set according to network latency. A larger value does no harm: there are very few seed URLs, so the effect on crawl efficiency is small.
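Since the sleep time depends on network latency, one option is to make it configurable instead of hard-coding it. The following is a minimal standalone sketch of that idea, not part of Heritrix: the class `SeedPause`, the method `pauseIfSeed`, and the system property name `seed.dns.wait.ms` are all hypothetical, and the boolean argument stands in for the result of `curi.isSeed()`.

```java
public class SeedPause {

    // Hypothetical property name; the default matches the 1000 ms used above.
    static final long SEED_WAIT_MS = Long.getLong("seed.dns.wait.ms", 1000L);

    /**
     * Sleeps for waitMs only when the URI is a seed.
     * Returns true if the full pause was taken, false otherwise.
     */
    static boolean pauseIfSeed(boolean isSeed, long waitMs) {
        if (!isSeed || waitMs <= 0) {
            return false; // non-seed URLs are never delayed
        }
        try {
            Thread.sleep(waitMs);
            return true;
        } catch (InterruptedException e) {
            // Preserve the interrupt so the calling thread can still react to it.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(pauseIfSeed(false, SEED_WAIT_MS)); // no pause for ordinary URLs
        System.out.println(pauseIfSeed(true, SEED_WAIT_MS));  // pauses for a seed
    }
}
```

With this shape, the wait could be tuned at startup, for example with `-Dseed.dns.wait.ms=3000` on a slow network, without recompiling.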

Another approach is to set max-retries to a larger value. In my case, a value above 30 works.
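As a rough illustration, and assuming a Heritrix 1.x crawl order file (order.xml) in which these settings live inside the frontier element, the change might look like the fragment below. The values 40 and 900 are only examples; check the element types and placement against your own order.xml.

```xml
<!-- Assumed location: inside the frontier element of order.xml (Heritrix 1.x) -->
<integer name="max-retries">40</integer>
<long name="retry-delay-seconds">900</long>
```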

There are surely other good approaches; discussion is welcome.

Thanks to the Heritrix group (10447185) and its owner, guoyunsky, who writes an excellent blog.
