First, an introduction to the frameworks
A recent project at our company needs full-text search over objects retrieved from web page content, which calls for a web crawler tool.
Two main candidates emerged from the technology selection: Heritrix and Nutch. Both are open-source Java frameworks; Heritrix is an open-source product hosted on SourceForge, while Nutch is a subproject of Apache. Both are web crawlers (spiders), and they work on essentially the same principle: traverse a site's resources in depth and capture them to the local machine. Concretely, they analyze each URI on the site, submit an HTTP request, obtain the corresponding result, and generate local files and the corresponding log information.
Here is an introduction taken from the web:
Heritrix is an "archival crawler": it aims to obtain a complete, accurate, deep copy of a site's content, including images and other non-text content, capturing and storing that content as-is. It accepts all content and does not modify page content. When the same URL is re-crawled, the new copy does not replace the previous one. The crawler is started, monitored, and adjusted through a web user interface, which allows flexible definition of the URLs to fetch.
Differences between the two:
- Nutch only fetches and stores content that it can index, whereas Heritrix accepts everything and strives to preserve the page in its original form.
- Nutch can trim content and convert content formats.
- Nutch saves content in a format optimized for the database and for later indexing, and replaces the old content on refresh; Heritrix instead appends new content.
- Nutch is run and controlled from the command line; Heritrix has a web control and management interface.
- Nutch's customizability was not strong, though it has improved somewhat by now; Heritrix exposes more controllable parameters.
Second, a preliminary summary of using Heritrix
Heritrix was tentatively selected and some tests were run; the conclusions so far:
1. Installation:
The current version is 1.12.1, and the official site is http://crawler.archive.org/ . Installation is straightforward: extract the archive to a directory, then set the system environment variable "HERITRIX_HOME" to that directory (assuming the Java environment is already configured).
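For readers on a POSIX system, this step can be sketched as below (the article assumes Windows, where HERITRIX_HOME is set through the system environment-variables dialog instead; the path used here is purely illustrative):

```shell
# Point HERITRIX_HOME at the directory the archive was extracted to.
# "$HOME/heritrix-1.12.1" is an illustrative location, not a required one.
export HERITRIX_HOME="$HOME/heritrix-1.12.1"
echo "$HERITRIX_HOME"
```

To make the setting permanent, the export line would go into the shell's startup file (e.g. ~/.profile).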
2. Post-installation step:
Extract %HERITRIX_HOME%\heritrix-1.12.1.jar into a temporary directory and copy its profiles directory to %HERITRIX_HOME%\conf\. This profiles directory works around a bug in Heritrix's default configuration.
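A rough POSIX sketch of this step follows. On a real install you would unpack the actual jar (for example with unzip or the JDK's jar tool); here stand-in temporary directories simulate the install directory and the unpacked jar so the sketch runs end-to-end, and the profile file name is hypothetical:

```shell
# Stand-ins for the real locations, so this sketch is self-contained:
HERITRIX_HOME=$(mktemp -d)   # stand-in for the Heritrix install directory
TMP=$(mktemp -d)             # stand-in for the unpacked heritrix-1.12.1.jar
mkdir -p "$HERITRIX_HOME/conf" "$TMP/profiles"
echo "placeholder" > "$TMP/profiles/default"   # hypothetical profile file

# The actual step: copy the jar's profiles directory into conf/.
cp -r "$TMP/profiles" "$HERITRIX_HOME/conf/"
ls "$HERITRIX_HOME/conf/profiles"
```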
3. Configuring the management account:
Copy %HERITRIX_HOME%\conf\jmxremote.password.template to %HERITRIX_HOME%\ and rename it to "jmxremote.password". Edit the password-related part of the file as follows:
monitorRole @PASSWORD@ ==> monitorRole admin
controlRole @PASSWORD@ ==> controlRole admin
After making the changes, save the file and set its attribute to "read-only". Then comes a very important step: open the Properties window of jmxremote.password, go to the "Security" tab, and confirm that the list under "Group or user names" contains only your current system user and no user group (e.g. Administrators); this appears to be a bug in Heritrix's security mechanism. Otherwise, Heritrix will report a permission error at startup, asking you to make the jmxremote.password file "read-only" even though that change has already been made.
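On a POSIX system the same edit can be scripted as below. The template content is recreated inline (matching the two lines shown above) so the sketch is self-contained, and chmod takes the place of the Windows read-only attribute and Security tab:

```shell
# Stand-in for the real install directory, so the sketch runs anywhere:
HERITRIX_HOME=$(mktemp -d)
printf 'monitorRole @PASSWORD@\ncontrolRole @PASSWORD@\n' \
  > "$HERITRIX_HOME/jmxremote.password.template"

# Copy/rename the template and substitute the placeholder with "admin".
sed 's/@PASSWORD@/admin/' "$HERITRIX_HOME/jmxremote.password.template" \
  > "$HERITRIX_HOME/jmxremote.password"

# Restrict the file to the current user; with looser permissions
# Heritrix refuses to start with a permission error.
chmod 600 "$HERITRIX_HOME/jmxremote.password"
cat "$HERITRIX_HOME/jmxremote.password"
```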
4. Running Heritrix:
Open CMD, change to %HERITRIX_HOME%\bin, and execute the command "heritrix --admin=admin:admin" to start Heritrix. One thing to note: Heritrix uses port 8080 by default, so make sure nothing else on the system conflicts with that port. You can then visit http://127.0.0.1:8080 to use the WUI, the web management console provided by Heritrix, and log in with "admin / admin".
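The start command can be spelled out as below. This sketch only assembles and prints the command instead of launching the crawler, since it assumes no Heritrix install is present on the machine running it:

```shell
# login:password pair for the WUI, matching the "admin / admin" used above.
ADMIN_LOGIN="admin:admin"
START_CMD="heritrix --admin=$ADMIN_LOGIN"
# Run this from %HERITRIX_HOME%\bin (Windows CMD) or $HERITRIX_HOME/bin:
echo "$START_CMD"
```

After startup, the WUI is reachable at http://127.0.0.1:8080.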
The management console exposes all of Heritrix's default configuration, and from it you can create a Job and execute it to crawl a site.
5. A simple Job:
Heritrix's configuration is very rich, but also complex and hard to get right when first creating and executing a crawl Job. After reading most of the Heritrix user documentation and a series of attempts, I have summarized a simple use case for creating and executing a Job. The use case captures the pages under www.baidu.com, but does not crawl subdomains (such as news.baidu.com). The steps are as follows, for reference:
(1) In the WUI's top navigation bar, select "Jobs". The first item presented is "Create New Job"; select its fourth sub-item, "With defaults". The first two fields, Name and Description, are arbitrary; Seeds is very important: enter http://www.baidu.com/ and note the trailing slash.
(2) Select "Modules" at the bottom to enter the module configuration page (Heritrix's extension points are implemented as modules, and you can implement your own modules to add the functionality you need). Leave the first item, "Select Crawl Scope", at its default, org.archive.crawler.deciderules.DecidingScope. In the third item from the end, "Select Writers", delete the default org.archive.crawler.writer.ARCWriterProcessor, then add org.archive.crawler.writer.MirrorWriterProcessor. This way, the pages captured while the job runs are mirrored into a local directory structure rather than packed into ARC archive files.
(3) Select "Submodules" to the right of "Modules". In the first item, "crawl-order -> scope -> decide-rules -> rules", delete the crawl-scope rule "acceptIfTranscluded" (org.archive.crawler.deciderules.TransclusionDecideRule). If it is left in place, Heritrix will crawl pages under other domains whenever an HTTP request returns a 301 or 302 redirect.
(4) In the second row of the WUI navigation bar, select "Settings" to enter the Job's configuration page. Two main changes are needed here, under http-headers: in user-agent and from, replace "PROJECT_URL_HERE" and "CONTACT_EMAIL_ADDRESS_HERE" with your own content (the replacement for "PROJECT_URL_HERE" must begin with "http://").
(5) Select "Submit job" at the far right of the second row of the WUI navigation bar.
(6) In the first row of the WUI navigation bar, select the first item, "Console", and click "Start". The crawl task then officially begins; how long it takes depends on the network conditions and the site being captured.
Following the steps above, you should be able to execute a site crawl correctly; the captured pages are stored in the mirror folder under your working directory. The user manual gives more detailed explanations of the various settings involved in creating and executing a Job.