A preliminary summary of using Heritrix

2010-11-21  Source: original to this site  Category: Internet  Views: 135

Reposted from: http://jason823.javaeye.com/blog/84206

First, an introduction to the frameworks

A recent company project needs full-text search over the content of certain web pages, which requires a web crawler tool.

There are currently two main candidates: Heritrix and Nutch. Both are open-source Java frameworks; Heritrix is an open-source project hosted on SourceForge, and Nutch is an Apache subproject. Both are web crawlers (also called spiders), and they work on essentially the same principle: traverse a site's resources in depth and capture them locally. To do this, they parse each valid URI on the site, submit an HTTP request, obtain the response, and generate a local file along with the corresponding log information.
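
The shared principle described above can be sketched in a few lines of Python. This is only an illustration and not how Heritrix or Nutch is actually implemented: the function name crawl, the fetch callback (a stand-in for a real HTTP client), and the breadth-first queue are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Traverse pages reachable from seed; return ({url: body}, log).

    fetch is a caller-supplied function url -> HTML text, or None on
    failure. It is a hypothetical placeholder for a real HTTP client.
    """
    queue = deque([seed])
    seen = {seed}
    pages = {}
    log = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            log.append(f"FAIL {url}")
            continue
        pages[url] = body          # "capture the resource locally"
        log.append(f"OK {url}")    # "the corresponding log information"
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages, log
```

Passing the fetch function in from outside keeps the sketch free of network code, so the traversal logic can be tried against a small dictionary of fake pages.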

The following introduction and comparison are taken from the web:

Heritrix is an "archival crawler": it is used to obtain complete, accurate, deep copies of site content, including images and other non-text content, and to capture and store that content. It accepts whatever content it finds and does not modify page content. Re-crawling the same URL does not replace the previous copy. The crawler is started, monitored, and adjusted through a web user interface, which allows flexible definition of the URLs to fetch.

Differences between the two:

  • Nutch fetches and stores only indexable content. Heritrix accepts everything and aims to preserve pages in their original form.
  • Nutch may trim content or convert its format.
  • Nutch stores content in a database-optimized format for later indexing, and a refresh replaces the old content. Heritrix appends new content.
  • Nutch is run and controlled from the command line. Heritrix has a web-based control and management interface.
  • Nutch's customizability is limited, though it has improved recently. Heritrix exposes more controllable parameters.

Second, a preliminary summary of using Heritrix

We have tentatively chosen Heritrix and run some tests, with the following conclusions:

1. Installation:

The current version is 1.12.1, and the official website is http://crawler.archive.org/. Installation is straightforward: extract the archive to a directory of your choice, then set the system environment variable "HERITRIX_HOME" to that directory (this assumes a Java environment is already configured).

2. Post-installation steps:

Extract %HERITRIX_HOME%\heritrix-1.12.1.jar to a temporary directory, then copy the profiles directory it contains into the %HERITRIX_HOME%\conf\ directory. This works around a bug in Heritrix's default configuration.
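
Since a jar file is an ordinary zip archive, the extract-and-copy step can also be scripted. The following is a minimal sketch, assuming the profiles directory sits at the root of the jar; the function name extract_profiles and the prefix default are illustrative choices, not part of Heritrix.

```python
import zipfile


def extract_profiles(jar_path, conf_dir, prefix="profiles/"):
    """Copy every jar entry under prefix into conf_dir.

    Assumes the profiles directory is at the root of the jar.
    Returns the list of extracted entry names.
    """
    extracted = []
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.startswith(prefix):
                # extract() recreates the entry's relative path
                # (profiles/...) underneath conf_dir.
                jar.extract(name, conf_dir)
                extracted.append(name)
    return extracted
```

It would be called with the path to heritrix-1.12.1.jar and the conf directory under HERITRIX_HOME, which skips the temporary directory altogether.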

3. Configuring the management account:

Copy %HERITRIX_HOME%\conf\jmxremote.password.template to %HERITRIX_HOME%\ and rename it to "jmxremote.password". Then edit the password entries in the file as follows:

monitorRole @PASSWORD@ ==> monitorRole admin
controlRole @PASSWORD@ ==> controlRole admin

Save the file after making the changes, and set the file's attribute to "read-only". Then comes a very important step: open the properties window for jmxremote.password, go to the "Security" tab, and under "Group or user names" confirm that the file belongs only to your current system user and not to a user group (e.g. Administrators). This is probably a bug in Heritrix's security mechanism. If you skip this step, Heritrix reports a permission error at startup telling you to set jmxremote.password to "read-only", even though you have already done so.
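
The copy, edit, and read-only steps can be sketched as a small script. This assumes the template contains the literal placeholder @PASSWORD@ as shown above; the function name write_jmx_password is hypothetical. On Windows, os.chmod only toggles the read-only attribute, so the ownership check on the "Security" tab still has to be done by hand.

```python
import os
import stat
from pathlib import Path


def write_jmx_password(template_path, target_path, password):
    """Fill the @PASSWORD@ placeholders and write a read-only file."""
    text = Path(template_path).read_text(encoding="utf-8")
    target = Path(target_path)
    target.write_text(text.replace("@PASSWORD@", password), encoding="utf-8")
    # Owner read-only. On Windows this only sets the read-only
    # attribute; it does not change the file's owner or ACL.
    os.chmod(target, stat.S_IRUSR)
    return target
```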

4. Running Heritrix:

In CMD, change to %HERITRIX_HOME%\bin and run "heritrix --admin=admin:admin" to start Heritrix. One thing to note: Heritrix uses port 8080 by default, so make sure nothing else on the system is using that port. You can then open http://127.0.0.1:8080 to use the WUI, the web management console that Heritrix provides, and log in with "admin / admin".

The management console exposes all of Heritrix's default configuration and lets you create a job and run it to crawl a site.
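
Before starting Heritrix, the port conflict mentioned above can be checked with a short script. This is a sketch under simple assumptions: it only tests whether something accepts TCP connections on the local address, and the function name port_in_use is made up for the example.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

Calling port_in_use(8080) before launching Heritrix shows whether another service already occupies the default port; calling it again afterwards is a quick way to confirm that the WUI is listening.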

5. A simple Job:

Heritrix offers very rich configuration, but it is also complex, and at first it is hard to create and run a job that crawls a site correctly. After reading most of the Heritrix user documentation and making a series of attempts, I worked out a simple use case for creating and running a job. It crawls the pages under www.baidu.com but does not crawl subdomains (such as news.baidu.com). The steps below are for reference:

(1) In the WUI's top navigation bar, select "Jobs". The first section shown is "Create New Job"; select its fourth option, "With defaults". The first two fields, Name and Description, can be anything. Seeds is very important: enter http://www.baidu.com/ and note the trailing slash.

(2) Select "Modules" at the bottom to open the module configuration page (in Heritrix, extended functionality is implemented as modules, and you can write your own modules to get the functionality you need). For the first item, "Select Crawl Scope", keep the default "org.archive.crawler.deciderules.DecidingScope". For the third item from the end, "Select Writers", remove the default "org.archive.crawler.writer.ARCWriterProcessor" and add "org.archive.crawler.writer.MirrorWriterProcessor" instead, so that crawled pages are stored locally as a mirrored directory structure rather than as ARC archive files.

(3) Select "Submodules" to the right of "Modules". Under the first item, "crawl-order -> scope -> decide-rules -> rules", delete the scope rule "acceptIfTranscluded" (org.archive.crawler.deciderules.TransclusionDecideRule). If this rule is left in place, Heritrix goes on to crawl pages under other domains whenever an HTTP request returns 301 or 302.

(4) In the second row of the WUI navigation bar, select "Settings" to open the job's configuration page. Two changes matter here: under http-headers, edit user-agent and from, replacing "PROJECT_URL_HERE" and "CONTACT_EMAIL_ADDRESS_HERE" with your own values ("PROJECT_URL_HERE" must begin with "http://").

(5) At the far right of the second row of the WUI navigation bar, select "Submit job".

(6) In the first row of the WUI navigation bar, select the first item, "Console", and click "Start" to begin the crawl. How long it takes depends on network conditions and on the depth of the site being crawled.

Following the steps above should let you crawl a site successfully; the crawled pages are stored in the mirror folder under your working directory. The various settings involved in creating and running a job are explained in more detail in the user manual.
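
The intended outcome of the use case (crawl www.baidu.com, skip subdomains such as news.baidu.com) amounts to an exact host match against the seed. The sketch below only illustrates that outcome; it is not how DecidingScope or its decide rules are implemented in Heritrix, and the function name same_host is made up for the example.

```python
from urllib.parse import urlsplit


def same_host(seed, candidate):
    """Return True if candidate is on exactly the same host as seed."""
    seed_host = urlsplit(seed).hostname
    cand_host = urlsplit(candidate).hostname
    # hostname is lower-cased by urlsplit and is None for URIs
    # without a network location (e.g. mailto: links).
    return seed_host is not None and seed_host == cand_host
```

With the seed http://www.baidu.com/, a URI such as http://www.baidu.com/more/ passes the check, while http://news.baidu.com/ does not. A redirect to another domain would likewise fall outside the scope.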
