Hadoop study notes, part 2: basic MapReduce programming

2010-12-22  Source: original to this site  Category: Internet  Views: 90

When reproducing this article, please credit the source: Taobao QA Team. Original address: http://qa.taobao.com/?p=10523

Introduction

The previous article in this series introduced the basic concepts and architecture of Hadoop. This article demonstrates basic MapReduce programming through an example. Before proceeding, it is worth reviewing the earlier material, at least enough to understand how MapReduce works.


  • Create a Maven project and add the Hadoop dependency
  • We manage the project with Maven. Create the project either in Eclipse with the popular m2eclipse plug-in or from the command line, then add the Hadoop dependency to pom.xml.

    Run the mvn eclipse:eclipse command and import the project into Eclipse; the Hadoop-related jars should then appear among the project's dependencies.


    OK, now for our first MapReduce program, which implements a word count.

  • Overview
  • A simple MapReduce program needs three things:
    1. A Mapper implementation, which processes the input key/value pairs and outputs intermediate results
    2. A Reducer implementation, which operates on the intermediate results and outputs the final result
    3. A job definition in the main method, which sets up the job and controls how it runs
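    To make the division of labour concrete before touching any Hadoop API, the three pieces listed above can be imitated in plain Java. This is only a local sketch under stated assumptions: the class name LocalWordCountSketch and its methods are made up for illustration, nothing here uses Hadoop, and the in-memory grouping merely stands in for the shuffle that the framework performs between map and reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free imitation of a word count job.
class LocalWordCountSketch {

    // "Map" step: turn one line of input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // "Reduce" step: sum all counts collected for one word.
    static int reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // "Job" step: map every line, group the pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello mapreduce");
        System.out.println(run(lines)); // prints {hadoop=1, hello=2, mapreduce=1}
    }
}
```

    In a real job the framework takes over the role of run(): it splits the input, calls the mapper and reducer on different machines, and moves the intermediate pairs between them.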
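  • Writing the Map class
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());   // reuse one Text object for every token
            context.write(word, one);    // emit (word, 1)
          }
        }
    }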

    Mapper is a generic class with four type parameters, which specify the map function's input key, input value, output key, and output value types. In the example above, the input key is not used (it represents the position of the line within the file, which we have no need for, so we ignore it), the input value is a line of text, the output key is a word, and the output value is an integer representing the number of occurrences of that word. Note that Hadoop provides its own set of basic types, optimized for serialization over the network, rather than using Java's built-in types. They are defined in the org.apache.hadoop.io package: the Text type used above corresponds to Java's String, and IntWritable corresponds to Java's Integer. Also note that no details of distributed programming appear anywhere in the code; everything is that simple.
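    The point about Hadoop's own types can be illustrated without Hadoop on the classpath. Hadoop's Writable interface declares write(DataOutput) and readFields(DataInput); the sketch below copies that contract into a made-up class, IntBox, which is a hypothetical stand-in and not a Hadoop class, and compares its size on the wire with standard Java serialization of an Integer. The class and method names are assumptions chosen for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Hypothetical stand-in for IntWritable; it is NOT a Hadoop class.
class IntBox {
    private int value;

    IntBox() {}
    IntBox(int value) { this.value = value; }

    int get() { return value; }
    void set(int value) { this.value = value; }   // mutable, so one instance can be reused

    // Same method names as Hadoop's Writable interface.
    void write(DataOutput out) throws IOException { out.writeInt(value); }
    void readFields(DataInput in) throws IOException { value = in.readInt(); }
}

class WritableSketch {
    // Serialize through the Writable-style contract: just the raw int.
    static byte[] serialize(IntBox box) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        box.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    // Rebuild the value by reading the fields back in the same order.
    static IntBox deserialize(byte[] data) throws IOException {
        IntBox box = new IntBox();
        box.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        return box;
    }

    // Standard Java serialization of an Integer, for comparison.
    static byte[] javaSerialize(Integer n) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(n);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compact = serialize(new IntBox(42));
        System.out.println("Writable-style bytes: " + compact.length);
        System.out.println("Round trip value:     " + deserialize(compact).get());
        System.out.println("java.io bytes:        " + javaSerialize(42).length);
    }
}
```

    The Writable-style form carries only the four bytes of the int, while standard serialization also writes class metadata. The mutability shown by set() is also why the mapper above can keep a single Text and a single IntWritable instance and reuse them for every token.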

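  • Writing the Reduce class
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends
                    Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                            throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                            sum += val.get();   // add up every count seen for this word
                    }
                    result.set(sum);
                    context.write(key, result); // emit (word, total)
            }
    }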

    Similarly, the four type parameters of Reducer specify the input and output types of the reduce function. In the example above, the input key is a word and the input values are the occurrence counts for that word; the reducer adds the counts together and outputs the word along with its total.
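    Because this reduce function is a plain sum, it gives the same answer whether it sees all the counts at once or sees partial sums computed earlier. That property is what allows a summing reducer to double as a combiner in a job's configuration. The following plain-Java sketch checks the idea on made-up data; CombinerSketch and the two "map task" lists are hypothetical and do not involve Hadoop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: summing partial sums equals summing everything.
class CombinerSketch {

    // Same logic as the reduce function: add up the counts for one word.
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Counts for one word as emitted by two imaginary map tasks.
        List<Integer> fromMapTask1 = Arrays.asList(1, 1, 1);
        List<Integer> fromMapTask2 = Arrays.asList(1, 1);

        // Without a combiner: the reducer receives all five 1s.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner: each map task pre-sums locally,
        // and the reducer only adds the two partial sums.
        int combined = sum(Arrays.asList(sum(fromMapTask1), sum(fromMapTask2)));

        System.out.println(direct + " " + combined); // prints "5 5"
    }
}
```

    The result is the same either way, but with pre-summing only two numbers per word need to travel to the reducer instead of five. An operation that is not associative and commutative, such as an average, could not be reused as a combiner this directly.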

  • Defining the job
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
    public class WordCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        /** Create a job; the name makes it easier to track the job and view its progress **/
        Job job = new Job(conf, "word count");
        /** To run on a Hadoop cluster the code must be packaged into a jar file (Hadoop distributes this file across the cluster); setJarByClass names a class, and Hadoop uses it to locate the jar that contains it **/
        job.setJarByClass(WordCount.class);
        /** Set the classes to use for map, combine, and reduce **/
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        /** No input format is set here because we use the default, TextInputFormat: a text file is cut by lines into InputSplits, and LineRecordReader parses each InputSplit into <key, value> pairs, where the key is the position in the file and the value is one line of the file **/
        /** Set the output key and output value types of the map and reduce functions **/
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        /** Set the input and output paths **/
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        /** Submit the job and wait for it to finish **/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
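    The comment about the default input format says that each record handed to map() has the position in the file as its key and one line of text as its value. The plain-Java sketch below imitates that for a small in-memory string. It is an approximation under stated assumptions: LineRecordSketch is a made-up class, the text is assumed to use "\n" line endings and UTF-8 encoding, and splitting a file across several InputSplits is not modelled.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical imitation of line-oriented records: key = byte offset, value = line.
class LineRecordSketch {

    static Map<Long, String> records(String text) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;   // byte position of the current line within the "file"
        int start = 0;     // character index where the current line begins
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            if (end < 0) {
                end = text.length();   // last line without a trailing newline
            }
            String line = text.substring(start, end);
            out.put(offset, line);
            // Advance by the line's UTF-8 length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String file = "hello hadoop\nhello mapreduce\n";
        // prints {0=hello hadoop, 13=hello mapreduce}
        System.out.println(records(file));
    }
}
```

    The word count mapper ignores these offsets, which is why its input key type can simply be declared as Object.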

    That essentially completes a simple MapReduce program. More complex jobs come with many configuration parameters, such as the file splitting strategy, the sorting strategy, the size of the map output memory buffer, and the number of worker threads. Understanding and mastering these parameters is what lets your MapReduce programs run at their best in a cluster environment.


    This article walked through an example of the basic MapReduce programming model, in the hope of deepening your understanding of MapReduce. Later articles will cover how to test MapReduce jobs and how to run them.
