MapReduce Design Patterns

2014-11-11  来源:本站原创  分类:Big Data and NoSQL  人气:0 

1 (总计)Summarization Patterns

1.1(数字统计)Numerical Summarizations

这个算是Built-in的,因为这就是MapReduce的模式. 相当于SQL语句里边Count/Max,WordCount也是这个的实现。

1.2(反向索引)Inverted Index Summarizations

这个看着名字很玄,其实感觉算不上模式,只能算是一种应用,并没有涉及到MapReduce的设计。其核心实质是对listof(V3)的索引处理,这是V3是一个引用Id。这个模式期望的结果是:
url-〉list of id

1.3(计数器统计)Counting with Counters

计数器很好很快,简单易用。不过代价是占用tasktracker,最重要使jobtracker的内存。所以在1.0时代建议tens,至少<100个。不过2.0时代,jobtracker变得per job,我看应该可以多用,不过它比较适合Counting这种算总数的算法。
context.getCounter(STATE_COUNTER_GROUP, UNKNOWN_COUNTER).increment(1);

2 (过滤)Filtering Patterns

2.1(简单过滤)Filtering

这个也算是Built-in的,因为这就是MapReduce中Mapper如果没有Write,那么就算过滤掉

了. 相当于SQL语句里边Where。

map(key, record):
    if we want to keep record then
    emit key,value

2.2(Bloom过滤)Bloom Filtering

以前我一直不知道为什么叫BloomFilter,看了wiki后,才知道,贴过来大家瞧瞧:
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate.
其原理可以参见这篇文章:

http://blog.csdn.net/jiaomeng/article/details/1495500
要是让我一句话说,就是根据集合内容,选取多种Hash做一个bitmap,那么如果一个词的 hash落在map中,那么它有可能是,也有可能不是。但是如果它的hash不在,则它一定没有落在里边。此过滤有点意思,在HBase中得到广泛应用。接下来得实际试验一下。

Note: 需要弄程序玩玩

2.3(Top N)Top Ten

这是一个典型的计算Top的操作,类似SQL里边的top或limit,一般都是带有某条件的top

操作。
算法实现:我喜欢伪代码,一目了然:

class mapper:
    setup():
        initialize top ten sorted list

    map(key, record):
        insert record into top ten sorted list
        if length of array is greater-than 10 then
        truncate list to a length of 10

    cleanup():
        for record in top sorted ten list:
        emit null,record

class reducer:
    setup():
        initialize top ten sorted list

    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record

2.4(排重)Distinct

这个模式也简单,就是利用MapReduce的Reduce阶段,看struct,一目了然:

map(key, record):
    emit record,null

reduce(key, records):
    emit key

3 (数据组织)Data Organization Patterns

3.1(结构化到层级化)Structured to Hierarchical

这个在算法上是join操作,在应用层面可以起到Denormalization的效果.其程序的关键之处是用到了MultipleInputs,可以引入多个Mapper,这样便于把多种Structured的或者任何格式的内容,聚合在reducer端,以前进行聚合为Hierarchical的格式.
MultipleInputs.addInputPath(job, new Path(args[0]),
TextInputFormat.class, PostMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
TextInputFormat.class, CommentMapper.class);
在Map输出的时候,这里有一个小技巧,就是把输出内容按照分类,添加了前缀prefix,这样在Reduce阶段,就可以知道数据来源,以更好的进行笛卡尔乘积或者甄别操作. 从技术上讲这样节省了自己写Writable的必要,理论上,可以定义格式,来携带更多信息. 当然了,如果有特殊排序和组合需求,还是要写特殊的Writable了.
outkey.set(post.getAttribute("ParentId"));
outvalue.set("A" + value.toString());

3.2(分区法)Partitioning

这个又来了,这个是built-in,写自己的partitioner,进行定向Reducer.

3.3(装箱法)Binning

这个有点意思,类似于分区法,不过它是MapSide Only的,效率较高,不过产生的结果可能需

要进一步merge.
The SPLIT operation in Pig implements this pattern.
具体实现上还是使用了MultipleOutputs.addNamedOutput().

// Configure the MultipleOutputs by adding an output called "bins"
// With the proper output format and mapper key/value pairs

MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class,Text.class, NullWritable.class);

// Enable the counters for the job
// If there are a significant number of different named outputs, this
// should be disabled

MultipleOutputs.setCountersEnabled(job, true);

// Map-only job
job.setNumReduceTasks(0);

3.4(全排序)Total Order Sorting

这个在Hadoop部分已经详细描述过了,略。

3.5(洗牌)Shuffling

这个的精髓在于随机key的创建。
outkey.set(rndm.nextInt());
context.write(outkey, outvalue);

4 (连接)Join Patterns

4.1(Reduce连接)Reduce Side Join

这个比较简单,Structured to Hierarchical中已经讲过了。

4.2(Mapside连接)Replicated Join

Mapside连接效率较高,但是需要把较小的数据集进行设置到distributeCache,然后把

另一份数据进入map,在map中完成连接。

4.3(组合连接)Composite Join

这种模式也是MapSide的join,而且可以进行两个大数据集的join,然而,它有一个限制就是两个数据集必须是相同组织形式的,那么何谓相同组织形式呢?
• An inner or full outer join is desired.
• All the data sets are sufficiently large.
• All data sets can be read with the foreign key as the input key to the mapper.
All data sets have the same number of partitions.
• Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set. That is, partition X of data sets A and B contain
the same foreign keys and these foreign keys are present only in partition X. For a visualization of this partitioning and sorting key, refer to Figure 5-3.
• The data sets do not change often (if they have to be prepared).

// The composite input format join expression will set how the records
// are going to be read in, and in what input format.
conf.set("mapred.join.expr", CompositeInputFormat.compose(joinType,
KeyValueTextInputFormat.class, userPath, commentPath));

4.4(笛卡尔)Cartesian Product

这个需要重写InputFormat,以便两部分数据可以在record级别联合起来。sample略。

5 (元模式)MetaPatterns

5.1(链式Job)Job Chaining

多种方式,可以写在driver里边,也可以用bash脚本调用。hadoop也提供了JobControl

可以跟踪失败的job等好的功能。

5.2(折叠Job)Chain Folding

ChainMapper and ChainReducer Approach,M+R*M

5.3(合并Job)Job Merging

合并job,就是把同数据的两个job的mapper和reducer代码级别的合并,这样可以省去

I/O和解析的时间。

6 (输入输出)Input and Output Patterns

6.1 Customizing Input and Output in Hadoop

  • InputFormat
    getSplits
    createRecordReader
  • InputSplit
    getLength()
    getLocations()
  • RecordReader
    initialize
    getCurrentKey and getCurrentValue
    nextKeyValue
    getProgress
    close
  • OutputFormat
    checkOutputSpecs
    getRecordWriter
    getOutputCommiter
  • RecordWriter
    write
    close

6.2 (产生Random数据)Generating Data

关键点:构建虚假的InputSplit,这个不像FileInputSplit基于block,只能去骗hadoop了。

6.3 External Source Output
6.4 External Source Input
6.5 Partition Pruning 剪除杂生

上面三个模式,可以用一个例子来,这个例子很有意思,就是读取Redis数据,运行MapReduce,输出到Redis,我想这个是大多数应用可以借鉴的。

Note: 需要弄程序玩玩.


(趋势)Trends in the Nature of Data

Images, Audio, and Video
Streaming Data

相关文章
  • MapReduce Design Patterns 2014-11-11

    1 (总计)Summarization Patterns 1.1(数字统计)Numerical Summarizations 这个算是Built-in的,因为这就是MapReduce的模式. 相当于SQL语句里边Count/Max,WordCount也是这个的实现. 1.2(反向索引)Inverted Index Summarizations 这个看着名字很玄,其实感觉算不上模式,只能算是一种应用,并没有涉及到MapReduce的设计.其核心实质是对listof(V3)的索引处理,这是V3是一个

  • The role of design patterns 2009-04-23

    Excellent system to build a most difficult thing is not the coding (coding), but in early to make the design (design) on the decision. Designed software development life cycle are the key stage, good design can produce good products, but not when the

  • Why Study Design Patterns 2009-04-23

    Maybe someone will ask: "Why should we study the design model?" There is a lot of reasons, some very obvious, while others are not so obvious. Learn mode The most common reason is because we can borrow it: ● Multiplexing Solutions - has been rec

  • How in practice the use of design patterns free 2009-04-23

    Designed object-oriented programming model is a hot topic, one of many more and more developers who recognize the importance of design patterns. Implementation using design patterns in various languages are also more and more of the article, but a lo

  • Their understanding of the J2EE three-tier architecture (design patterns and software differences between the contact) 2009-06-18

    As Figure 3 layer 1.J2EE points: Server-side business logic (Business Logic Tier, and there is persistent data layers, Businness Tier and the EIS Tier), server-side presentation layer (Web Tier) and express the client layer (Client Tier) J2EE design

  • Their understanding of the J2EE three-tier architecture (design patterns and software differences between the links) 2009-06-21

    As Figure 3 layer 1.J2EE points: Server-side business logic (with Business Logic Tier and data layers of durable, Businness Tier and the EIS Tier), server-side presentation layer (Web Tier), and the client presentation layer (Client Tier) J2EE design

  • Notes Head First Design Patterns 2009-07-15

    Author: water tree reprint URL: http://mifunny.info/ Head First Design Patterns first lesson: strategy models Just look at "Head First Design Patterns", chapter I talk about a story: Joe designed the designer ducks, but the Board of Directors (E

  • Design Pattern Study Notes 7: summary of the principle of common design patterns 2009-10-09

    The front part of the creation of a learning model and found that a design is more important than anything: the principle of design patterns. Design patterns for example, why this model to solve the problem this way, the other mode should it be done,

  • "Head First Design Patterns" Design Principles of collation 2009-10-12

    I will study the following course of the principle of access to the contents of each listed below for review and sharing. 1, applications may be needed to identify changes, and it changes them and do not need to distinguish between the code On this p

  • JUnit framework for design and use of design patterns 2009-10-27

    JUnit framework for design and use of design patterns Translation: Yong-Jun Hu hu.yong.jun @ ctgpc.com.cn be added 〔〕 Original: JUnit A Cook's Tour, see www.junit.org JUnit framework for design and use of the Mode 1 Translation: Yong-Jun Hu hu.yong.j

  • headfirst design patterns 2009-11-10

    Spent a little more than a week after reading a headfirst design patterns. After reading the feeling of a sudden, carefully feel like there is nothing to understand. To put it simply this book under the perception that it, headfirst relatively easy t

  • GoF design patterns Xiangjie 2010-02-08

    It can be said: GoF design pattern is an object-oriented programmers truly grasp the core idea of a required course. Although you may have passed a lot of SUN is dazzling technology certification, but if you do not study and master the GoF design pat

  • Design Patterns: Calculator Simple Implementation 2010-02-17

    If we are to write a calculator is not difficult: the following code public class MainClass { /** * @param args */ public static void main(String[] args) { // 1, The receiving client input System.out.println("----- Calculator program -----"); Sy

  • Design patterns seen by the (factory mode Factory) 2010-03-29

    In fact, we have been in development do not know the design patterns used in many ways, but some we also know specifically which model it with. Such as the factory pattern, single cases, adaptive, appearance models and some other common models. Here

  • Design Patterns as seen (decorator decorator) 2010-03-29

    Header: (from Design Patterns) To extend the code base, usually give it to add a new class or new methods. Sometimes, you might not want to run when the combination of the object using the new behavior. Interpreter pattern allows you to mix an execut

  • Design patterns seen by the (agent mode Proxy) 2010-03-29

    Common object need to complete the task is for the outside world through a common interface to provide its promised services. Sometimes, however, may be a legitimate target because of various reasons, unable to complete their routine tasks. Especiall

  • Java Design Patterns Adaptor Adapter (reference) 2010-03-29

    Font size: Big Medium Small Java Design Patterns Adaptor adapter. Definition: The two are not compatible with the class gathered together and use, are structured model, the need for Adaptee (those who are fit), and Adaptor (Adapter) in two different

  • java basic design principles and common design patterns notes 2010-03-29

    The use of basic design principles and design patterns reasons: Reuse and maintainable design principles: 1. Open - Closed Principle (ocp): a software entity should be extended, opening up (the realization of the underlying business should be flexibl

  • Design Patterns as seen (Interpreter Mode Interpreter) 2010-03-29

    Interpreter pattern is a more difficult to understand the model, but if you Command (command mode) and Composite (combined mode) is to understand the words, you will find that the Interpreter pattern is a combination of both. Why do you want to use t

  • Design patterns seen by the (strategic model strategy) 2010-03-29

    java design patterns, wrote Strategy pattern in a given input conditions, to achieve a target of the plan or program. Strategies and algorithms are similar; algorithm is a defined process, it can provide a set of input produces an output. The strateg