(Favorite) Top K algorithm detailed analysis --- Baidu interview

2010-10-06  来源:本站原创  分类:Tech  人气:142 

Problem description:

This is the Internet to find a Baidu faces questions:
Search engine each time a user through the log files to retrieve all the search strings used are recorded, the length of each query string is 1-255 bytes. Suppose there are ten million records, the query string of repeatability is relatively high, although a total of 1 million, but if, after removing duplicate, not more than 3 million. Repeat a query string the higher the degree, that the more users query it, which is more popular. Please statistics 10 most popular query string, the memory required no more than 1G.



Problem Analysis:

】 【Analysis: To count the most popular queries, the first was to statistics the number of occurrences of each Query, and then according to statistics, find the Top 10. So we can be designed based on the idea of the algorithm in two steps. The following two steps of the algorithm are given:


The first step: Query statistics

Algorithm A: Direct sequencing method

First, we can think of is the sort algorithm, first of all inside of this log are sorted Query, and then traverse the sorted Query, Query statistics the number of occurrences of each. But the subject has explicitly requested that the memory can not exceed 1G, ten million records, each record is 225Byte, it is clear to take 2.55G memory, this condition does not meet the requirements of the.

Let us recall the contents of data structures course, when the data and the memory can not be larger than the hold time, we can use external sorting methods to sort, where I use merge sort, because there is a better merge sort time complexity O (NlgN).

Drained after the order has ordered us to the Query file traversal, statistics the number of occurrences of each Query again written to the file.

Comprehensive analysis of what, sort time complexity is O (NlgN), while traversing the time complexity is O (N), so the overall time complexity of the algorithm is O (NlgN).

Algorithm II: Hash Table Method

In the previous method, we used statistical methods to sort the number of occurrences of each Query, the time complexity is NlgN, you can not have better ways to store, and time complexity lower it?
The title shows, although there are ten million Query, but because of relatively high degree of repetition, so the fact that only 300 million of the Query, each Query255Byte, so we can consider them all to go into the memory, and now just need a suitable data structure, where, Hash Table is definitely our preferred choice, because the Hash Table in query speed is very fast, almost O (1) time complexity.
So, our algorithm will have: Query string for the maintenance of a Key, Value for Query number of occurrences of the HashTable, each read a Query, if the string is not in Table, then add the string, and will Value value is set to 1; if the string in the Table, then the string can count plus one. We finally O (N) time complexity completed within the massive data processing.
This method is a comparison algorithm: to improve the time complexity an order of magnitude, but not only on the optimal time complexity, this method requires only one data file IO, and one of the IO algorithm more frequently, so the algorithm in engineering than the algorithm has a better maneuverability.


Step two: find the Top 10

Algorithms I: Sorting

I think the sort algorithm has been no stranger to everyone, here is not to say, we should note that the time complexity of sorting algorithm is NlgN, in this subject, the three million records, with 1G of memory can be saved in.

Algorithm II: some sort

Subject requirement is calculated Top 10, so we do not need to have to sort all of the Query, we only need to maintain a size 10 array initialization into 10Query, according to statistics of the number of each Query Ranking and then traverse the 300 million records, each record to and array reading of a final Query contrast, if less than this Query, then continue to traverse, otherwise, the array of data out of the last to join the current Query. Finally, when all the data traversal finished, then the array is 10 Query we are looking for the Top10.
Easy to see, so the time complexity of the algorithm is N * K, where K is the top number.

Algorithm III: Reactor

In algorithm II, we have the time complexity from NlogN optimized to NK, have to say this is a relatively big improvement, but there is no better solutions?
Analyze the algorithm II, after the completion of each comparison, the complexity of the operation required is K, as should elements being inserted into a linear form, and uses a sequential comparison. Here we note that the array is ordered, the first time every time we find the method can be used to find two points, the complexity of this operation dropped to the logK, however, the attendant problem is that data movement, because movement increased the number of data. However, this algorithm is better than Algorithm II has been improved.
Based on the above analysis, we think both have a quick look, but also fast-moving elements of the data structure? The answer is yes, it is heap.
With the heap structure, we can find the log magnitude of time and adjust / move. So here, our algorithm can be improved as such, maintaining a K (the title is 10) the size of the heap of small roots, and then traverse the 300 million of the Query, respectively, and compared the root element. . .
So this way, the algorithm complexity to development time down to NlogK, and algorithms than there has been a large improvement.


Conclusion:

At this point, our algorithm is completely over, through steps one and two of the optimal combination of our final time complexity is O (N) + O (N ') logK. If you have any good algorithm, welcome to the discussion thread.

相关文章
  • (Favorite) Top K algorithm detailed analysis --- Baidu interview 2010-10-06

    Problem description: This is the Internet to find a Baidu faces questions: Search engine each time a user through the log files to retrieve all the search strings used are recorded, the length of each query string is 1-255 bytes. Suppose there are te

  • linux top command arguments detailed under 2010-04-14

    linux top command arguments detailed under the Transfer from the Internet top command is commonly used under Linux performance analysis tools, real-time display system resource usage status of each process, similar to Windows Task Manager. The follow

  • PS command Linux operating system, detailed analysis 2010-10-10

    PS command Linux operating system, detailed analysis To monitor and control processes in the system, with the ps command to meet you. / Bin / ps ps trip is to show transient state, not dynamic continuous; If you want to run-time monitoring of the pro

  • Linux operating system commands detailed analysis PS 2010-10-10

    Linux operating system commands detailed analysis PS To monitor the process control system, use the ps command to meet you. / Bin / ps ps trip is to show transient state, not dynamic continuous; If you want to run-time monitoring of the process, shou

  • RDP protocol detailed analysis (1) 2010-03-13

    RDP protocol detailed analysis 1 Introduction 2 Overview Three of a Kind Contact level 4 shows the connection module 5 shows the functional modules 6 rdpwin structure, data flow description 7 Summary I. Introduction windows started to provide termina

  • Android GSM drive module (rild) detailed analysis of (a) basic structure and initialization 2010-09-06

    Android GSM Drive Module (rild) detailed analysis of (a) basic structure and initialization Panda brother and Opendroid reproduced on IT168 Please specify Android's RIL driver module, the hardware / ril directory was divided into rild, libril.so and

  • Android GSM drive module (rild) detailed analysis of (c) response process 2010-09-06

    Android GSM drive module (rild) detailed analysis of (c) response processes Panda brother published in IT168 and Opendroid reproduced please specify Earlier analysis of the request to terminate the writeline in the at_send_command_full_nolock in oper

  • 程序员编程艺术:第三章续.Top K算法问题的实现 2014-12-26

    前奏 在上一篇文章,程序员面试题狂想曲:第三章.寻找最小的k个数中,后来为了论证类似快速排序中partition的方法在最坏情况下,能在O(N)的时间复杂度内找到最小的k个数,而前前后后updated了10余次.所谓功夫不负苦心人,终于得到了一个想要的结果. 简单总结如下(详情,请参考原文第三章): 1.RANDOMIZED-SELECT,以序列中随机选取一个元素作为主元,可达到线性期望时间O(N)的复杂度. 2.SELECT,快速选择算法,以序列中"五分化中项的中项",或"

  • Finding the top K items in a list efficiently 2012-09-04

    Algorithms will always matter. Sure, processor speeds are still increasing. But the problems that we want to solve using those processors are increasing in size faster. People who are dealing with social network graphs, or analyzing twitter posts, or

  • Detailed analysis of 14 kinds of software testing types 2009-09-06

    Detailed analysis of 14 kinds of software testing types

  • RDP protocol detailed analysis (5) 2010-03-13

    Five functional modules Description: Mission marks Note: applies to all non-graphics-channel data. 00 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 00,000,300 Transmission Transfer start End of Transmission Transmission control Transmission Feedback 1

  • MySQL grant syntax detailed analysis 2010-07-13

    The following article is the MySQL grant syntax detailed analysis, if you are related to the MySQL grant syntax of practical interest, you can click the following article viewed. We all know the MySQL database user rights given to the simple format o

  • Android GSM drive module (rild) detailed analysis of (b) request process 2010-09-06

    Android GSM drive module (rild) detailed analysis of (b) request process Panda brother published in IT168 and Opendroid reproduced please specify 1. Multiplexed I / O operation of the mechanism mentioned above request is received, the multiplexer thr

  • Fully instantiated WebService detailed analysis 2011-01-09

    Fully instantiated WebService detailed analysis 2010-12-19 21:47:08 Source: gjrencai.com View: 957 times First, we must understand what is webservice. Conceptually speaking, may be more complicated, but we can have a macro understanding: webservice i

  • Snapshot immediately returned to normal 4 months and finally, real analysis Baidu speed of response of 301 2011-03-15

    301 friends have done with Google Baidu will understand the reaction speed is very different, Google can recognize very short time and update data. And Baidu is the same as with the older woman is very slow, I had a station mmtime at the beginning of

  • Detailed analysis with Squid reverse proxy method to achieve 2011-05-06

    from http://tech.ccidnet.com/art/737/20070417/1063859_1.html Detailed analysis with Squid reverse proxy method to achieve Posted: 2007.04.18 06:13 Source: SAN technology community Author: forgiven A proxy server is a very common use will be accessing

  • [Transfer] svnserve configuration file detailed analysis 2011-03-26

    svnserve is a light that comes with SVN server, the client by using the svn: / / or svn + ssh: / / prefix the URL to access the svnserve server, remote access SVN repository. svnserve configuration file can be set by the user and password, as well as

  • 基于堆实现的优先级队列:PriorityQueue 解决 Top K 问题 2014-09-10

    1.认识PriorityQueue PriorityQueue是从JDK1.5开始提供的新的数据结构接口,它是一种基于优先级堆的极大优先级队列.优先级队列是不同于先进先出队列的另一种队列.每次从队列中取出的是具有最高优先权的元素.如果不提供Comparator的话,优先队列中元素默认按自然顺序排列,也就是数字默认是小的在队列头,字符串则按字典序排列(参阅 Comparable),也可以根据 Comparator 来指定,这取决于使用哪种构造方法.优先级队列不允许 null 元素.依靠自然排序的优

  • Hive中实现Group By后,取Top K条记录 2015-05-02

    RT,在Hive中,使用了Group By后,是无法再sort,再取Top K的,我们可以用UDF + distributed by + sort by 实现这个功能. 参考自:EXTRACT TOP N RECORDS IN EACH GROUP IN HADOOP/HIVE Assume you have a table with three columns: user, category and value. For each user, you want to select top N

  • PHP binary search algorithm and a detailed analysis of the sample 2010-08-30

    Write himself a binary search function is php achieved. I think many people in the interview, they also encountered this problem. Well, crap too much, go! /** * Two algorithms to find * @param array $array To find an array of * @param int $min_key Th