This is a Baidu interview question found on the Internet:
Every time a user performs a search, the engine records the query string in a log file; each query string is 1-255 bytes long. Suppose there are 10 million records. The query strings repeat heavily: although the total is 10 million, after removing duplicates there are no more than 3 million distinct strings. The more often a query string repeats, the more users have searched for it, and the more popular it is. Report the 10 most popular query strings, using no more than 1 GB of memory.
Analysis: To find the most popular queries, we first count the number of occurrences of each query, and then find the Top 10 from those counts. So we can design the algorithm in two steps. The two steps are given below:
Step 1: Query statistics
Algorithm I: Direct sorting
The first idea is sorting: sort all the queries in the log, then traverse the sorted queries and count the occurrences of each one. But the problem explicitly requires that memory not exceed 1 GB, and 10 million records at 255 bytes each would clearly take about 2.55 GB of memory, so this does not meet the requirement.
Recalling the data structures course: when the data is too large to fit in memory, we can use an external sort. Here I use merge sort, since merge sort has a good time complexity of O(N log N).
After the sort is finished, we traverse the now-ordered query file, count the occurrences of each query, and write the counts to a file.
Putting it together: the sort costs O(N log N) and the traversal costs O(N), so the overall time complexity of this algorithm is O(N log N).
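The counting pass over the sorted file can be sketched as follows. This is a minimal illustration, assuming the queries arrive already sorted (as they would after the external merge sort), so that equal strings are adjacent and one linear pass suffices; the function name `count_sorted` is hypothetical.

```python
# Hypothetical sketch: after the external merge sort, equal queries are
# adjacent, so a single O(N) pass counts each query's occurrences.
def count_sorted(sorted_queries):
    """Return (query, count) pairs from an already-sorted iterable."""
    counts = []
    prev, run = None, 0
    for q in sorted_queries:
        if q == prev:
            run += 1                      # still inside the same run
        else:
            if prev is not None:
                counts.append((prev, run))  # close out the previous run
            prev, run = q, 1
    if prev is not None:
        counts.append((prev, run))          # close out the final run
    return counts

# Example input, sorted as the merge-sorted log file would be:
pairs = count_sorted(["a", "a", "b", "c", "c", "c"])
```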
Algorithm II: Hash Table Method
In the previous method, we sorted in order to count the occurrences of each Query, at a time complexity of O(N log N). Is there a better way to store the counts, with lower time complexity?
The problem states that although there are 10 million Queries, the degree of repetition is high, so in fact there are only 3 million distinct Queries, each at most 255 bytes. So we can consider holding them all in memory (3 million x 255 bytes is roughly 765 MB, within the 1 GB limit); all we need is a suitable data structure. Here, a hash table is definitely our first choice, because hash table lookups are very fast, with almost O(1) time complexity.
So our algorithm is: maintain a HashTable whose Key is the query string and whose Value is the Query's occurrence count. Each time we read a Query, if the string is not in the table, add it and set its Value to 1; if the string is already in the table, increment its count by one. In the end we finish processing the massive data set in O(N) time.
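The hash-table method can be sketched in a few lines; a Python dict is itself a hash table. This is a minimal sketch (the function name `count_queries` and the sample log are made up for illustration):

```python
# Minimal sketch of the hash-table method: one O(N) pass over the log,
# with a dict (a hash table) mapping query string -> occurrence count.
def count_queries(log):
    table = {}
    for query in log:
        # Insert with count 1 if absent, otherwise increment.
        table[query] = table.get(query, 0) + 1
    return table

log = ["foo", "bar", "foo", "baz", "foo", "bar"]
counts = count_queries(log)
```

Each log record is touched exactly once, which is also why this method needs only a single pass of file I/O.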
Compared with Algorithm I, this method improves the time complexity by a factor of log N. And beyond the better time complexity, it needs only a single pass of I/O over the data file, whereas Algorithm I performs I/O many times; so in engineering terms Algorithm II is also more practical than Algorithm I.
Step 2: Find the Top 10
Algorithm I: Sorting
I think everyone is familiar with sorting algorithms by now, so I won't repeat them here. Just note that the time complexity of sorting is O(N log N), and that in this problem the 3 million records can all be held in 1 GB of memory.
Algorithm II: Partial sort
The problem asks only for the Top 10, so we do not need to sort all of the Queries. We only need to maintain an array of size 10, initialized with the first 10 Queries ranked by their counts. Then we traverse the remaining 3 million records: compare each record with the smallest count in the array; if it is no larger than that Query, continue the traversal; otherwise, evict the last (smallest) element of the array and insert the current Query. When all the data has been traversed, the 10 Queries in the array are the Top 10 we are looking for.
It is easy to see that the time complexity of this algorithm is O(N * K), where K is the number of top elements requested.
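The partial-sort idea can be sketched as follows, here over bare counts for brevity. This is an illustration, not the only way to write it; `top_k_partial` is a hypothetical name:

```python
# Sketch of the partial sort: keep a size-K array in descending order;
# each new count is compared with the smallest (last) element.
def top_k_partial(counts, k=10):
    best = sorted(counts[:k], reverse=True)   # first K, ranked
    for c in counts[k:]:
        if c > best[-1]:                      # beats the current smallest
            best[-1] = c                      # evict the last element
            best.sort(reverse=True)           # restore order, O(K) work
    return best

top3 = top_k_partial([5, 1, 9, 3, 7, 2, 8], k=3)
```

The O(K) work after each successful comparison is exactly what the next refinement attacks.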
Algorithm III: Heap
In Algorithm II, we improved the time complexity from N log N to N * K; this is a fairly big improvement, but is there a better solution?
Analyzing Algorithm II: each successful comparison requires O(K) further work, because the element is inserted into a linear array using sequential comparison. Note, however, that the array is sorted, so we can locate the insertion point by binary search, dropping the cost of the search to O(log K); the accompanying problem is data movement, since inserting still shifts elements in the array. Even so, this is an improvement over Algorithm II.
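The binary-search refinement can be sketched with the standard-library bisect module. This is a minimal sketch (keeping the size-K array in ascending order so the smallest element sits at index 0; `top_k_bisect` is a hypothetical name):

```python
import bisect

# Sketch of the binary-search refinement: the size-K array stays sorted
# (ascending), so the insertion point is found in O(log K); the insert
# itself still shifts up to K elements.
def top_k_bisect(counts, k=10):
    best = sorted(counts[:k])            # ascending: best[0] is the smallest
    for c in counts[k:]:
        if c > best[0]:
            bisect.insort(best, c)       # O(log K) search + O(K) shift
            best.pop(0)                  # evict the smallest
    return best

top3 = top_k_bisect([5, 1, 9, 3, 7, 2, 8], k=3)
```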
Based on the above analysis, can we think of a data structure that supports both fast lookup and fast movement of elements? The answer is yes: the heap.
With a heap, we can find, adjust, and move elements in logarithmic time. So our algorithm improves to this: maintain a min-heap of size K (10 in this problem), then traverse the 3 million Queries, comparing each one with the root element and replacing the root whenever the new count is larger.
In this way, the time complexity of the algorithm drops to O(N log K), a large improvement over Algorithm II.
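The min-heap method can be sketched with Python's heapq module, which implements a min-heap over a list. A minimal sketch over (query, count) pairs; the function name `top_k_heap` is made up for illustration:

```python
import heapq

# Sketch of the min-heap method: a K-element min-heap whose root is the
# smallest of the current Top K. Each of the N counts is compared with
# the root, so the whole pass costs O(N log K).
def top_k_heap(counted, k=10):
    """counted: iterable of (query, count) pairs; returns the top k by count."""
    heap = []                                  # min-heap keyed on count
    for query, count in counted:
        if len(heap) < k:
            heapq.heappush(heap, (count, query))
        elif count > heap[0][0]:               # beats the heap root
            heapq.heapreplace(heap, (count, query))
    return sorted(heap, reverse=True)          # rank the final K

top2 = top_k_heap([("a", 5), ("b", 9), ("c", 3), ("d", 7)], k=2)
```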
At this point our algorithm is complete: combining the optimal choice from each of the two steps, the final time complexity is O(N) + O(N' log K), where N is 10 million and N' is 3 million. If you have a better algorithm, you are welcome to discuss it in the comments.