Beauty of Mathematics Series 5 - Simple Beauty: Boolean algebra and the search engine index

2010-03-26  来源:本站原创  分类:Tech  人气:223 

Author: Wu Jun, Google Fellow

[Create a search engine generally need to do this a few things: automatically download as many pages; a rapid and effective index; pages based on relevance to the fair and accurate sorting. We are introducing the Google Page Rank (page rank) had already talked about some sort of problem, where we talk about the index issue, since we talk how to measure page relevance, and to automatically download page. ]

Binary world, there can be no easier than counting method, and can not have Beable computing simpler operation of the. While each search engine, today announced how they are clever, how intelligent, in fact, is fundamentally not escape the confines of Boolean operations.

Boolean (George Boole) is a nineteenth-century England primary school math teacher. His lifetime, he did not think he was a mathematician. Bull in her spare time, like reading a mathematical theory to think about mathematical problems. 1854 " Laws of Thought "(An Investigation of the Laws of Thought, on which are founded the Mathematical Theories of Logic and Probabilities) book, the first to demonstrate how to use mathematical methods to solve logic problems.

Simple Boolean algebra could not be more simple. Computing elements of only two 1 (TRUE, true) and 0
(FALSE, false). Basic computing only "and" (AND), "or" (OR) and "non" (NOT) of three (later found that these three operations can be converted into "and" "non" AND-NOT an algorithm ). All operations can only complete the following truth table describes a few clear.

AND | 1 0
1 | 1 0
0 | 0 0
This table shows the AND operation of two elements, if a is 0, then the operation result is always 0. If the two elements are 1, computing the result is 1. For example, "the sun rises from the west," the judge is false (0), "Water can flow," the judge is true (1), then, "the sun rising from the west and the water can flow" is false (0 ).

OR | 1 0
1 | 1 1
0 | 1 0
This table shows the OR operation of two elements, if a is 1, the operation result is always 1. If the two elements are 0, computing the result is 0. For example, "Joe Smith is the first game," this conclusion is false (0), "John Doe is the first game," is true (1), then "Joe Smith or John Doe is the first" is true (1).

1 | 0
0 | 1
This table shows the NOT operator to 1 to 0, the 0 to 1. For example, if the "ivory is white" is true (1), then the "ivory than white" must be false (0).

Readers may ask such a simple theory can actually do anything. Boolean contemporary mathematicians have the same problem. In fact, in Boolean algebra in 80 years after the presentation, it really is not any decent application, until Shannon in his 1938 master's thesis that Boolean algebra to implement switching circuit, which makes the Boolean algebra as the basis for digital circuit. All mathematical and logical operations, addition, subtraction, multiplication, division, involution, evolution, etc., all can be converted into a binary Boolean operation.

Now we look at literature searching and Boolean relations. For a user to enter the keywords, the search engine to determine whether each document containing the key words, if a document containing it, and we accordingly to a logical value of this literature - truth (TRUE, or 1), otherwise, to a logical value - false (FALSE, or 0). For instance, we find the relevant literature on the application of atomic energy, but does not want to know. . . . We can write such a query, "Atomic Energy AND application AND (NOT the atomic bomb)", that meet the requirements of the documents must meet three conditions:
- Include atomic energy
- Contains application
- Does not contain an atomic bomb literature for each of the conditions above, there is a True or False answer, according to the truth table of each article will be able to figure out whether it is looking for.

The early literature search based on database query system mostly, strict compliance Boolean query. Today's search engine compared to more than smart, it automatically to the user's query into Boolean formula. Of course, when the query can not be scanned and once each to see if it met three conditions above, the need to establish an index.

The simplest structure of the index is a long binary number that a keyword appears in each literature. How many articles, there are that many digits, each corresponding to a document, a representative of the corresponding documents have this keyword, 0 for no. For example the keyword "atomic energy" corresponds to the binary number is 0100100001100001 ..., said second, fifth, ninth, tenth, 16th articles contain keywords. Note that this binary is very long. Similarly, we assume that "application" corresponding to the binary number is 0010100110000001 .... Then to find that contain both "atomic energy" and "application" of literature, as long as the two binary Boolean operation AND. According to the above truth table, we know that computation result is 0000100000000001 .... That the fifth chapter, 16th articles meet the requirements.

Note that the computer for Boolean operation is very, very fast. Now the cheapest PC can be a Boolean operation to 32, a second to billions of times. Of course, the vast majority of these binary digits are zero, we only need to record the number of bits that can be equal to 1. Thus, the search engine index becomes a large table: each row of the table corresponds to a keyword, and each one of the words followed by a number of documents that contain the keyword serial number.

In terms of the Internet search engine, every page is a document. Internet is a huge number of pages, the network also used the word very, very much. Therefore, this index is huge, in the trillions of bytes of this magnitude. Early search engines (eg Alta Vista search engine, all the previous), due to computer speed and capacity constraints, only the key to the keywords on the major indexes. Many academic journals have also requested to provide 3-5 key words on. This is not common to all too common words and function words to be found. Now, in order to ensure that any web search can provide all the search engines are indexing all the words. Page Rank for convenience, the index needs to have a lot of additional information, such as the occurrence of each word, number, etc.. Therefore, the index becomes very large, so not able to save with a computer. There was general practice is under the page number to index into a lot of copies (Shards), were stored in a different server. When receiving a query, the query was distributed to many servers, these servers in parallel processing user requests, and the results sent to the master server consolidation, and finally the results returned to the user.

No matter how complex the index, find the basic operation remains the Boolean operation. Boolean logic and mathematics to link up. Its biggest advantage is easy, fast, this search for the flood of information is essential. It is the only given is insufficient to judge whether or not, but can not give quantitative measure. Therefore, all search engines in-house search after, must meet the requirements of the website according to relevance, and then returned to the user.