Google search technology, illustrated

Google is a highly successful yet mysterious Internet search giant with a somewhat idealistic image. It is also a formidable advertising company: the search button on Google's home page is the killer application behind 20 billion U.S. dollars of annual profit, and one of the Internet's leading business and technology legends. Recently a foreign website, PPCblog, drew a detailed flow chart of a Google search. The chart shows what a search engine handling 3 million queries a day does behind the Google Search button within a response time of less than 1 second.

What happens in the less than 1 second between clicking the Google Search button and seeing the results? How does Google find content on the Internet? What kind of content gets indexed? Surely everyone wants to know the secrets behind the Google search button. Before we start, let's first take a look at Google's mysterious data centers.

Design of Google's own servers

Google's data centers are highly confidential, and the information available to us is very limited. Here are some figures: Google has more than 19 data centers in the United States and another 17 elsewhere in the world; each data center covers 500,000 square feet (46,450 m2) and costs about 6 billion U.S. dollars to build; Google's data centers are among the most efficient facilities in the world and are very green; a data center draws 50-100 megawatts of electricity and, because of cooling needs, is usually built where water is readily available; Google's servers are housed in standard shipping containers, each holding 1160 servers. That is about all we know about Google's data centers.

Figure 1 Design of Google's own servers

Figure 2 The server's built-in battery

Google's hundreds of thousands of servers are all of its own design, which the company regards as one of its core technologies (51CTO recommended article: Google to the server? Intel to be careful.) Each server is equipped with a 12-volt battery to keep it running if the main power supply fails.

As for why each server carries its own battery, Google's answer is cost. Data centers generally rely on a UPS (uninterruptible power supply), which is essentially a giant battery that supplies power temporarily during an outage until the main generator has a chance to start. Google finds that building batteries directly into the servers is cheaper, and the cost scales directly with the number of servers, so no extra capacity is wasted. Another reason is efficiency: a large-scale UPS reaches only 92-95% efficiency, which means a lot of power is wasted, whereas Google's built-in batteries are more than 99.9% efficient.
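To put the efficiency figures in perspective, here is a rough calculation in Python. It is only a sketch: it assumes that efficiency means delivered power divided by input power, and that the whole 50 MW load of a data center (the lower bound quoted above) passes through the UPS or the batteries. The function name wasted_watts is made up for this illustration.

```python
def wasted_watts(load_watts: float, efficiency: float) -> float:
    """Power lost before it reaches the servers, for a given delivered load.

    Assumes efficiency = delivered power / input power.
    """
    # Input power needed is load / efficiency; the difference is lost as heat.
    return load_watts / efficiency - load_watts


# Assumption: the full 50 MW facility load goes through the backup stage.
LOAD_WATTS = 50_000_000

scenarios = [
    ("central UPS at 92%", 0.92),
    ("central UPS at 95%", 0.95),
    ("on-board battery at 99.9%", 0.999),
]

for label, efficiency in scenarios:
    lost_mw = wasted_watts(LOAD_WATTS, efficiency) / 1e6
    print(f"{label}: about {lost_mw:.2f} MW lost")
```

Under these assumptions the central UPS loses a few megawatts while the on-board batteries lose a small fraction of one, which is the gap the paragraph above describes.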

Figure 3 Google's servers are installed in containers, each holding 1160 units

Figure 4 A Google employee at work

How does Google find and index the content you upload?

Figure 5 What happens before a user searches

Google's "crawler" tool travels to every corner of the Internet around the clock. The six steps in the figure above trace content from its appearance on the Internet, through its inclusion in Google's database, to the point where it is ready for users to search. Steps 2, 3 and 5 have many branches. All of these steps are intended to build a "pool" of information; this is the first stage. In the second stage, Google filters this pool for the content users need. Next, let's look step by step at how Google collects and integrates information.

1, a user uploads content, for example by updating a blog, a microblog, or some other type of web content.

2, Google's "crawlers" discover the update. At this step Google applies a number of decision criteria, including the following:

2.1, Google's "crawlers" travel the Internet by following links (URLs), so if no URL points to a site, that site will not be indexed.

2.2, if your robots.txt disallows indexing (of some or all of the site), Google's "spiders" will not crawl the corresponding content.

2.3, if a link pointing to your site carries a nofollow tag, Google's "crawlers" will not follow that URL to your site, as shown below:

Figure 6, Figure 7 The nofollow tag in a page's source code

URLs are like road signs for Google's "spiders" as they travel the Internet. Google naturally hopes they lead to valuable web pages, so it needs a mechanism to identify which URLs are spam, and the nofollow tag is one of the methods Google advocates. Legitimate site maintainers almost never post junk URLs, but such URLs often appear in comment threads and in large numbers on forums. URLs like the one in the example above are meaningless to Google, so to prevent the "crawlers" from reaching a site through them, a nofollow tag is automatically added to them in the source code.

2.4, Google also finds your site through blog software or an XML sitemap.

2.5, the more authoritative the websites that link to your site, the more authoritative your site becomes, but Google's "spiders" always ignore URLs that carry the nofollow tag.

The points above are roughly Google's "admission" requirements for the content it collects. It follows that posting large numbers of URLs in open areas (such as forums) to attract Google's attention is a trick that has no effect. That covers what happens before Google collects information. What happens once Google has collected it? Please read on:
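The robots.txt and nofollow rules in points 2.2 and 2.3 can be sketched with Python's standard library. This is a minimal illustration, not Google's crawler: the user agent "ExampleBot", the example.com URLs, and the helper names are all hypothetical, and for simplicity it assumes every link on the page belongs to the site whose robots.txt is supplied.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects link targets, separating out links marked rel="nofollow"."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.followable = []  # links a crawler may follow
        self.skipped = []     # links carrying rel="nofollow"

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        rel_tokens = (attr.get("rel") or "").lower().split()
        target = urljoin(self.base_url, href)
        if "nofollow" in rel_tokens:
            self.skipped.append(target)
        else:
            self.followable.append(target)


def crawl_candidates(page_url, html, robots_txt, user_agent="ExampleBot"):
    """Return the links on a page that pass both the nofollow and robots.txt checks."""
    extractor = LinkExtractor(page_url)
    extractor.feed(html)

    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    return [url for url in extractor.followable if robots.can_fetch(user_agent, url)]


# Hypothetical inputs for illustration only.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

PAGE_HTML = """
<a href="/blog/post.html">new post</a>
<a rel="nofollow" href="/comments/spam.html">buy cheap stuff</a>
<a href="/private/page.html">members only</a>
"""

candidates = crawl_candidates("https://example.com/", PAGE_HTML, ROBOTS_TXT)
print(candidates)
```

In this sketch only the blog post survives: the comment link is dropped because of nofollow, and the members-only page is dropped because robots.txt disallows /private/.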

Figure 8 Information "processing" and storage

3, the information Google has collected is then processed, in two main steps: first the information is "processed" and stored, then the indexed information is optimized. As the figure shows, processing and storage consists of two parts. Page titles and link data are stored in one index, used for breadth-first searches (so the title of an article is very important, and whoever edits titles should keep this in mind). Page content is stored in another index, used for infrequent long-tail, personalized, depth-first searches.
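The split between a title-and-link index and a content index can be pictured as two inverted indexes. The sketch below is an assumption about how such a split might be used, not a description of Google's system: the class TwoTierIndex, the whitespace tokenizer, and the rule of consulting the content index only when the title index finds nothing are all invented for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class TwoTierIndex:
    def __init__(self):
        self.title_index = defaultdict(set)  # term -> doc ids, from titles and link text
        self.body_index = defaultdict(set)   # term -> doc ids, from page content

    def add(self, doc_id, title, link_texts, body):
        for text in [title, *link_texts]:
            for term in tokenize(text):
                self.title_index[term].add(doc_id)
        for term in tokenize(body):
            self.body_index[term].add(doc_id)

    @staticmethod
    def _match_all(index, terms):
        # A document matches only if it contains every query term.
        postings = [index.get(term, set()) for term in terms]
        return set.intersection(*postings)

    def search(self, query):
        terms = tokenize(query)
        if not terms:
            return []
        hits = self._match_all(self.title_index, terms)
        if hits:
            return sorted(hits)
        # Assumed fallback: rarer, long-tail queries go to the content index.
        return sorted(self._match_all(self.body_index, terms))


index = TwoTierIndex()
index.add(1, "Google data centers", ["container servers"], "each rack has a 12-volt battery")
index.add(2, "Search basics", [], "how crawlers handle the long tail of queries")

print(index.search("data centers"))    # answered from the title index
print(index.search("12-volt battery"))  # falls through to the content index
```

The point of the design is that common queries are answered from the small title index, while the larger content index is touched only for the rarer queries.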

By now you may have realized that when you search with Google you are not searching the live, constantly updated Internet but Google's cache. Google updates that cache very quickly, however, and keeps it as synchronized with the content on the Internet as possible.

Figure 9 Optimizing the indexed information

4, Google assesses the overall authority of domain names and web pages based on their URLs.

5, websites are checked in order to prevent cheating, including the following:

5.1, Google's search quality and anti-spam review.

5.2, a large group of remote testers evaluates the quality of search results.

5.3, Google solicits reports from users about suspected PageRank spam.

5.4, Google under the Digital Millennium Copyright Act (DMCA) to remove pirated content.

6, as pages are analyzed, many pieces of data are attached to each page to help users search.
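The idea in steps 2.5 and 4, that a page gains authority from the authoritative pages linking to it, is what a PageRank-style calculation captures. The following is a simplified textbook version offered as a sketch. It is not Google's production ranking, and the damping factor 0.85, the iteration count, and the three-page link graph are assumptions chosen for illustration. Links marked nofollow would simply be left out of the input graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    links maps each page to the list of pages it links to (followable links only).
    """
    pages = set(links) | {target for targets in links.values() for target in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a and b both link to c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Here page c ends up with the highest score because two pages link to it, and page a outranks page b because its only inbound link comes from the highly ranked c.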

From the moment information appears on the Internet to its inclusion by Google and Google's analysis and optimization of the data, a "pool" of Internet information updated in real time is built up; one could say that Google stores a snapshot of the entire Internet. That is everything that happens before we press the Google search button. Next we look at how Google responds to a user's search request, and at how Google's ads reach us. Don't forget that Google makes its living from advertising.

As long as people use Google's services, it can make money. What it fears is a situation like the one with the Android phone system, where some rogue manufacturers install Android on their own smartphones but strip out all of Google's services and substitute their own, cutting Google out. Of course, whenever Android is updated, these rogue phone manufacturers get nervous.

How does Google help users search?

Figure 10 From the user's query to the preliminary results

From the user's query to the generation of preliminary results (which are not yet presented directly to the user), there are four steps:

1, the user submits a search request. Patrick Riley, a Google search quality engineer, said that in most searches your query is handled by multiple concurrent control or experimental groups run by innovative Google Labs projects; one could say that every query takes part in some of Google's innovation experiments. Are we the lab mice?

2, Google offers suggestions for the keywords the user enters.

3, Google uses synonyms to match your search terms against results with similar meaning.

4, the initial query results are generated. Although Google claims to find tens of thousands of relevant results, it generally shows fewer than 1000. The results are also localized, so local sites appear first.
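The four steps above can be sketched as a tiny pipeline: suggest completions, expand the query with synonyms, collect matches, then put local sites first and cap the list. Everything in this sketch is hypothetical, including the synonym table, the region field, the word-overlap scoring, and the function names; only the cap of 1000 results and the local-first ordering come from the text.

```python
def suggest(prefix, known_queries, limit=5):
    """Step 2: suggest known queries that start with what the user typed."""
    prefix = prefix.lower()
    return sorted(q for q in known_queries if q.startswith(prefix))[:limit]


def expand(terms, synonyms):
    """Step 3: add synonyms so pages with similar wording can match."""
    expanded = set(terms)
    for term in terms:
        expanded |= synonyms.get(term, set())
    return expanded


def initial_results(query, documents, synonyms, user_region, max_results=1000):
    """Step 4: match, put local sites first, and cap the list."""
    terms = expand(query.lower().split(), synonyms)
    matches = []
    for doc in documents:
        overlap = len(terms & set(doc["text"].lower().split()))
        if overlap:
            is_local = doc["region"] == user_region
            # Sort key: local sites first, then more matching terms, then URL.
            matches.append((not is_local, -overlap, doc["url"]))
    matches.sort()
    return [url for _, _, url in matches[:max_results]]


# Hypothetical data for illustration only.
SYNONYMS = {"car": {"automobile"}}
DOCUMENTS = [
    {"url": "a.example/us", "region": "US", "text": "cheap automobile repair"},
    {"url": "b.example/de", "region": "DE", "text": "car repair and car parts"},
    {"url": "c.example/de", "region": "DE", "text": "bicycle shop"},
]

print(suggest("goo", ["google maps", "google search", "good food"], limit=2))
print(initial_results("car repair", DOCUMENTS, SYNONYMS, user_region="DE"))
```

In this sketch the US page matches only through the synonym "automobile", and it is listed after the German page because the user's region is DE.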

