How to identify the referring search engine and its search keywords

2011-05-13  Source: original to this site  Category: Internet  Views: 147

The basic principle is to obtain the source address (the referring URL) and then parse the search engine's name and the search keywords out of it.
Getting the source address is simple: in a servlet, call HttpServletRequest.getHeader("Referer"); in a JSP page, call request.getHeader("referer"). Once we have the source address, we can analyze it to extract what we need. The following 14 search engine addresses are the common ones:

http://www.google.com;

http://www.google.cn;

http://www.sogou.com;

http://so.163.com;

http://www.iask.com;

http://www.yahoo.com;

http://www.baidu.com;

http://www.3721.com;

http://www.soso.com;

http://www.zhongsou.com;

http://www.alexa.com;

http://www.search.com;

http://www.lycos.com;

http://www.aol.com;

To extract what we need, we must look at the characteristics of each engine. Each search engine has its own URL format, so the source addresses we obtain differ as well. Let's go through the address format of each engine.

Enter a keyword in the search engine and click Search. The contents of the address bar at that point are the source address that HttpServletRequest.getHeader("Referer") or request.getHeader("referer") returns once the visitor follows a result to our site.
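Reading the header is a one-line call, but the value can be missing (a direct visit, or a client that strips it), so a small guard helps. The sketch below is only an illustration: to run without the servlet API, it takes a Function<String, String> as a stand-in for HttpServletRequest.getHeader, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch only: the Function parameter stands in for HttpServletRequest::getHeader
// so the example runs without the servlet API on the classpath.
final class RefererReader {

    /** Returns the trimmed Referer value, or null when the header is absent or blank. */
    static String referer(Function<String, String> getHeader) {
        String value = getHeader.apply("Referer");
        if (value == null || value.trim().isEmpty()) {
            return null; // direct visit, bookmark, or a client that strips the header
        }
        return value.trim();
    }

    public static void main(String[] args) {
        // In a servlet this would be: referer(request::getHeader)
        Map<String, String> headers = new HashMap<>();
        headers.put("Referer", "http://www.baidu.com/s?wd=java");
        System.out.println(referer(headers::get));
    }
}
```

In a real servlet the header name lookup is case-insensitive, which is why both "Referer" and "referer" work in the calls quoted above.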

google search engine:
(1) http://www.google.com/search?hl=zh-CN&newwindow=1&q=%E4%BB%8A%E7%A7%91%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80&btnG=%E6%90%9C%E7%B4%A2&lr=

(2) http://www.google.cn/search?hl=zh-CN&newwindow=1&q=%E6%B0%B8%E5%AE%89%E8%B7%AF%E7%81%AF&btnG=%E6%90%9C%E7%B4%A2&meta=

From these addresses we can get both the search engine name and the keywords. The search engine is obviously google; what about the keywords? After careful observation and testing, I found that the keyword is the encoded value of the parameter q, that is,
%E4%BB%8A%E7%A7%91%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80 and
%E6%B0%B8%E5%AE%89%E8%B7%AF%E7%81%AF
are the keywords that were entered.

(Some people may ask what the btnG parameter is for, since its value is encoded too. It does nothing for us and can be ignored. Try entering a keyword and clicking the Search button, then look at the address bar; then enter a keyword and press Enter instead, and look again. Compare the two addresses and you will see the difference.)
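To check that q really carries the keyword, the encoded value can be percent-decoded. This is a minimal sketch assuming the value is UTF-8 encoded, which holds for the first Google address above; the class and method names are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class DecodeKeyword {

    /** Percent-decodes a raw query value using the given charset name. */
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            // Wrapped so callers in this sketch need no checked-exception handling
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Value of q from Google address (1)
        String q = "%E4%BB%8A%E7%A7%91%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80";
        System.out.println(decode(q, "UTF-8")); // prints the six Chinese characters that were searched
    }
}
```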

baidu search engine:
(1) http://www.baidu.com/s?ie=gb2312&bs=%CB%B3%B5%C2%BC%D2%BE%DF&sr=&z=&cl=3&f=8&wd=%BD%F1%BF%C6%BF%C6%BC%BC&ct=0
(2) http://www.baidu.com/baidu?tn=nanlingcb&word=%CB%B3%B5%C2%BC%D2%BE%DF

The baidu search engine needs some explanation. When we search from http://www.baidu.com, the source address has form (1); when the search comes from elsewhere, for example a keyword typed into a browser plug-in, the source address has form (2). From either address it is easy to tell that the search engine is baidu; what about the keywords? Form (1) contains two encoded strings, so which one is the keyword? The value of wd is the keyword. Then what is bs? Search a few times with different keywords and you will find that bs is the keyword of your previous search, which is not what we want. So baidu puts the keyword in one of two places: the encoded value of the parameter wd, or the encoded value of the parameter word.
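The wd-or-word rule can be tried out without any regular expression by splitting the query string into name/value pairs. Note that this is a different technique from the regex approach this article builds later; it is a rough sketch with hypothetical names (QueryParams, firstOf), and it ignores URL fragments and repeated parameters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryParams {

    /** Splits the query string of a URL into name -> raw (still encoded) value. */
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** Returns the first non-empty value among the candidate parameter names, or null. */
    static String firstOf(String url, String... names) {
        Map<String, String> params = parse(url);
        for (String name : names) {
            String value = params.get(name);
            if (value != null && !value.isEmpty()) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String form1 = "http://www.baidu.com/s?ie=gb2312&bs=%CB%B3%B5%C2%BC%D2%BE%DF"
                + "&sr=&z=&cl=3&f=8&wd=%BD%F1%BF%C6%BF%C6%BC%BC&ct=0";
        String form2 = "http://www.baidu.com/baidu?tn=nanlingcb&word=%CB%B3%B5%C2%BC%D2%BE%DF";
        System.out.println(firstOf(form1, "wd", "word")); // %BD%F1%BF%C6%BF%C6%BC%BC
        System.out.println(firstOf(form2, "wd", "word")); // %CB%B3%B5%C2%BC%D2%BE%DF
    }
}
```

Because bs is never among the candidate names, the previous search's keyword is skipped automatically.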

sogou search engine

http://www.sogou.com/web?query=%BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC

This one is not complicated: the address tells us the search engine is sogou, and the keyword is the encoded value of the parameter query, here
%BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC. Sometimes the address carries more parameters, but they are of no use to us.

163 search engine

http://cha.so.163.com/so.php?in=seek&c=26&key=032152284&q=%BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC&x=61&y=19

This is not complicated either: the search engine name is 163, and the keyword is in the parameter q, here %BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC.

yahoo search engine

http://search.cn.yahoo.com/search?p=%D3%C0%B0%B2%C2%B7%B5%C6&source=toolbar_yassist_button&pid=58061_1006&ei=gbk

http://search.cn.yahoo.com/search?lp=%E4%B8%AD%E5%B1%B1%E5%8F%A4%E9%95%87%E7%81%AF%E9%A5%B0&p=%E4%BB%8A%E7%A7%91%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80&pid=&ei=UTF-8

The search engine name, yahoo, is easy to get. And the keyword? It is in the parameter p. The parameter lp plays the same role as bs in baidu: it holds the keyword of the previous search.

lycos search engine

http://search.lycos.com/?query=website

We rarely see this one. The address tells us the search engine is lycos, and the keyword is in the parameter query.

3721 search engine

http://seek.3721.com/index.htm?name=%D6%E9%BA%A3%CF%E3%D6%DE%C0%CD%CE%F1%CA%D0%B3%A1

Easy to get: the search engine name is 3721, and the keyword is in the parameter name.

search search engine

http://www.search.com/search?lq=d%25E4%25B8%25AD%25E5%259B%25BDd&q=%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%9B%BD%E5%92%8C

We rarely see this one either. The search engine name, search, is easy to get, and the keyword is in the parameter q. What lq holds is not yet clear, but it has nothing to do with what we want.

soso search engine

http://www.soso.com/q?w=%BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC12&sc=web&bs=%BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC1&ch=w.soso.pr&uin=&lr=chs&web_options=on

The search engine name is soso, and the keyword is in the parameter w. As with baidu, the value of bs is the keyword of the previous search.

zhongsou search engine

http://p.zhongsou.com/p?w=%BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC&l=&jk=&k=&r=&aid=&pt=1&dt=2

The search engine name is zhongsou, and the keyword is in the parameter w.

alexa search engine

http://www.alexa.com/search?q=%E4%BB%8A%E7%A7%91%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80

The search engine name is alexa, and the keyword is in the parameter q.

That completes the URL analysis for each search engine. Now that we understand the address formats of these popular engines, let's see how to extract from these strings the search engine name and the search keywords. For this we use regular expressions. We will go through the engines in turn and work out the regular expression each one needs.
Let's start with the google search engine.
As mentioned above, the google addresses we obtained look like this:
(1) http://www.google.com/search?hl=zh-CN&newwindow=1&q=%E4%BB%8A%E7%A7%91%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80&btnG=%E6%90%9C%E7%B4%A2&lr=

(2) http://www.google.cn/search?hl=zh-CN&newwindow=1&q=%E6%B0%B8%E5%AE%89%E8%B7%AF%E7%81%AF&btnG=%E6%90%9C%E7%B4%A2&meta=

In fact, there is also a form like this:
(3) http://www.google.com/custom?hl=zh-CN&inlang=zh-CN&ie=GB2312&oe=GB2312&newwindow=1&client=pub-3261928361684765&cof=FORID%3A1%3BGL%3A1%3BBGC%3AFFFFFF%3BT%3A%23000000%3BLC%3A%230000ff%3BVLC%3A%23663399%3BALC%3A%230000ff%3BGALT%3A%23008000%3BGFNT%3A%230000ff%3BGIMP%3A%230000ff%3BDIV%3A%23336699%3BLBGC%3A336699%3BAH%3Acenter%3B&q=%C5%B7%C2%FC%D5%D5%C3%F7&lr=
Feeling dizzy yet? Hold on; it gets simpler from here.

Look carefully at these three forms and you will notice that they share one shape:

http://www.google.[...]/[...]&q=[keyword][...]
where [...] stands for one or more characters.

Taking (2) as an example, with brackets added to make the parts clearer:

http://www.google.[cn]/[search?hl=zh-CN&newwindow=1]&q=[%E6%B0%B8%E5%AE%89%E8%B7%AF%E7%81%AF][&btnG=%E6%90%9C%E7%B4%A2&meta=]

Once that is clear, the regular expression for the google search engine follows:

http:\\/\\/www\\.google\\.[a-zA-Z]+\\/.+[\\&\\?]q=[^\\&]*

Now let me explain this expression.
http:\\/\\/www\\. matches http://www. Why so many backslashes? The character '.' has a special meaning in a regular expression, so it is escaped as '\.'. The character '/' is escaped the same way, as '\/', although in Java's regular expression syntax '/' is not special, so that escape is harmless but optional. The character '\' is itself special inside a Java string literal and has to be escaped too, so '\/' is written '\\/' and '\.' is written '\\.'.
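The effect of the escapes is easy to check with Pattern.matches. The snippet below is only a small demonstration; the class name EscapeDemo and its helper are invented for it.

```java
import java.util.regex.Pattern;

final class EscapeDemo {

    /** True when the whole input matches the regex. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "\\." in Java source is the regex \. and matches only a literal dot
        System.out.println(matches("www\\.google", "www.google")); // true
        System.out.println(matches("www\\.google", "wwwXgoogle")); // false
        // An unescaped dot matches any character
        System.out.println(matches("www.google", "wwwXgoogle"));   // true
        // Escaping '/' changes nothing: both forms match the same text
        System.out.println(matches("http:\\/\\/", "http://"));     // true
        System.out.println(matches("http://", "http://"));         // true
    }
}
```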

Next, google\\.[a-zA-Z]+\\/.+ matches google.com/search?hl=zh-CN&newwindow=1. Here [a-zA-Z] means any letter from a to z or from A to Z, and + means one or more, so [a-zA-Z]+ is one or more letters. [\\&\\?]q=[^\\&]* matches &q=%E6%B0%B8%E5%AE%89%E8%B7%AF%E7%81%AF. [\\&\\?] means the character & or the character ?; both are escaped here, which does no harm inside a character class. q=[^\\&]* means q= followed by zero or more characters that are not &, since [^\\&] means any character other than &. Why stop at &? Because & and everything after it no longer belong to the value of q; we want the string after q= and before the next &. That covers the expression itself.

This expression can pick the google addresses out of many source addresses, but there is a problem. What if the address is no longer of the www.google form but looks like this instead:
http://search.google.com/search?hl=zh-CN&newwindow=1&q=%E6%B0%B8%E5%AE%89%E8%B7%AF%E7%81%AF&btnG=%E6%90%9C%E7%B4%A2&meta=

Then the expression no longer fits. How do we write one that survives such changes? Quite simply, we drop the http://www part: \\.google\\.[a-zA-Z]+\\/.+[\\&\\?]q=[^\\&]*. Now an address such as http://search.google.com/... still matches (if the domain name itself changes, nothing can help). There is another case: entering www.google.com:80/ in the address bar also reaches google, so the address may carry a port. To allow for this, the expression becomes \\.google\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]q=[^\\&]*. What does (:\\d{1,}){0,1} mean? It matches something like ":80", that is, a colon followed by one or more digits; the port is optional and appears at most once, hence {0,1}. Since the expression is used to obtain the keyword, I put the keyword part into a group (it is used below). The final regular expression is:

\\.google\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]q=([^\\&]*)
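As a quick check, the final expression can be run against a few addresses. One detail to watch: the optional port is itself a capturing group, so in this expression the keyword is group 2, not group 1. The class name and the two extra test addresses below are made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GoogleKeyword {

    // The final expression from the text. Group 1 is the optional port,
    // so the keyword ends up in group 2.
    private static final Pattern GOOGLE = Pattern.compile(
            "\\.google\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]q=([^\\&]*)");

    /** Returns the raw (still percent-encoded) keyword, or null if the URL does not match. */
    static String rawKeyword(String referer) {
        Matcher m = GOOGLE.matcher(referer);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Address (2) from the text
        System.out.println(rawKeyword("http://www.google.cn/search?hl=zh-CN&newwindow=1"
                + "&q=%E6%B0%B8%E5%AE%89%E8%B7%AF%E7%81%AF&btnG=%E6%90%9C%E7%B4%A2&meta="));
        // Hypothetical address with another subdomain and an explicit port
        System.out.println(rawKeyword("http://search.google.com:80/search?hl=en&q=java"));
        // Not a google address: prints null
        System.out.println(rawKeyword("http://www.baidu.com/s?wd=java"));
    }
}
```

Because .+ is greedy, an address that contains q= more than once yields the last occurrence.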

That was the google search engine in detail. The principles for the other engines are similar, so I will go through them briefly.

baidu search engine:
The regular expression for the baidu search engine is:
\\.baidu\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]wd=([^\\&]*) or
\\.baidu\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]word=([^\\&]*)

sogou search engine

http://www.sogou.com/web?query=%BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC

Regular expression:
\\.sogou\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]query=([^\\&]*)

yahoo search engine regular expression:
\\.yahoo\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]p=([^\\&]*)

lycos search engine

http://search.lycos.com/?query=website

Regular expression (note .* rather than .+ after the slash, because here the query string follows the slash directly):
\\.lycos\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.*[\\&\\?]query=([^\\&]*)

3721 search engine

http://seek.3721.com/index.htm?name=%D6%E9%BA%A3%CF%E3%D6%DE%C0%CD%CE%F1%CA%D0%B3%A1

http://seek.3721.com/index.htm?q=%D6%E9%BA%A3%CF%E3%D6%DE%C0%CD%CE%F1%CA%D0%B3%A1

Regular expression:
\\.3721\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]p=([^\\&]*) or
\\.3721\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]name=([^\\&]*)

search search engine regular expression:
\\.search\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]q=([^\\&]*)

soso search engine regular expression:
\\.soso\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]w=([^\\&]*)

zhongsou search engine

http://p.zhongsou.com/p?w=%BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC&l=&jk=&k=&r=&aid=&pt=1&dt=2

Regular expression:
\\.zhongsou\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]w=([^\\&]*)

alexa search engine

http://www.alexa.com/search?q=%E4%BB%8A%E7%A7%91%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80

Regular expression:
\\.alexa\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]q=([^\\&]*)

iask search engine regular expression:
\\.iask\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]k=([^\\&]*) or
\\.iask\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]_searchkey=([^\\&]*)

Well, the regular expressions are written, and that is half the job. Let's set the keywords aside for a moment (we will come back to them) and look at how to obtain the search engine's name. Again, regular expressions do the work.
The following regular expressions match the google search engine:
http:\\/\\/.*\\.google\\.com(:\\d{1,}){0,1}\\/ or
http:\\/\\/.*\\.google\\.cn(:\\d{1,}){0,1}\\/

The other search engines are matched the same way. Written together:
http:\\/\\/.*\\.(google\\.com(:\\d{1,}){0,1}\\/|google\\.cn(:\\d{1,}){0,1}\\/|
baidu\\.com(:\\d{1,}){0,1}\\/|yahoo\\.com(:\\d{1,}){0,1}\\/|
iask\\.com(:\\d{1,}){0,1}\\/|sogou\\.com(:\\d{1,}){0,1}\\/|
163\\.com(:\\d{1,}){0,1}\\/|lycos\\.com(:\\d{1,}){0,1}\\/|
aol\\.com(:\\d{1,}){0,1}\\/|3721\\.com(:\\d{1,}){0,1}\\/|
search\\.com(:\\d{1,}){0,1}\\/|soso\\.com(:\\d{1,}){0,1}\\/|
zhongsou\\.com(:\\d{1,}){0,1}\\/|alexa\\.com(:\\d{1,}){0,1}\\/)
The following program obtains the search engine's name:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetEngine {

    public static void main(String[] arg) {
        GetEngine engine = new GetEngine();
        String referer = "http://www.baidu.com/s?wd=java%D1%A7%CF%B0%CA%D2";
        String engineName = engine.getSearchEngine(referer);
        System.out.println("Search engine name: " + engineName);
    }

    public String getSearchEngine(String refUrl) {
        if (refUrl.length() > 11) {
            // p matches the host of any supported search engine
            Pattern p = Pattern.compile("http:\\/\\/.*\\.("
                    + "google\\.com(:\\d{1,}){0,1}\\/|google\\.cn(:\\d{1,}){0,1}\\/|"
                    + "baidu\\.com(:\\d{1,}){0,1}\\/|yahoo\\.com(:\\d{1,}){0,1}\\/|"
                    + "iask\\.com(:\\d{1,}){0,1}\\/|sogou\\.com(:\\d{1,}){0,1}\\/|"
                    + "163\\.com(:\\d{1,}){0,1}\\/|lycos\\.com(:\\d{1,}){0,1}\\/|"
                    + "aol\\.com(:\\d{1,}){0,1}\\/|3721\\.com(:\\d{1,}){0,1}\\/|"
                    + "search\\.com(:\\d{1,}){0,1}\\/|soso\\.com(:\\d{1,}){0,1}\\/|"
                    + "zhongsou\\.com(:\\d{1,}){0,1}\\/|alexa\\.com(:\\d{1,}){0,1}\\/)");
            Matcher m = p.matcher(refUrl);
            if (m.find()) { // the source address matches the pattern above
                // m.group(0) is the whole match; m.group(1) is the domain part we want
                return insteadCode(m.group(1),
                        "(\\.com(:\\d{1,}){0,1}\\/|\\.cn(:\\d{1,}){0,1}\\/|\\.org(:\\d{1,}){0,1}\\/)",
                        ""); // replace .com, .cn, .org with ""
            }
        }
        return "not found in the search engine";
    }

    public String insteadCode(String str, String regEx, String code) {
        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(str);
        String s = m.replaceAll(code);
        return s;
    }
}
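One weakness of matching the whole address with a greedy .* is that an engine's domain appearing later in the URL, for example inside a query parameter, can also match. As an alternative technique, not the one used in this article, the host can be taken from java.net.URI and only its label before the top-level domain compared. This is a sketch under that assumption: the names are hypothetical, and it only handles hosts of the .com/.cn style covered here.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class EngineFromHost {

    // Engine names taken from the list of search engines in the article
    private static final Set<String> ENGINES = new HashSet<>(Arrays.asList(
            "google", "baidu", "yahoo", "iask", "sogou", "163", "lycos",
            "aol", "3721", "search", "soso", "zhongsou", "alexa"));

    /** Returns the engine name for a referer, or null if the host is not a known engine. */
    static String engineOf(String referer) {
        if (referer == null) {
            return null;
        }
        String host;
        try {
            host = URI.create(referer).getHost();
        } catch (IllegalArgumentException e) {
            return null; // not a syntactically valid URI
        }
        if (host == null) {
            return null;
        }
        String[] labels = host.toLowerCase().split("\\.");
        if (labels.length < 2) {
            return null;
        }
        // For www.google.cn or search.cn.yahoo.com the engine is the label before the TLD
        String candidate = labels[labels.length - 2];
        return ENGINES.contains(candidate) ? candidate : null;
    }

    public static void main(String[] args) {
        System.out.println(engineOf("http://www.baidu.com/s?wd=java%D1%A7%CF%B0%CA%D2")); // baidu
        System.out.println(engineOf("http://www.google.cn:80/search?q=x"));               // google
        System.out.println(engineOf("http://example.com/?u=http://www.baidu.com/"));      // null
    }
}
```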

The code above yields the search engine name, so it seems more than half of the task is done. What remains is a little more troublesome than what came before, and the trouble lies in encoding.
Now let's return to the regular expressions we wrote for the various search engines.
Because a lot of string concatenation is involved, a StringBuffer is used to join the strings.
StringBuffer sb = new StringBuffer();
sb.append("\\.google\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]q=([^\\&]*)")
  .append("|\\.iask\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]k=([^\\&]*)")
  .append("|\\.iask\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]_searchkey=([^\\&]*)")
  .append("|\\.sogou\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]query=([^\\&]*)")
  .append("|\\.163\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]q=([^\\&]*)")
  .append("|\\.yahoo\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]p=([^\\&]*)")
  .append("|\\.baidu\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]wd=([^\\&]*)")
  .append("|\\.baidu\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]word=([^\\&]*)")
  .append("|\\.lycos\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.*[\\&\\?]query=([^\\&]*)")
  .append("|\\.aol\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]encquery=([^\\&]*)")
  .append("|\\.3721\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]p=([^\\&]*)")
  .append("|\\.3721\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]name=([^\\&]*)")
  .append("|\\.search\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]q=([^\\&]*)")
  .append("|\\.soso\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]w=([^\\&]*)")
  .append("|\\.zhongsou\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]w=([^\\&]*)")
  .append("|\\.alexa\\.[a-zA-Z]+(:\\d{1,}){0,1}\\/.+[\\&\\?]q=([^\\&]*)");

The regular expressions for all the search engines are joined with "|" (or), so the source address only has to match one of them.
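The same alternation can also be generated from a table of engine names and keyword parameters instead of being appended by hand. The sketch below is a variant, not the article's code: it uses a non-capturing group (?:...) for the port so that every alternative contributes exactly one capturing group, and it uses .* after the slash for every engine so that the lycos form also matches. All names in it are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordPatterns {

    // engine name -> keyword parameter names, as analysed in the text
    private static final Map<String, String[]> PARAMS = new LinkedHashMap<>();
    static {
        PARAMS.put("google", new String[] {"q"});
        PARAMS.put("iask", new String[] {"k", "_searchkey"});
        PARAMS.put("sogou", new String[] {"query"});
        PARAMS.put("163", new String[] {"q"});
        PARAMS.put("yahoo", new String[] {"p"});
        PARAMS.put("baidu", new String[] {"wd", "word"});
        PARAMS.put("lycos", new String[] {"query"});
        PARAMS.put("aol", new String[] {"encquery"});
        PARAMS.put("3721", new String[] {"p", "name"});
        PARAMS.put("search", new String[] {"q"});
        PARAMS.put("soso", new String[] {"w"});
        PARAMS.put("zhongsou", new String[] {"w"});
        PARAMS.put("alexa", new String[] {"q"});
    }

    private static final Pattern COMBINED = build();

    private static Pattern build() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String[]> e : PARAMS.entrySet()) {
            for (String param : e.getValue()) {
                if (sb.length() > 0) {
                    sb.append('|');
                }
                // One capturing group per alternative: the keyword value
                sb.append("\\.").append(Pattern.quote(e.getKey()))
                  .append("\\.[a-zA-Z]+(?::\\d+)?/.*[&?]")
                  .append(Pattern.quote(param)).append("=([^&]*)");
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Returns the raw keyword of whichever alternative matched, or null. */
    static String rawKeyword(String referer) {
        Matcher m = COMBINED.matcher(referer);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                if (m.group(i) != null) {
                    return m.group(i);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(rawKeyword("http://www.sogou.com/web?query=%BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC"));
        System.out.println(rawKeyword("http://search.lycos.com/?query=website"));
    }
}
```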
As already mentioned, the keyword is encoded; what we extract directly looks like %BD%F1%BF%C6%D0%C5%CF%A2%BF%C6%BC%BC12.
We cannot read a keyword in this form, so it has to be decoded with java.net.URLDecoder.decode(String s, String enc). This method takes two parameters: the string to decode, and the character set to decode it with. The first is simply the keyword we extracted; what about the second? I only discuss Chinese here. These search engines use two character set encodings: UTF-8 and GBK.
Search engines that use only GBK:
3721, iask, sogou, 163, baidu, soso, zhongsou
Search engines that use only UTF-8:
lycos, aol, search, alexa
Search engines that use both:
google, yahoo

An engine with a single encoding is easy to handle; what about the ones with two? When an engine can use more than one encoding, the address carries a hint that says which one is in effect. Google uses UTF-8 in most cases; searches made by entering www.google.com in the browser's address bar are encoded this way. But in a case such as:

http://www.google.com/custom?hl=zh-CN&inlang=zh-CN&ie=GB2312&oe=GB2312&newwindow=1&client=pub-3261928361684765&cof=FORID%3A1%3BGL%3A1%3BBGC%3AFFFFFF%3BT%3A%23000000%3BLC%3A%230000ff%3BVLC%3A%23663399%3BALC%3A%230000ff%3BGALT%3A%23008000%3BGFNT%3A%230000ff%3BGIMP%3A%230000ff%3BDIV%3A%23336699%3BLBGC%3A336699%3BAH%3Acenter%3B&q=%C5%B7%C2%FC%D5%D5%C3%F7&lr=

the encoding is not necessarily UTF-8. Here the parameter ie specifies it: ie=GB2312, so the encoding is gb2312. GBK is a superset of gb2312, so we decode with GBK in place of gb2312. Yahoo is similar, except that in most cases it uses GBK. For example,

http://search.cn.yahoo.com/search?p=%C5%B7%C2%FC%BF%C6%BC%BC%CA%B5%D2%B5&source=toolbar_yassist_button&pid=54554_1006&f=A279_1

is GBK encoded. But in a case like this:

http://search.cn.yahoo.com/search?ei=gbk&fr=fp-tab-web-ycn&source=errhint_up_web&p=%BD%F1%BF%C6&meta=vl%3Dlang_zh-CN%26vl%3Dlang_zh-TW&pid=ysearch

the parameter ei specifies the encoding, which may be gbk or may be UTF-8.
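The rules above (a fixed charset for most engines, an ie hint for google, an ei hint for yahoo) can be collected into one small function. This is a sketch under the assumptions stated in the text; the class and method names are made up, and any hint that does not start with "gb" is simply treated as UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordCharset {

    private static final List<String> GBK_ONLY =
            Arrays.asList("3721", "iask", "sogou", "163", "baidu", "soso", "zhongsou");

    /** Returns the raw value of a query parameter, or null if it is absent. */
    static String param(String url, String name) {
        Matcher m = Pattern.compile("[&?]" + Pattern.quote(name) + "=([^&]*)").matcher(url);
        return m.find() ? m.group(1) : null;
    }

    /** Chooses the charset name used to decode the keyword. */
    static String charsetFor(String engine, String referer) {
        if (GBK_ONLY.contains(engine)) {
            return "GBK";
        }
        String hint = null;
        String fallback = "UTF-8"; // lycos, aol, search, alexa, and google by default
        if ("google".equals(engine)) {
            hint = param(referer, "ie");
        } else if ("yahoo".equals(engine)) {
            hint = param(referer, "ei");
            fallback = "GBK";      // yahoo uses GBK in most cases
        }
        if (hint == null || hint.isEmpty()) {
            return fallback;
        }
        // GBK is a superset of GB2312, so one decoder covers both hints;
        // anything else is treated as UTF-8 in this sketch
        return hint.toLowerCase().startsWith("gb") ? "GBK" : "UTF-8";
    }

    /** Decodes a raw keyword; the checked exception is wrapped for brevity. */
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("charset not available: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String yahoo = "http://search.cn.yahoo.com/search?ei=gbk&fr=fp-tab-web-ycn"
                + "&source=errhint_up_web&p=%BD%F1%BF%C6&pid=ysearch";
        String charset = charsetFor("yahoo", yahoo);
        System.out.println(charset); // GBK
        System.out.println(decode(param(yahoo, "p"), charset));
    }
}
```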
Based on the explanation above, the following program obtains the keywords for the various search engines:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetKeyword {

    public static void main(String[] arg) {
        String referer = "http://www.baidu.com/s?wd=java%D1%A7%CF%B0%CA%D2";
        if (arg.length != 0) {
            referer = arg[0];
        }
        GetKeyword getKeyword = new GetKeyword();
        String searchEngine = getKeyword.getSearchEngine(referer);
        System.out.println("searchEngine: " + searchEngine);
        System.out.println("keyword: " + getKeyword.getKeyword(referer));
    }

    public String getKeyword(String refererUrl) {
        StringBuffer sb = new StringBuffer();
        if (refererUrl != null) {
            // Group 1 is the outer group; groups 2..n hold the keyword of each alternative
            sb.append("(google\\.[a-zA-Z]+/.+[\\&\\?]q=([^\\&]*)")
              .append("|iask\\.[a-zA-Z]+/.+[\\&\\?]k=([^\\&]*)")
              .append("|iask\\.[a-zA-Z]+/.+[\\&\\?]_searchkey=([^\\&]*)")
              .append("|sogou\\.[a-zA-Z]+/.+[\\&\\?]query=([^\\&]*)")
              .append("|163\\.[a-zA-Z]+/.+[\\&\\?]q=([^\\&]*)")
              .append("|yahoo\\.[a-zA-Z]+/.+[\\&\\?]p=([^\\&]*)")
              .append("|baidu\\.[a-zA-Z]+/.+[\\&\\?]wd=([^\\&]*)")
              .append("|baidu\\.[a-zA-Z]+/.+[\\&\\?]word=([^\\&]*)")
              .append("|lycos\\.[a-zA-Z]+/.*[\\&\\?]query=([^\\&]*)")
              .append("|aol\\.[a-zA-Z]+/.+[\\&\\?]encquery=([^\\&]*)")
              .append("|3721\\.[a-zA-Z]+/.+[\\&\\?]p=([^\\&]*)")
              .append("|3721\\.[a-zA-Z]+/.+[\\&\\?]name=([^\\&]*)")
              .append("|search\\.[a-zA-Z]+/.+[\\&\\?]q=([^\\&]*)")
              .append("|soso\\.[a-zA-Z]+/.+[\\&\\?]w=([^\\&]*)")
              .append("|zhongsou\\.[a-zA-Z]+/.+[\\&\\?]w=([^\\&]*)")
              .append("|alexa\\.[a-zA-Z]+/.+[\\&\\?]q=([^\\&]*)")
              .append(")");
            Pattern p = Pattern.compile(sb.toString());
            Matcher m = p.matcher(refererUrl);
            return decoderKeyword(m, refererUrl);
        }
        return null;
    }

    public String decoderKeyword(Matcher m, String refererUrl) {
        String keyword = null;
        String encode = "UTF-8";
        String searchEngine = getSearchEngine(refererUrl);
        if (searchEngine != null) {
            // GBK-only engines, and yahoo unless ei=utf-8 is given, are decoded as GBK
            if (checkCode("3721|iask|sogou|163|baidu|soso|zhongsou", searchEngine)
                    || (checkCode("yahoo", searchEngine)
                        && !checkCode("ei=utf-8", refererUrl.toLowerCase()))) {
                encode = "GBK";
            }
            if (m.find()) {
                for (int i = 2; i <= m.groupCount(); i++) {
                    if (m.group(i) != null) { // this group holds the keyword
                        try {
                            keyword = URLDecoder.decode(m.group(i), encode);
                        } catch (UnsupportedEncodingException e) {
                            System.out.println(e.getMessage());
                        }
                        break;
                    }
                }
            }
        }
        return keyword;
    }

    public String getSearchEngine(String refUrl) {
        if (refUrl.length() > 11) {
            // p matches the host of any supported search engine
            Pattern p = Pattern.compile("http:\\/\\/.*\\.("
                    + "google\\.com(:\\d{1,}){0,1}\\/|google\\.cn(:\\d{1,}){0,1}\\/|"
                    + "baidu\\.com(:\\d{1,}){0,1}\\/|yahoo\\.com(:\\d{1,}){0,1}\\/|"
                    + "iask\\.com(:\\d{1,}){0,1}\\/|sogou\\.com(:\\d{1,}){0,1}\\/|"
                    + "163\\.com(:\\d{1,}){0,1}\\/|lycos\\.com(:\\d{1,}){0,1}\\/|"
                    + "aol\\.com(:\\d{1,}){0,1}\\/|3721\\.com(:\\d{1,}){0,1}\\/|"
                    + "search\\.com(:\\d{1,}){0,1}\\/|soso\\.com(:\\d{1,}){0,1}\\/|"
                    + "zhongsou\\.com(:\\d{1,}){0,1}\\/|alexa\\.com(:\\d{1,}){0,1}\\/)");
            Matcher m = p.matcher(refUrl);
            if (m.find()) {
                return insteadCode(m.group(1),
                        "(\\.com(:\\d{1,}){0,1}\\/|\\.cn(:\\d{1,}){0,1}\\/|\\.org(:\\d{1,}){0,1}\\/)",
                        "");
            }
        }
        return "not found in the search engine";
    }

    public String insteadCode(String str, String regEx, String code) {
        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(str);
        String s = m.replaceAll(code);
        return s;
    }

    public boolean checkCode(String regEx, String str) {
        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(str);
        return m.find();
    }
}
