[Repost] Java regular expression summary

2010-09-24  Source: original content of this site  Category: Java  Views: 375

Study: Wind in the bamboo Time: 2010-09-21
Source: http://www.cnblogs.com/fzzl/archive/2010/09/21/1832794.html
Searching Google for "Java regular expression" does not easily turn up a really good summary of Java regular expressions. I read several articles and reproduce the better ones below:
Java regular expressions explained: http://www.blogjava.net/Werther/archive/2009/06/10/281198.html
Unveiling the mystery of regular expressions: http://www.regexlab.com/zh/regref.htm
Java regular expressions in detail: http://edu.yesky.com/edupxpt/18/2143018.shtml

Java regular expressions explained: constructs and their meanings
1. Characters

x        The character x. For example, a matches the character a.
\\       The backslash character. In Java source it must be written as \\\\. (Note: Java first parses the string literal, turning \\\\ into the regular expression \\, and the regex engine then parses that into a single \. So every escape that is not listed in this section, and the \\ of this section as well, needs each backslash written twice in Java source.)
\0n      The character with octal value 0n (0 <= n <= 7)
\0nn     The character with octal value 0nn (0 <= n <= 7)
\0mnn    The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh     The character with hexadecimal value 0xhh
\uhhhh   The character with hexadecimal value 0xhhhh
\t       The tab character ('\u0009')
\n       The newline (line feed) character ('\u000A')
\r       The carriage return character ('\u000D')
\f       The form feed character ('\u000C')
\a       The alert (bell) character ('\u0007')
\e       The escape character ('\u001B')
\cx      The control character corresponding to x
2. Character classes
[abc]          a, b, or c (simple class). For example, [egd] matches e, g, or d.
[^abc]         Any character except a, b, or c (negation). For example, [^egd] matches any character other than e, g, or d.
[a-zA-Z]       a through z or A through Z, inclusive (range)
[a-d[m-p]]     a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]   d, e, or f (intersection)
[a-z&&[^bc]]   a through z, except b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]  a through z, but not m through p: [a-lq-z] (subtraction)
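The union, intersection, and subtraction forms above can be tried out with a short sketch. The class name CharClassDemo and the helper inClass are made up for this illustration; only Pattern.matches comes from java.util.regex.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True when the whole input is exactly one character belonging to the class.
    static boolean inClass(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(inClass("[a-d[m-p]]", "n"));    // union: a-d or m-p
        System.out.println(inClass("[a-z&&[def]]", "e"));  // intersection: only d, e, f
        System.out.println(inClass("[a-z&&[^bc]]", "b"));  // subtraction: b is excluded
        System.out.println(inClass("[a-z&&[^m-p]]", "q")); // subtraction: q lies outside m-p
    }
}
```

Because Pattern.matches tests the whole input, a one-character input is enough to probe membership in a class.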
3. Predefined character classes (note that in Java source each backslash is written twice, e.g. \d is written as \\d)
.    Any character (may or may not match line terminators)
\d   A digit: [0-9]
\D   A non-digit: [^0-9]
\s   A whitespace character: [ \t\n\x0B\f\r]
\S   A non-whitespace character: [^\s]
\w   A word character: [a-zA-Z_0-9]
\W   A non-word character: [^\w]
4. POSIX character classes (US-ASCII only) (note that in Java source each backslash is written twice, e.g. \p{Lower} is written as \\p{Lower})
\p{Lower}   A lowercase alphabetic character: [a-z]
\p{Upper}   An uppercase alphabetic character: [A-Z]
\p{ASCII}   All ASCII: [\x00-\x7F]
\p{Alpha}   An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}   A decimal digit: [0-9]
\p{Alnum}   An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}   Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}   A visible character: [\p{Alnum}\p{Punct}]
\p{Print}   A printable character: [\p{Graph}\x20]
\p{Blank}   A space or a tab: [ \t]
\p{Cntrl}   A control character: [\x00-\x1F\x7F]
\p{XDigit}  A hexadecimal digit: [0-9a-fA-F]
\p{Space}   A whitespace character: [ \t\n\x0B\f\r]
5. java.lang.Character classes (simple Java character type)
\p{javaLowerCase}   Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}   Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}  Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}    Equivalent to java.lang.Character.isMirrored()
6. Classes for Unicode blocks and categories
\p{InGreek}         A character in the Greek block (simple block)
\p{Lu}              An uppercase letter (simple category)
\p{Sc}              A currency symbol
\P{InGreek}         Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]  Any letter except an uppercase letter (subtraction)
7. Boundary matchers
^    The beginning of a line; used at the beginning of a regular expression. For example, ^(abc) means the string begins with abc. Note that by default ^ matches only at the beginning of the input; to match at the beginning of every line, compile the pattern with the MULTILINE flag, e.g. Pattern p = Pattern.compile(regex, Pattern.MULTILINE);
$    The end of a line; used at the end of a regular expression. For example, (^bca).*(abc$) matches a line that begins with bca and ends with abc.
\b   A word boundary. For example, \b(abc) matches abc at the start of a word (abcjj matches, jjabc does not); (abc)\b matches abc at the end of a word (jjabc matches).
\B   A non-word boundary. For example, \B(abc) matches abc that does not start a word (jjabcjj and jjabc match, abcjj does not).
\A   The beginning of the input
\G   The end of the previous match (in my personal opinion this one is of little use). For example, \\Gdog looks for dog exactly where the previous match ended; for the first match that position is the beginning of the input, so if the input does not begin with dog there is no match.
\Z   The end of the input, except for the final line terminator, if any
\z   The end of the input
A line terminator is a sequence of one or two characters that marks the end of a line in the input character sequence.
The following are recognized as line terminators:
- A newline (line feed) character ('\n'),
- A carriage return immediately followed by a newline ("\r\n"),
- A standalone carriage return ('\r'),
- A next-line character ('\u0085'),
- A line separator ('\u2028'), or
- A paragraph separator ('\u2029').
When a pattern is compiled, one or more flags can be set, for example:
Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
The following six flags are supported:
- CASE_INSENSITIVE: match without regard to case. By default this flag considers only US-ASCII characters.
- UNICODE_CASE: when combined with CASE_INSENSITIVE, case-insensitive matching uses Unicode letters.
- MULTILINE: ^ and $ match the beginning and end of each line, not only of the entire input.
- UNIX_LINES: in multi-line mode, only '\n' is treated as a line terminator when matching ^ and $.
- DOTALL: with this flag, the . symbol matches every character, including line terminators.
- CANON_EQ: takes the canonical equivalence of Unicode characters into account.
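The effect of the flags is easiest to see side by side. The following is a minimal sketch; the class FlagsDemo, the helper findAll, and the sample text are made up for illustration, and the expected results in the comments follow from the flag descriptions above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Collects every match of regex in input, compiled with the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "abc one\nABC two\nxyz abc";
        // Default: ^ matches only at the very beginning of the input.
        System.out.println(findAll("^abc", 0, text));                 // [abc]
        // MULTILINE: ^ also matches after each line terminator; flags combine with |.
        System.out.println(findAll("^abc",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE, text)); // [abc, ABC]
        // Default: . does not match '\n'; with DOTALL it does.
        System.out.println(findAll("one.ABC", 0, text).size());              // 0
        System.out.println(findAll("one.ABC", Pattern.DOTALL, text).size()); // 1
    }
}
```

The flags are bit masks, so combining them with | is the conventional form.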
8. Greedy quantifiers
X?       X, once or not at all
X*       X, zero or more times
X+       X, one or more times
X{n}     X, exactly n times
X{n,}    X, at least n times
X{n,m}   X, at least n times, but not more than m times
9. Reluctant quantifiers
X??      X, once or not at all
X*?      X, zero or more times
X+?      X, one or more times
X{n}?    X, exactly n times
X{n,}?   X, at least n times
X{n,m}?  X, at least n times, but not more than m times
10. Possessive quantifiers
X?+      X, once or not at all
X*+      X, zero or more times
X++      X, one or more times
X{n}+    X, exactly n times
X{n,}+   X, at least n times
X{n,m}+  X, at least n times, but not more than m times
The difference between greedy, reluctant, and possessive quantifiers (it only shows when the quantified part is ambiguous, as with .):
A greedy quantifier is called "greedy" because the matcher first reads in the entire input string before attempting the first match. If that first attempt (the entire input) fails, the matcher backs off by one character and tries again, repeating the process until a match is found or there are no more characters to back off from. Depending on the quantifier used, the last thing it tries to match against is 1 or 0 characters.
Reluctant quantifiers take the opposite approach: they start at the beginning of the input string and reluctantly read one character at a time while looking for a match. The last thing they try is the entire input string.
Finally, possessive quantifiers always read the entire input string and try once (and only once) for a match. Unlike greedy quantifiers, possessive quantifiers never back off.
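A small sketch can make the three behaviours concrete. The class QuantifierModesDemo, the helper findAll, and the input string "xfooxxxxxxfoo" are chosen for illustration; they do not come from the article above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModesDemo {
    // Collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off until "foo" fits at the end.
        System.out.println(findAll(".*foo", input));   // [xfooxxxxxxfoo]
        // Reluctant: .*? grows one character at a time, so the nearest "foo" wins.
        System.out.println(findAll(".*?foo", input));  // [xfoo, xxxxxxfoo]
        // Possessive: .*+ takes everything and never backs off, so "foo" cannot match.
        System.out.println(findAll(".*+foo", input));  // []
    }
}
```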
11. Logical operators
XY    X followed by Y
X|Y   Either X or Y
(X)   X, as a capturing group. For example, (abc) captures abc as a whole.
12. Back references
\n    Whatever the nth capturing group matched. Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))) there are four such groups:
1  ((A)(B(C)))
2  (A)
3  (B(C))
4  (C)
Within an expression, \n refers to the corresponding group. For example, (ab)34\1 matches ab34ab, and (ab)34(cd)\1\2 matches ab34cdabcd.
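The group numbering and the two back-reference examples can be checked with a short sketch. The class BackRefDemo and the helper groups are hypothetical names used only here.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackRefDemo {
    // Returns groups 0..n when the whole input matches, or null otherwise.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] g = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            g[i] = m.group(i);
        }
        return g;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parenthesis, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // prints [ABC, ABC, A, BC, C]

        // In Java source each backslash of \1 and \2 is doubled.
        System.out.println(Pattern.matches("(ab)34\\1", "ab34ab"));             // true
        System.out.println(Pattern.matches("(ab)34(cd)\\1\\2", "ab34cdabcd")); // true
        System.out.println(Pattern.matches("(ab)34\\1", "ab34cd"));             // false
    }
}
```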
13. Quotation
\    Nothing, but quotes the following character
\Q   Nothing, but quotes all characters until \E. The text between \Q and \E is used literally (apart from the escapes of section 1, which Java itself resolves in the string literal). For example, the pattern written in Java source as ab\\Q(|)\\\\E matches the text ab(|)\ (written ab(|)\\ in Java source).
\E   Nothing, but ends the quoting started by \Q
14. Special constructs (non-capturing)
(?:X)   X, as a non-capturing group
(?idmsux-idmsux)   Nothing, but turns match flags on - off. For example, in the expression (?i)abc(?-i)def, (?i) turns case-insensitive matching on, so abc is matched without regard to case; (?-i) turns it off again, so def is matched case-sensitively.
The flags idmsux are as follows:
- i  CASE_INSENSITIVE: case-insensitive for the US-ASCII character set. (?i)
- d  UNIX_LINES: Unix lines mode. (?d)
- m  MULTILINE: multi-line mode. (?m)
- s  DOTALL: changes the behavior of . so that it also matches line terminators (\n on UNIX, \r\n on Windows). (?s)
- u  UNICODE_CASE: Unicode-aware case folding. (?u)
- x  COMMENTS: comments may be used inside the pattern; whitespace in the pattern is ignored, as is everything from "#" to the end of the line. (?x) For example, (?x)abc#asfsdadsa matches the string abc.
(?idmsux-idmsux:X)   X, as a non-capturing group with the given flags on - off. Similar to the above; the expression above can be rewritten as (?i:abc)def or (?i)abc(?-i:def).
(?=X)   X, via zero-width positive lookahead. It matches only if the subexpression X matches to the right of this position. For example, \w+(?=\d) matches word characters that are followed by a digit, without including that digit in the match.
(?!X)   X, via zero-width negative lookahead. It matches only if the subexpression X does not match to the right of this position. For example, \w+(?!\d) matches word characters that are not followed by a digit, and the digit is not captured.
(?<=X)  X, via zero-width positive lookbehind. It matches only if the subexpression X matches to the left of this position. For example, (?<=19)99 matches a 99 that is preceded by 19, without including the preceding 19 in the match.
(?<!X)  X, via zero-width negative lookbehind.
(?>X)   X, as an independent, non-capturing group (no backtracking)
The difference between (?:X) and (?>X) is that (?>X) does not backtrack. For example, take the string abcm:
the expression a(?:b|bc)m matches it, but a(?>b|bc)m does not. Once the group has matched b it is treated as finished, and the engine leaves the group instead of going back to try the other alternative. This can speed up matching.

Note: some readers objected to the last sentence: "That is wrong! abcm can also be matched by a(?>b|bc)!" This holds when the trailing m is left out of the expression: a(?>b|bc) finds ab inside abcm. The failure described above applies to a(?>b|bc)m.
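The disagreement in the note can be settled by running both variants. This is a minimal sketch; AtomicGroupDemo and firstFind are names invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AtomicGroupDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Ordinary group: after "b" fails to be followed by "m", the engine retries with "bc".
        System.out.println(firstFind("a(?:b|bc)m", "abcm")); // abcm
        // Atomic group: once "b" is taken, the alternative "bc" is never tried.
        System.out.println(firstFind("a(?>b|bc)m", "abcm")); // null
        // Without the trailing m nothing forces a retry, so the atomic version matches "ab".
        System.out.println(firstFind("a(?>b|bc)", "abcm"));  // ab
        // Listing the longer alternative first makes the atomic group succeed.
        System.out.println(firstFind("a(?>bc|b)m", "abcm")); // abcm
    }
}
```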

Introduction: a regular expression uses one "string" to describe a feature, and then checks whether another "string" has that feature. For example, the expression "ab+" describes the feature "one 'a' followed by one or more 'b'", so 'ab', 'abb', and 'abbbbbbbbbb' all have this feature.

Regular expressions can be used to: (1) verify that a string has a specified feature, for example that it is a legal email address; (2) search a long text for strings that have a specified feature, which is more flexible and convenient than searching for a fixed string; (3) replace text, which is more powerful than ordinary replacement.

Regular expressions are actually quite simple to learn, and the few more abstract concepts are easy to understand. Many people find them complicated partly because most documentation does not explain things step by step from the simple to the difficult and pays no attention to the order in which concepts are introduced, which makes them harder to follow; and partly because the documentation that comes with each engine generally describes that engine's unique features, which are not what we need to understand first.

Every example in the original article links to a test page where it can be tried out. Without further ado, let's begin.
________________________________________
1. Regular expression rules
1.1 Ordinary characters
Letters, digits, Chinese characters, underscores, and punctuation that has no special definition in the later sections are all "ordinary characters". An ordinary character in an expression matches the identical character in the string being matched.

Example 1: the expression "c", matched against the string "abcde". Result: success; matched content: "c"; matched position: starts at 2, ends at 3. (Note: whether indexes start from 0 or from 1 may differ between programming languages.)

Example 2: the expression "bcd", matched against the string "abcde". Result: success; matched content: "bcd"; matched position: starts at 1, ends at 4.
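In Java, the "matched content" and "matched position" reported in these examples correspond to Matcher.group(), Matcher.start(), and Matcher.end(), with a 0-based start and an exclusive end. A minimal sketch, where FindPositionDemo and first are illustrative names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindPositionDemo {
    // Returns "content@start-end" for the first match, or null when nothing matches.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() + "@" + m.start() + "-" + m.end() : null;
    }

    public static void main(String[] args) {
        System.out.println(first("c", "abcde"));   // c@2-3
        System.out.println(first("bcd", "abcde")); // bcd@1-4
        System.out.println(first("xyz", "abcde")); // null
    }
}
```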
________________________________________
1.2 Simple escape characters
Some characters are inconvenient to write directly, so they are written with a leading "\". We already know these characters.
Expression   Matches
\r, \n       Carriage return and line feed
\t           Tab character
\\           The "\" character itself
Punctuation that has a special use in later sections stands for the symbol itself once a "\" is put in front of it. For example, ^ and $ have special meanings; to match the "^" and "$" characters in a string, the expression must be written as "\^" and "\$".
Expression   Matches
\^           The ^ symbol itself
\$           The $ symbol itself
\.           The decimal point (.) itself
These escaped characters match in the same way as "ordinary characters": each matches the one identical character.

Example 1: the expression "\$d", matched against the string "abc$de". Result: success; matched content: "$d"; matched position: starts at 3, ends at 5.
________________________________________
1.3 Expressions that match 'any of several characters'
Some expressions in a regular expression can match any one of several characters. For example, the expression "\d" can match any digit. Although it can match any one of those characters, it matches only one, not several. This is like the jokers in a card game: a joker can stand in for any card, but only for one card.
Expression   Matches
\d           Any one digit, 0 to 9
\w           Any one letter, digit, or underscore, that is, any one of A~Z, a~z, 0~9, _
\s           Any one whitespace character, including space, tab, and form feed
.            The decimal point matches any one character except the newline (\n)

Example 1: the expression "\d\d", matched against "abc123". Result: success; matched content: "12"; matched position: starts at 3, ends at 5.

Example 2: the expression "a.\d", matched against "aaa100". Result: success; matched content: "aa1"; matched position: starts at 1, ends at 4.
________________________________________
1.4 Custom expressions that match 'any of several characters'
An expression that encloses a series of characters in square brackets [ ] matches any one of those characters. One that encloses them in [^ ] matches any one character other than those. As before, although it can match any one of them, it matches only one, not several.
Expression   Matches
[ab5@]       "a" or "b" or "5" or "@"
[^abc]       Any one character other than "a", "b", "c"
[f-k]        Any one letter between "f" and "k"
[^A-F0-3]    Any one character other than "A"~"F" and "0"~"3"

Example 1: the expression "[bcd][bcd]", matched against "abc123". Result: success; matched content: "bc"; matched position: starts at 1, ends at 3.

Example 2: the expression "[^abc]", matched against "abc123". Result: success; matched content: "1"; matched position: starts at 3, ends at 4.
________________________________________
1.5 Special symbols that modify the number of matches
The expressions described in the previous sections, whether they match only one specific character or any one of several characters, match only once. If a special symbol that modifies the number of matches is added to an expression, the match can be repeated without writing the expression out repeatedly.

Usage: the quantifier is placed after the expression it modifies. For example, "[bcd][bcd]" can be written as "[bcd]{2}".
Expression   Function
{n}          Repeat exactly n times, e.g. "\w{2}" is equivalent to "\w\w"; "a{5}" is equivalent to "aaaaa"
{m,n}        Repeat at least m times and at most n times, e.g. "ba{1,3}" matches "ba", "baa", or "baaa"
{m,}         Repeat at least m times, e.g. "\w\d{2,}" matches "a12", "_456", "M12344" ...
?            Match the expression 0 times or 1 time, equivalent to {0,1}, e.g. "a[cd]?" matches "a", "ac", "ad"
+            Match the expression at least once, equivalent to {1,}, e.g. "a+b" matches "ab", "aab", "aaab" ...
*            Match the expression zero or any number of times, equivalent to {0,}, e.g. "\^*b" matches "b", "^^^b" ...

Example 1: the expression "\d+\.?\d*", matched against "It costs $12.5". Result: success; matched content: "12.5"; matched position: starts at 10, ends at 14.

Example 2: the expression "go{2,8}gle", matched against "Ads by goooooogle". Result: success; matched content: "goooooogle"; matched position: starts at 7, ends at 17.
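The two quantifier examples can be reproduced in Java as a quick check. This is a sketch with illustrative names (QuantifierDemo, first); positions use Java's 0-based start and exclusive end, which is the same convention as the examples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns "content@start-end" for the first match, or null when nothing matches.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() + "@" + m.start() + "-" + m.end() : null;
    }

    public static void main(String[] args) {
        System.out.println(first("\\d+\\.?\\d*", "It costs $12.5")); // 12.5@10-14
        System.out.println(first("go{2,8}gle", "Ads by goooooogle")); // goooooogle@7-17
        // {m,n} bounds the repetition: one to three a's after b.
        System.out.println(Pattern.matches("ba{1,3}", "baaa"));  // true
        System.out.println(Pattern.matches("ba{1,3}", "baaaa")); // false
    }
}
```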
________________________________________
1.6 Other special symbols with abstract meanings
Some symbols stand for abstract, special meanings in an expression:
Expression   Function
^            Matches at the position where the string begins; matches no character
$            Matches at the position where the string ends; matches no character
\b           Matches a word boundary, that is, the position between a word and a space; matches no character
Further description in words remains rather abstract, so the following examples are given to help understanding.

Example 1: the expression "^aaa", matched against "xxx aaa xxx". Result: failure. Because "^" requires a match at the beginning of the string, "^aaa" can match only when "aaa" is at the beginning of the string, for example "aaa xxx xxx".

Example 2: the expression "aaa$", matched against "xxx aaa xxx". Result: failure. Because "$" requires a match at the end of the string, "aaa$" can match only when "aaa" is at the end of the string, for example "xxx xxx aaa".

Example 3: the expression ".\b.", matched against "@@@abc". Result: success; matched content: "@a"; matched position: starts at 2, ends at 4.
Further explanation: "\b" is similar to "^" and "$" in that it matches no character itself, but it requires that, at its position in the match result, the character on one side is in the "\w" range and the character on the other side is not.

Example 4: the expression "\bend\b", matched against "weekend,endfor,end". Result: success; matched content: "end"; matched position: starts at 15, ends at 18.
Some symbols affect the relationship between the subexpressions inside an expression:
Expression   Function
|            An "or" relationship between the expressions on its left and right: matches the left or the right
( )          (1) When the number of matches is modified, the expression in the parentheses can be modified as a whole
             (2) When the match result is retrieved, the content matched by the expression in the parentheses can be obtained separately

Example 5: the expression "Tom|Jack", matched against the string "I'm Tom, he is Jack". Result: success; matched content: "Tom"; matched position: starts at 4, ends at 7. Matching again for the next one: success; matched content: "Jack"; matched position: starts at 15, ends at 19.

Example 6: the expression "(go\s*)+", matched against "Let's go go go!". Result: success; matched content: "go go go"; matched position: starts at 6, ends at 14.

Example 7: the expression "¥(\d+\.?\d*)", matched against "$10.9,¥20.5". Result: success; matched content: "¥20.5"; matched position: starts at 6, ends at 11. The content matched inside the parentheses, obtained separately, is "20.5".
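Examples 5 to 7 map onto Matcher.find() called repeatedly and Matcher.group(1) for the parenthesized part. A minimal sketch follows; GroupingDemo, findAll, and firstWithGroup1 are names made up for the illustration, and the ¥ sign is written as \u00A5 only to keep the source file ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingDemo {
    // Collects every match of regex in input ("match again for the next one").
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Returns {whole match, content of group 1} for the first match, or null.
    // The regex must contain at least one capturing group.
    static String[] firstWithGroup1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new String[] { m.group(), m.group(1) } : null;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Tom|Jack", "I'm Tom, he is Jack")); // [Tom, Jack]

        String[] go = firstWithGroup1("(go\\s*)+", "Let's go go go!");
        System.out.println(go[0]); // go go go
        System.out.println(go[1]); // go  (a repeated group keeps its last repetition)

        String[] yen = firstWithGroup1("\u00A5(\\d+\\.?\\d*)", "$10.9,\u00A520.5");
        System.out.println(yen[1]); // 20.5
    }
}
```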
________________________________________
2. Some advanced rules of regular expressions
2.1 Greedy and non-greedy matching
Among the special symbols that modify the number of matches, several allow the same expression to match a varying number of times, such as "{m,n}", "{m,}", "?", "*", "+"; the actual number of matches depends on the string being matched. Expressions with such an indeterminate repeat count always match as many times as possible during matching. For example, for the text "dxxxdxxxd":
Expression    Match result
(d)(\w+)      "\w+" matches all the characters after the first "d": "xxxdxxxd"
(d)(\w+)(d)   "\w+" matches all the characters between the first "d" and the last "d": "xxxdxxx". Although "\w+" could also match the last "d", it "gives up" that last "d" so that the whole expression can match successfully
So "\w+" always matches as many characters that fit its rule as it can. In the second example it does not match the last "d", but only so that the whole expression can match successfully. Likewise, expressions with "*" and "{m,n}" match as much as possible, and an expression with "?" will, when it could either match or not match, "match" whenever possible. This matching principle is called "greedy" mode.
Non-greedy mode:

Adding a "?" after a special symbol that modifies the number of matches makes the expression with an indeterminate repeat count match as few times as possible, so that an expression that could either match or not match will "not match" whenever possible. This matching principle is called "non-greedy" mode, also called "reluctant" mode. If matching fewer times would make the whole expression fail, then, similarly to greedy mode, non-greedy mode matches the minimum additional amount needed for the whole expression to succeed. For example, for the text "dxxxdxxxd":
Expression     Match result
(d)(\w+?)      "\w+?" matches as few characters after the first "d" as possible; the result is that "\w+?" matches only one "x"
(d)(\w+?)(d)   For the whole expression to match successfully, "\w+?" has to match "xxx" so that the following "d" can match. The result is therefore that "\w+?" matches "xxx"
More examples follow:

Example 1: the expression "<td>(.*)</td>" matched against the string "<td><p>aa</p></td> <td><p>bb</p></td>". Result: success; the matched content is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>", because the "</td>" in the expression matches the last "</td>" in the string.

Example 2: by contrast, when the expression "<td>(.*?)</td>" is matched against the same string as in example 1, it obtains only "<td><p>aa</p></td>"; matching again for the next one obtains the second "<td><p>bb</p></td>".
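The greedy and non-greedy results above can be compared directly. The sketch below uses hypothetical names (GreedyVsReluctantDemo, findAll, group); only the patterns and texts come from the examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Returns capturing group n of the first match, or null when nothing matches.
    static String group(String regex, String input, int n) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String text = "dxxxdxxxd";
        System.out.println(group("(d)(\\w+)", text, 2));      // xxxdxxxd
        System.out.println(group("(d)(\\w+)(d)", text, 2));   // xxxdxxx
        System.out.println(group("(d)(\\w+?)", text, 2));     // x
        System.out.println(group("(d)(\\w+?)(d)", text, 2));  // xxx

        String html = "<td><p>aa</p></td> <td><p>bb</p></td>";
        System.out.println(findAll("<td>(.*)</td>", html).size());  // 1 (the whole string)
        System.out.println(findAll("<td>(.*?)</td>", html));
        // prints [<td><p>aa</p></td>, <td><p>bb</p></td>]
    }
}
```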
________________________________________
2.2 Back references \1, \2 ...
While matching, the expression engine records the string matched by each expression enclosed in parentheses "( )". When the match result is retrieved, the string matched by the expression in the parentheses can be obtained separately. This has already been demonstrated several times in the earlier examples. In practice, when some boundary is used for searching but the content to be obtained does not include the boundary, parentheses must be used to mark the desired range, as in the earlier "<td>(.*?)</td>".

In fact, "the string matched by the expression in the parentheses" can be used not only after matching has finished but also during matching. A later part of the expression can refer to the string already matched by an earlier parenthesized sub-match. The reference is written as "\" plus a number: "\1" refers to the string matched by the first pair of parentheses, "\2" to the string matched by the second pair, and so on. If one pair of parentheses contains another pair, the outer pair is numbered first. In other words, whichever pair has its left parenthesis "(" further to the left gets the lower number.
For example:

Example 1: the expression "('|")(.*?)(\1)" matched against "'Hello', "World"". Result: success; matched content: "'Hello'". Matching again for the next one obtains ""World"".

Example 2: the expression "(\w)\1{4,}" matched against "aa bbbb abcdefg ccccc 111121111 999999999". Result: success; matched content: "ccccc". Matching again for the next one obtains 999999999. This expression requires a character in the "\w" range to be repeated at least 5 times; note the difference from "\w{5,}".

Example 3: the expression "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>" matched against "<td></td>". Result: success. If "<td>" and "</td>" are not a matching pair, the match fails; if both are changed to some other matching pair, the match succeeds again.
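Examples 1 and 2 can be run as follows, including the contrast with "\w{5,}" mentioned in example 2. This is a sketch; BackRefExamplesDemo and findAll are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackRefExamplesDemo {
    // Collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // \1 must repeat whichever quote character group 1 actually matched.
        System.out.println(findAll("('|\")(.*?)(\\1)", "'Hello', \"World\""));
        // prints ['Hello', "World"]

        String text = "aa bbbb abcdefg ccccc 111121111 999999999";
        // (\w)\1{4,}: the SAME character five or more times in a row.
        System.out.println(findAll("(\\w)\\1{4,}", text)); // [ccccc, 999999999]
        // \w{5,}: any five or more word characters, not necessarily identical.
        System.out.println(findAll("\\w{5,}", text));
        // prints [abcdefg, ccccc, 111121111, 999999999]
    }
}
```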
________________________________________
2.3 Pre-search (lookahead), matches nothing; reverse pre-search (lookbehind), matches nothing
The previous chapter covered several special symbols with abstract meanings: "^", "$", "\b". They have one thing in common: they match no character themselves, but attach a condition to "the two ends of the string" or "the gap between characters". With that concept understood, this section introduces another, more flexible way of attaching a condition to "the two ends" or "the gap".
Forward pre-search (lookahead): "(?=xxxxx)", "(?!xxxxx)"

Format "(?=xxxxx)": in the string being matched, the condition it attaches to the "gap" or "end" where it sits is that the right side of the gap must be able to match the expression xxxxx. Because it serves only as a condition on this gap, it does not affect the later part of the expression, which still matches the actual characters after the gap. This is similar to "\b", which matches no character itself: "\b" only examines the characters before and after the gap and does not affect what the later part of the expression actually matches.

Example 1: the expression "Windows (?=NT|XP)" matched against "Windows 98, Windows NT, Windows 2000" matches only the "Windows " in "Windows NT"; the other occurrences of "Windows " are not matched.

Example 2: the expression "(\w)((?=\1\1\1)(\1))+" matched against the string "aaa ffffff 999999999" matches the first 4 of the 6 "f" and the first 7 of the 9 "9". The expression can be read as: a letter or digit repeated 4 or more times, matching the part before its last 2 repetitions. Of course, the expression need not be written this way; it is meant only as a demonstration.
Format "(?!xxxxx)": the right side of the gap must not match the expression xxxxx.

Example 3: the expression "((?!\bstop\b).)+" matched against "fdjka ljfdl stop fjdsla fdj" matches from the beginning up to the position before "stop"; if the string contains no "stop", it matches the whole string.

Example 4: the expression "do(?!\w)" matched against the string "done, do, dog" matches only "do". In this example, using "(?!\w)" after "do" has the same effect as using "\b".
Reverse pre-search (lookbehind): "(?<=xxxxx)", "(?<!xxxxx)"

These two formats are similar in concept to forward pre-search. The condition a reverse pre-search imposes is on the "left side" of the gap: the two formats require, respectively, that the left side must be able to match and must not be able to match the specified expression, instead of examining the right side. As with forward pre-search, they are only a condition attached to the gap and match no character themselves.

Example 5: the expression "(?<=\d{4})\d+(?=\d{4})" matched against "1234567890123456" matches the middle 8 digits, leaving out the first 4 digits and the last 4 digits. Because JScript.RegExp does not support reverse pre-search, this example cannot be demonstrated there. Many other engines do support reverse pre-search, such as the java.util.regex package in Java 1.4 and above, the System.Text.RegularExpressions namespace in .NET, and the DEELX regular expression engine recommended by the original site as the easiest to use.
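Since java.util.regex supports both directions, all five examples can be run in Java. A minimal sketch follows; LookaroundDemo, first, and findAll are names chosen for the illustration, and the positions in the comments use a 0-based start and an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Returns "content@start-end" for the first match, or null when nothing matches.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() + "@" + m.start() + "-" + m.end() : null;
    }

    // Collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Lookahead: only the "Windows " that is followed by NT or XP (the one at index 12).
        System.out.println(first("Windows (?=NT|XP)", "Windows 98, Windows NT, Windows 2000"));
        // Each repetition is allowed only while three more identical characters follow.
        System.out.println(findAll("(\\w)((?=\\1\\1\\1)(\\1))+", "aaa ffffff 999999999"));
        // prints [ffff, 9999999]
        // Negative lookahead: consume characters until the word "stop" begins.
        System.out.println(first("((?!\\bstop\\b).)+", "fdjka ljfdl stop fjdsla fdj"));
        System.out.println(first("do(?!\\w)", "done, do, dog"));  // do@6-8
        // Lookbehind plus lookahead: digits with 4 digits before and 4 digits after.
        System.out.println(first("(?<=\\d{4})\\d+(?=\\d{4})", "1234567890123456")); // 56789012@4-12
    }
}
```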
________________________________________
3. Other general rules
Some further rules are common to the various regular expression engines and were not mentioned in the explanations above.
3.1 In an expression, "\xXX" and "\uXXXX" can be used to denote a character ("X" stands for a hexadecimal digit)
Form     Character range
\xXX     Characters numbered 0 to 255, e.g. a space can be written as "\x20"
\uXXXX   Any character can be written as "\u" followed by the four hexadecimal digits of its code, e.g. "\u4E2D"

3.2 The expressions "\s", "\d", "\w", "\b" have special meanings, and the corresponding uppercase letters have the opposite meanings
Expression   Matches
\S           Any non-whitespace character ("\s" matches every whitespace character)
\D           Any non-digit character
\W           Any character other than letters, digits, and underscore
\B           A non-word boundary, that is, a gap where the characters on both sides are in the "\w" range or both are outside the "\w" range

3.3 Summary of the characters that have a special meaning in expressions and need a "\" in front to match the character itself
Character   Description
^           Matches the beginning of the string. To match the "^" character itself, use "\^"
$           Matches the end of the string. To match the "$" character itself, use "\$"
( )         Mark the start and end of a subexpression. To match parentheses, use "\(" and "\)"
[ ]         Define an expression that matches 'any of several characters'. To match square brackets, use "\[" and "\]"
{ }         Symbols that modify the number of matches. To match braces, use "\{" and "\}"
.           Matches any one character except the newline (\n). To match the decimal point itself, use "\."
?           Modifies the number of matches to 0 or 1. To match the "?" character itself, use "\?"
+           Modifies the number of matches to at least 1. To match the "+" character itself, use "\+"
*           Modifies the number of matches to 0 or any number. To match the "*" character itself, use "\*"
|           An "or" relationship between the expressions on its left and right. To match "|" itself, use "\|"
3.4 If the match result of a subexpression in parentheses "( )" should not be recorded for later use, the "(?:xxxxx)" format can be used. Example 1: the expression "(?:(\w)\1)+" matched against "a bbccdd efg" gives the result "bbccdd". The match of the "(?:)" parentheses is not recorded, so "(\w)" is the group referred to by "\1".
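The numbering effect of "(?:xxxxx)" can be observed through Matcher.groupCount(). The sketch below uses hypothetical names (NonCapturingDemo, first, groupCount) and the example from 3.4.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Returns "content@start-end" for the first match, or null when nothing matches.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() + "@" + m.start() + "-" + m.end() : null;
    }

    // Number of capturing groups the pattern defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // The outer (?: ) is not counted, so (\w) is group 1 and \1 refers to it.
        System.out.println(groupCount("(?:(\\w)\\1)+"));            // 1
        System.out.println(groupCount("((\\w)\\2)+"));              // 2 (capturing outer group)
        System.out.println(first("(?:(\\w)\\1)+", "a bbccdd efg")); // bbccdd@2-8
    }
}
```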
3.5 Overview of common expression attributes: Ignorecase, Singleline, Multiline, Global
Attribute    Description
Ignorecase   By default, letters in an expression are case-sensitive. Setting Ignorecase makes matching case-insensitive. Some expression engines extend the notion of "case" to the whole UNICODE range.
Singleline   By default, the decimal point "." matches any character except the newline (\n). Setting Singleline makes the decimal point match every character, including the newline.
Multiline    By default, "^" and "$" match only the beginning ① and the end ④ of the string. For example:

① xxxxxxxxx ② \n
③ xxxxxxxxx ④

Setting Multiline makes "^" match ① and also the position ③ after a newline, before the start of the next line; and makes "$" match ④ and also the position ② before a newline, at the end of a line.
Global       Mainly takes effect when the expression is used for replacement; setting Global means that all matches are replaced.
________________________________________
4. Other tips
4.1 If you want to know which more complex syntax the advanced regex engines also support, see the DEELX regex engine documentation on this site.
4.2 If the expression is required to match the entire string, rather than to find a part of it, put "^" and "$" at the beginning and end of the expression. For example, "^\d+$" requires the whole string to consist only of digits.
4.3 If the matched content must be a complete word, not part of a word, put "\b" at both ends of the expression. For example, use "\b(if|while|else|void|int……)\b" to match program keywords.
4.4 Do not let the expression match the empty string. Otherwise the match always succeeds but the result contains nothing. For example, suppose you want an expression that matches "123", "123.", "123.5" and ".5". The integer part, the decimal point and the fraction can each be omitted, but do not write the expression as "\d*\.?\d*", because it also matches successfully when nothing is there. A better way to write it is "\d+\.?\d*|\.\d+".
4.5 Do not repeat without limit a sub-pattern that can match the empty string. If every part of the sub-expression inside a pair of parentheses can match 0 times, and the parenthesized group as a whole can repeat an unlimited number of times, the situation is even more serious than the one in the previous tip: the matching process may loop forever. Some regex engines, such as .NET regular expressions, already avoid this endless loop, but we should still try to avoid the situation. If an expression we write runs into an endless loop, this is a good place to start looking for the cause.
4.6 Choose sensibly between greedy and non-greedy mode; see the discussion of that topic.
4.7 For the alternatives on the left and right of "|", it is best that only one side can match a given string, so that swapping the two sides of "|" does not change the result.
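Tips 4.2 to 4.4 are easy to try out with java.util.regex. The following is a minimal sketch; the class and method names are invented for the example, and the keyword list is shortened to the five words named in tip 4.3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTipsDemo {
    // Tip 4.2: ^ and $ require the digits to span the whole string.
    static boolean isAllDigits(String s) {
        return Pattern.compile("^\\d+$").matcher(s).find();
    }

    // Tip 4.3: \b keeps the "int" inside "printf" from being reported.
    static List<String> keywords(String code) {
        Matcher m = Pattern.compile("\\b(if|while|else|void|int)\\b").matcher(code);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Tip 4.4: the loose pattern accepts the empty string, the stricter one does not.
    static boolean looseNumber(String s) {
        return s.matches("\\d*\\.?\\d*");
    }

    static boolean strictNumber(String s) {
        return s.matches("\\d+\\.?\\d*|\\.\\d+");
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("2010"));          // true
        System.out.println(isAllDigits("20a10"));         // false
        System.out.println(keywords("int printf(void)")); // [int, void]
        System.out.println(looseNumber(""));              // true
        System.out.println(strictNumber(""));             // false
        System.out.println(strictNumber(".5"));           // true
    }
}
```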
________________________________________
5. Advanced topics and practice
With the basics from this article mastered, we can strengthen our regular expression skills further through practice.
5.1 Download the chm version of the regular expression documentation
[Click to download the chm version] - DEELX regex syntax, including other advanced syntax, in chm format.

5.2 Download the regex tool Regex Match Tracer 2.0 Beta (the licensed version is worth buying)
[Download Match Tracer] - 471kb

5.3 Free Regex Match Tracer Web Edition
[Use the Match Tracer Web version]
The Web version of the tool is free to use; the main Regex Match Tracer program has a limited trial period.

5.4 More in-depth topics and use cases
[Discussion on recursive matching] - discusses how to match nested structures with a regex engine that does not support recursion
[Questions for this site] - exchange and discussion with the site owner
[Page Script] - the "close Highlight" function of this page, implemented with JavaScript regular expressions.

For example, the expression: (a+b|[cd])$

If you have ever used Perl, or any other language with built-in regular expression support, you know how easy it is to process text and match patterns with regular expressions. If you are not familiar with the term, a "regular expression" (Regular Expression) is a string of characters that defines a pattern used to search for matching strings.
Many languages, including Perl, PHP, Python, JavaScript and JScript, support processing text with regular expressions, and some text editors use regular expressions for an advanced "search and replace" feature. What about Java? As of this writing, a Java Specification Request that covers processing text with regular expressions has been approved, and you can expect to see it in the next version of the JDK.
But what should we do if we need regular expressions now? You can download the open source Jakarta-ORO library from Apache.org. The rest of this article first gives a brief introduction to regular expressions and then uses the Jakarta-ORO API as an example of how to use them.
1. Regular expression basics
We begin with something simple. Suppose you want to search for a string containing the characters "cat"; the regular expression to search with is "cat". If the search is not case sensitive, the words "catalog", "Catherine" and "sophisticated" all match. That is:

1.1 The dot symbol
Suppose you are playing an English word game and want to find three-letter words that begin with the letter "t" and end with the letter "n". Assume also that you have an English dictionary whose entire contents you can search with a regular expression. To construct the regular expression you can use a wildcard, the dot symbol ".". The complete expression is then "t.n"; it matches "tan", "ten", "tin" and "ton", and it also matches "t#n", "tpn", even "t n", and many other meaningless combinations. This is because the dot matches any single character, including a space or a Tab character (and a line break as well when single-line mode is enabled):
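A quick way to see what the dot accepts is to test a few candidate strings against "t.n". The sketch below uses String.matches() from the JDK, not Jakarta-ORO; the class name is made up for the example.

```java
class DotWildcardDemo {
    // True when the whole word is "t", any single character, then "n".
    static boolean matchesTDotN(String word) {
        return word.matches("t.n");
    }

    public static void main(String[] args) {
        System.out.println(matchesTDotN("tan"));   // true
        System.out.println(matchesTDotN("t#n"));   // true
        System.out.println(matchesTDotN("t n"));   // true
        System.out.println(matchesTDotN("tn"));    // false: the dot needs one character
        System.out.println(matchesTDotN("toon"));  // false: the dot matches only one character
        System.out.println(matchesTDotN("t\nn"));  // false: without DOTALL the dot skips newlines
    }
}
```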

1.2 The square bracket symbols
To deal with the dot matching too broadly, you can list the characters that make sense inside square brackets ("[]"). Only the characters listed inside the brackets then take part in the match. That is, the regular expression "t[aeio]n" matches only "tan", "ten", "tin" and "ton". "toon" does not match, because a pair of square brackets matches only a single character:

1.3 The "or" symbol
If, in addition to all the words matched above, you also want to match "toon", you can use the "|" operator, whose basic meaning is "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". Square brackets cannot be used here, because they allow only a single character to match; parentheses "()" must be used instead. Parentheses can also be used for grouping, as described later.
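The difference between the bracket form and the "or" form shows up as soon as "toon" is tested. A small sketch using the JDK's String.matches(), with illustrative method names:

```java
class BracketsVsAlternationDemo {
    // One character out of a, e, i, o between "t" and "n".
    static boolean bracketForm(String word) {
        return word.matches("t[aeio]n");
    }

    // Alternation can also offer the two-character choice "oo".
    static boolean alternationForm(String word) {
        return word.matches("t(a|e|i|o|oo)n");
    }

    public static void main(String[] args) {
        System.out.println(bracketForm("tin"));      // true
        System.out.println(bracketForm("toon"));     // false
        System.out.println(alternationForm("ton"));  // true
        System.out.println(alternationForm("toon")); // true
        System.out.println(alternationForm("tun"));  // false
    }
}
```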

1.4 Symbols for the number of matches
The symbols that express the number of matches determine how many times the symbol immediately to their left occurs:

Suppose we want to search a text file for U.S. social security numbers. The format of this number is 999-99-9999. The regular expression used to match it is shown in Figure 1. In a regular expression the hyphen ("-") has a special meaning: inside square brackets it denotes a range, such as from 0 to 9. Therefore, when matching the hyphens in a social security number, put the escape character "\" in front of them (outside square brackets the escape is optional but harmless).

Figure 1: Matches all social security numbers of the form 123-12-1234
Suppose that in the search the hyphens may or may not appear, that is, both 999-99-9999 and 999999999 are valid formats. In this case you can add the "?" quantifier after each hyphen, as shown in Figure 2:

Figure 2: Matches all social security numbers of the forms 123-12-1234 and 123121234
Next we look at another example. One American car license plate format is four digits followed by two letters. Its regular expression is the digit part "[0-9]{4}" followed by the letter part "[A-Z]{2}". Figure 3 shows the complete regular expression.

Figure 3: Matches a typical American car license plate number, such as 8836KV
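Figures 1 to 3 are not available in this copy, so the patterns below are reconstructions from the surrounding description rather than the author's exact expressions. They are written for java.util.regex, and the constant names are invented for the example.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Assumed form of Figure 1: three digits, hyphen, two digits, hyphen, four digits.
    static final Pattern SSN = Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");
    // Assumed form of Figure 2: "?" makes each hyphen optional.
    static final Pattern SSN_OPTIONAL_HYPHENS = Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");
    // Assumed form of Figure 3: four digits followed by two capital letters.
    static final Pattern PLATE = Pattern.compile("[0-9]{4}[A-Z]{2}");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(SSN, "123-12-1234"));                  // true
        System.out.println(matches(SSN, "123121234"));                    // false
        System.out.println(matches(SSN_OPTIONAL_HYPHENS, "123121234"));   // true
        System.out.println(matches(SSN_OPTIONAL_HYPHENS, "123-12-1234")); // true
        System.out.println(matches(PLATE, "8836KV"));                     // true
    }
}
```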
1.5 The "not" symbol
The "^" symbol is known as the "not" symbol. When used inside square brackets, "^" indicates the characters you do not want to match. For example, the regular expression in Figure 4 matches all words except those that begin with the letter "X".

Figure 4: Matches all words except those beginning with "X"
1.6 Parentheses and the whitespace symbol
Suppose you want to extract the month from a birthday date in the format "June 26, 1951". The regular expression that matches such a date is shown in Figure 5:

Figure 5: Matches all dates in the Month DD, YYYY format
The new "\s" symbol is the whitespace symbol; it matches all whitespace characters, including the Tab character. If the string matches correctly, how do you then extract the month part? Just add parentheses around the month to create a group, then use the ORO API (discussed in detail later in this article) to extract its value. The modified regular expression is shown in Figure 6:

Figure 6: Matches all dates in the Month DD, YYYY format and defines the month value as the first group
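As a rough illustration of extracting the month through a group, the sketch below uses the JDK's java.util.regex instead of the ORO API. The pattern is an assumption reconstructed from the description of Figure 6, not the author's exact expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateMonthDemo {
    // Group 1 is the month name; \s matches the whitespace between the parts.
    private static final Pattern DATE = Pattern.compile(
            "([a-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}", Pattern.CASE_INSENSITIVE);

    // Returns the month of the first Month DD, YYYY date found, or null.
    static String month(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(month("June 26, 1951"));          // June
        System.out.println(month("Born on June 26, 1951.")); // June
        System.out.println(month("no date here"));           // null
    }
}
```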
1.7 Other symbols
For simplicity, you can use shortcut symbols that have been created for some common regular expressions. They are listed in Table 2:
Table 2: Common symbols

For example, in the earlier social security number example, we can use "\d" wherever "[0-9]" appears. The modified regular expression is shown in Figure 7:

Figure 7: Matches all social security numbers of the form 123-12-1234
2. Jakarta-ORO library
There are many open source regular expression libraries available to Java programmers, and many of them support Perl 5 compatible regular expression syntax. Here I use the Jakarta-ORO regular expression library, which is one of the most comprehensive regular expression APIs and is fully compatible with Perl 5 regular expressions. It is also one of the best-optimized APIs.
The Jakarta-ORO library was formerly called OROMatcher; Daniel Savarese generously donated it to the Jakarta Project. You can download it by following the instructions in the reference resources at the end of the article.
I will first briefly introduce the objects you must create and access when using the Jakarta-ORO library, and then describe how to use the Jakarta-ORO API.
▲ PatternCompiler object
First, create an instance of the Perl5Compiler class and assign it to a PatternCompiler interface object. Perl5Compiler is an implementation of the PatternCompiler interface that lets you compile a regular expression into a Pattern object used for matching.

▲ Pattern object
To compile a regular expression into a Pattern object, call the compiler object's compile() method and pass the regular expression as the argument. For example, you can compile the regular expression "t[aeio]n" in the following way:

By default, the compiler creates a case-sensitive pattern. The pattern compiled by the code above therefore matches only "tin", "tan", "ten" and "ton", but not "Tin" or "taN". To create a case-insensitive pattern, specify an additional parameter when calling the compiler:
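For comparison, the JDK's java.util.regex package (not Jakarta-ORO) follows the same compile step, with the case-insensitive option passed as a second argument. The sketch below is that JDK counterpart only; the constant names are made up for the example.

```java
import java.util.regex.Pattern;

class CompileFlagsDemo {
    // Default compilation is case sensitive.
    static final Pattern CASE_SENSITIVE = Pattern.compile("t[aeio]n");
    // The extra flag makes the same expression ignore case.
    static final Pattern CASE_INSENSITIVE =
            Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        System.out.println(CASE_SENSITIVE.matcher("tin").matches());   // true
        System.out.println(CASE_SENSITIVE.matcher("Tin").matches());   // false
        System.out.println(CASE_INSENSITIVE.matcher("Tin").matches()); // true
        System.out.println(CASE_INSENSITIVE.matcher("taN").matches()); // true
    }
}
```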

Once the Pattern object is created, you can use it for pattern matching through the PatternMatcher class.
▲ PatternMatcher object
The PatternMatcher object checks a string for matches against a Pattern object. You instantiate the Perl5Matcher class and assign the result to the PatternMatcher interface. The Perl5Matcher class is an implementation of the PatternMatcher interface that performs pattern matching according to Perl 5 regular expression syntax:

With a PatternMatcher object you can perform matching with several methods; the first parameter of each is the string to be matched against the regular expression:
• boolean matches(String input, Pattern pattern): used when the input string and the regular expression must match exactly. In other words, the regular expression must describe the input string completely.
• boolean matchesPrefix(String input, Pattern pattern): used when the regular expression must match the beginning of the input string.
• boolean contains(String input, Pattern pattern): used when the regular expression must match part of the input string (that is, a substring).
In addition, all three methods also accept a PatternMatcherInput object in place of the String parameter; you can then resume matching from the position where the last match ended. Using a PatternMatcherInput object as the parameter is very useful when the string may contain several substrings that match the given regular expression. With PatternMatcherInput in place of String, the three methods have the following syntax:
• boolean matches(PatternMatcherInput input, Pattern pattern)
• boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
• boolean contains(PatternMatcherInput input, Pattern pattern)
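The three kinds of match have close counterparts in the JDK's java.util.regex.Matcher: matches(), lookingAt() and find(). The sketch below shows those JDK methods rather than the Jakarta-ORO ones, and its repeated find() loop plays roughly the role that PatternMatcherInput plays above. All class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKindsDemo {
    private static final Pattern P = Pattern.compile("t[aeio]n");

    // Whole input must match (compare matches()).
    static boolean whole(String s) {
        return P.matcher(s).matches();
    }

    // Input must start with a match (compare matchesPrefix()).
    static boolean prefix(String s) {
        return P.matcher(s).lookingAt();
    }

    // A match may occur anywhere (compare contains()).
    static boolean anywhere(String s) {
        return P.matcher(s).find();
    }

    // Each find() resumes where the previous match ended.
    static List<String> all(String s) {
        Matcher m = P.matcher(s);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(whole("tank"));      // false
        System.out.println(prefix("tank"));     // true
        System.out.println(prefix("stank"));    // false
        System.out.println(anywhere("stank"));  // true
        System.out.println(all("tan ten tin")); // [tan, ten, tin]
    }
}
```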
3. Application examples
Let us look at some examples that use the Jakarta-ORO library.
3.1 Log file processing
Task: analyze a Web server log file to determine how much time each user spends on the site. In a typical BEA WebLogic log file, a log record has the following format:

Analyzing this log record shows that two items need to be extracted from the log file: the IP address and the page access time. You can use grouping symbols (parentheses) to extract the IP address and the timestamp from the log record.
First we look at the IP address. An IP address consists of 4 bytes, each with a value from 0 to 255 and separated from the next by a period. Each byte in an IP address therefore has at least one and at most three digits. Figure 8 shows a regular expression written for IP addresses:

Figure 8: Matches an IP address
The period characters in the IP address must be escaped (preceded by "\"), because the periods in an IP address have their literal meaning rather than the special meaning they carry in regular expression syntax. The special meaning of the period in regular expressions was introduced earlier in this article.
The time part of the log record is surrounded by a pair of square brackets. You can extract everything inside the square brackets along the following lines: first search for the opening bracket character ("["), then extract everything that is not the closing bracket character ("]"), looking forward until the closing bracket is found. Figure 9 shows this part of the regular expression.

Figure 9: Matches at least one character, until "]" is found
Now wrap each of these two regular expressions in grouping symbols (parentheses) and merge them into a single regular expression, so that you can extract the IP address and the time from the log record. Note that "\s-\s-\s" is added in the middle of the regular expression in order to match "- -" (without extracting it). The complete regular expression is shown in Figure 10.

Figure 10: Matches the IP address and the timestamp
The regular expression is now complete, so you can write the Java code that uses the regular expression library.
To use the Jakarta-ORO library, first create the regular expression string and the log record string to be analyzed:

The regular expression used here is almost identical to the one in Figure 10, with one exception: in Java, every backslash ("\") must itself be escaped. Figure 10 is not in Java notation, so we must add a "\" in front of each "\" to avoid a compile error. Unfortunately, this escaping is error-prone and needs care. You can first type the regular expression without escaping, and then replace each "\" with "\\" from left to right. If you want to double-check, you can try printing it to the screen.
After initializing the strings, instantiate a PatternCompiler object and use it to compile the regular expression into a Pattern object:

Now create a PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match:

Next, use the MatchResult object returned by the PatternMatcher interface to print the matched groups. Because the logEntry string contains a match, you will see output similar to the following:
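The steps of this example can be sketched with the JDK's java.util.regex in place of Jakarta-ORO. Both the log record and the regular expression below are assumptions: the record is a made-up line in the common access-log style, and the expression is rebuilt from the descriptions of Figures 8 to 10, with every backslash doubled for the Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogEntryDemo {
    // Hypothetical log record, used only for illustration.
    static final String LOG_ENTRY =
            "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] \"GET /IsAlive.htm HTTP/1.0\" 200 15";

    // Group 1: four dot-separated numbers of 1 to 3 digits. Group 2: text inside [ ].
    private static final Pattern ENTRY = Pattern.compile(
            "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})\\s-\\s-\\s\\[([^\\]]+)\\]");

    // Returns {ip, timestamp}, or null when the record does not match.
    static String[] ipAndTime(String entry) {
        Matcher m = ENTRY.matcher(entry);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = ipAndTime(LOG_ENTRY);
        System.out.println("IP: " + parts[0]);        // IP: 172.26.155.241
        System.out.println("Timestamp: " + parts[1]); // Timestamp: 26/Feb/2001:10:56:03 -0500
    }
}
```

Printing the pattern with System.out.println(ENTRY.pattern()) shows the expression with single backslashes, which is a convenient way to double-check the escaping.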

3.2 HTML processing example 1
The next task is to analyze all the attributes inside the FONT tags of an HTML page. A typical FONT tag in an HTML page looks like this:

The program prints each attribute of a FONT tag in the following form:

In this case I suggest using two regular expressions. The first, shown in Figure 11, extracts "face="Arial, Serif" size="+2" color="red"" from the font tag.

Figure 11: Matches all the attributes of a FONT tag
The second regular expression, shown in Figure 12, splits the attributes into name-value pairs.

Figure 12: Matches a single attribute and splits it into a name-value pair
The result of the split is:

Now let us look at the Java code that completes this task. First create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling the regular expressions, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is not case sensitive.
Next, create a Perl5Matcher object to perform the matching.

Suppose there is a String variable html that represents one line of an HTML file. If the html string contains a FONT tag, the matcher returns true. You can then obtain the first group, which contains all the FONT attributes, from the MatchResult object returned by the matcher:

Next, create a PatternMatcherInput object. This object lets matching continue from the position where the last match ended, so it is well suited to extracting the name-value pairs of the FONT tag attributes. Create the PatternMatcherInput object by passing the string to be matched as a parameter. Then use the matcher to extract each FONT attribute in turn. This is done by repeatedly calling the PatternMatcher object's contains() method with the PatternMatcherInput object (rather than a String object) as the parameter. On each iteration the PatternMatcherInput object moves its internal pointer forward, so the next test starts just after the previous match.
The output in this case is as follows:
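The two-expression approach can be sketched with the JDK's java.util.regex in place of Jakarta-ORO. The two patterns are assumptions rebuilt from the descriptions of Figures 11 and 12, and the class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontAttributesDemo {
    // First expression: group 1 holds everything between "<font" and ">".
    private static final Pattern FONT_TAG = Pattern.compile(
            "<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    // Second expression: one name="value" pair; group 1 is the name, group 2 the value.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attributes of the first FONT tag in document order.
    static Map<String, String> attributes(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) { // each find() continues after the previous pair
                result.put(attr.group(1), attr.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        for (Map.Entry<String, String> e : attributes(html).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // face -> Arial, Serif
        // size -> +2
        // color -> red
    }
}
```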

3.3 HTML processing example 2
Let us look at another example of processing HTML. This time, we assume that the Web server has moved from widgets.acme.com to newserver.acme.com. You now have to modify the links in some pages:

The regular expression that performs this search is shown in Figure 13:

Figure 13: Matches the link before modification
If the regular expression matches, you can replace the link in Figure 13 with the following:

Note that the # character is followed by $1. Perl regular expression syntax uses $1, $2 and so on to refer to groups that have been matched and extracted. The expression in Figure 13 matches and extracts as a group everything attached to the end of the link.
Now, back to Java. As before, you must create the test string, create the objects needed to compile the regular expression into a Pattern object, and create a PatternMatcher object:

Next, use the static substitute() method of the Util class in the com.oroinc.text.regex package to perform the replacement and print the resulting string:

The syntax of the Util.substitute() method call is as follows:

The first two parameters of the call are the PatternMatcher and Pattern objects created earlier. The third parameter is a Substitution object, which determines how the replacement is carried out. This example uses a Perl5Substitution object, which performs Perl 5 style replacement. The fourth parameter is the string on which to perform the replacement, and the last parameter specifies whether to replace all substrings that match the pattern (Util.SUBSTITUTE_ALL) or only a specified number of them.
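In the JDK's java.util.regex the same kind of replacement is done with Matcher.replaceAll(), which also understands $1 in the replacement text. The sketch below is that JDK counterpart, not Util.substitute(); the page name interface.html and the anchor name are hypothetical, and the pattern is an assumption based on the description of Figure 13.

```java
import java.util.regex.Pattern;

class LinkRewriteDemo {
    // Group 1 captures the text after "#" in links to the old server.
    private static final Pattern OLD_LINK = Pattern.compile(
            "<a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\"\\s*>",
            Pattern.CASE_INSENSITIVE);

    // Rewrites every old link; $1 re-inserts the captured anchor name.
    static String rewrite(String html) {
        return OLD_LINK.matcher(html)
                .replaceAll("<a href=\"http://newserver.acme.com/interface.html#$1\">");
    }

    public static void main(String[] args) {
        String link = "<a href=\"http://widgets.acme.com/interface.html#How_To_Buy\">";
        System.out.println(rewrite(link));
        // <a href="http://newserver.acme.com/interface.html#How_To_Buy">
    }
}
```

Using replaceFirst() instead of replaceAll() would replace only the first matching link.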
[Conclusion] In this article I introduced the power of regular expressions. Used correctly, regular expressions can play a significant role in string extraction and text modification. I also described how to use regular expressions in Java programs through the Jakarta-ORO library. Whether to use the old-fashioned string handling approach (StringTokenizer, charAt and substring) or regular expressions is ultimately for you to decide.
