Java Regular Expressions Explained in Detail (Reprinted)

2010-08-23  Source: site original  Category: Java  Views: 192

1. Regular expression basics
Let us start with a simple case. Suppose you want to search for strings that contain the characters "cat"; the regular expression for this search is "cat". If the search is not case sensitive, the words "catalog", "Catherine", and "sophisticated" all match.
1.1 The dot symbol
Suppose you are playing an English spelling game and want to find all three-letter words that begin with the letter "t" and end with the letter "n". Suppose also that you have an English dictionary whose entire content you can search with regular expressions. To construct the regular expression, you can use a wildcard, the dot symbol ".". The complete expression is "t.n". It matches "tan", "ten", "tin", and "ton", but it also matches "t#n", "tpn", "t n", and many other meaningless combinations. This is because the dot matches any single character, including spaces and tabs (and, when the engine's single-line mode is enabled, line breaks).
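As a minimal sketch of the dot wildcard, the following uses the JDK's java.util.regex package rather than the Jakarta-ORO library discussed later; the class name DotDemo and the sample words are chosen only for illustration.

```java
import java.util.regex.Pattern;

class DotDemo {
    // "." stands for any single character between "t" and "n".
    static final Pattern T_ANY_N = Pattern.compile("t.n");

    // True when the whole word fits the pattern.
    static boolean isMatch(String word) {
        return T_ANY_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"tan", "t#n", "t n", "toon"}) {
            System.out.println(w + " -> " + isMatch(w));
        }
    }
}
```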
1.2 Square brackets
To narrow the overly broad matching of the dot symbol, you can list the characters of interest inside square brackets ("[]"). Only the characters specified inside the brackets take part in the match. For example, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". It does not match "toon", because a bracket expression matches exactly one character.
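The bracket expression can be tried out with a short sketch. It again relies on java.util.regex instead of Jakarta-ORO, and BracketDemo is a hypothetical class name.

```java
import java.util.regex.Pattern;

class BracketDemo {
    // Exactly one of a, e, i, o is allowed between "t" and "n".
    static final Pattern T_VOWEL_N = Pattern.compile("t[aeio]n");

    static boolean isMatch(String word) {
        return T_VOWEL_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"tan", "ten", "tin", "ton", "toon", "t#n"}) {
            System.out.println(w + " -> " + isMatch(w));
        }
    }
}
```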
1.3 The "or" symbol
If, in addition to the words above, you also want to match "toon", you can use the "|" operator, whose basic meaning is "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". Square brackets cannot be used here, because they match only a single character; you must use parentheses "()" instead. Parentheses can also be used for grouping, as described later.
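A small sketch of the alternation, written against java.util.regex as a stand-in for Jakarta-ORO (the class name OrDemo is an assumption made for this example):

```java
import java.util.regex.Pattern;

class OrDemo {
    // The parentheses group the alternatives; "oo" is a two-character alternative.
    static final Pattern T_ALT_N = Pattern.compile("t(a|e|i|o|oo)n");

    static boolean isMatch(String word) {
        return T_ALT_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"tan", "ton", "toon", "tooon"}) {
            System.out.println(w + " -> " + isMatch(w));
        }
    }
}
```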
1.4 Quantifier symbols
Table 1 lists the quantifier symbols. A quantifier determines how many times the symbol immediately to its left may occur.
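The contents of Table 1 are not reproduced here, so the sketch below simply exercises the standard Perl 5 style quantifiers ("*", "+", "?", "{n}", "{n,m}") with java.util.regex. The class name and sample strings are illustrative assumptions.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True when the whole input fits the regular expression.
    static boolean fits(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fits("ab*", "a"));         // "*": zero or more "b"
        System.out.println(fits("ab+", "a"));         // "+": at least one "b" is required
        System.out.println(fits("ab?", "abb"));       // "?": at most one "b"
        System.out.println(fits("ab{2}", "abb"));     // "{2}": exactly two "b"
        System.out.println(fits("ab{1,3}", "abbbb")); // "{1,3}": one to three "b"
    }
}
```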

Suppose you want to search a text file for U.S. Social Security numbers. The format of this number is 999-99-9999. The regular expression used to match it is shown in Figure 1. In regular expressions the hyphen ("-") has a special meaning: inside square brackets it denotes a range, such as 0 to 9. Therefore, when matching the hyphens in a Social Security number, the expression precedes each of them with the escape character "\".

Figure 1: Matches all Social Security numbers of the form 123-12-1234

Suppose that, in the search, the hyphens may or may not be present; that is, both 999-99-9999 and 999999999 are valid formats. In that case, add the "?" quantifier after each hyphen, as shown in Figure 2.

Figure 2: Matches all Social Security numbers of the form 123-12-1234 or 123121234
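Because the figure images are not available, the two patterns below are reconstructions that follow the description in the text (escaped hyphens, then "?" after each hyphen); they are assumptions, not the original figures. The sketch uses java.util.regex, where each backslash must be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class SsnDemo {
    // Assumed form of Figure 1: both hyphens are required.
    static final Pattern STRICT =
            Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");
    // Assumed form of Figure 2: "?" makes each hyphen optional.
    static final Pattern OPTIONAL_HYPHEN =
            Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean lenient(String s) {
        return OPTIONAL_HYPHEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(strict("123-12-1234"));
        System.out.println(strict("123121234"));
        System.out.println(lenient("123121234"));
    }
}
```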

Let us look at another example. One format for American car license plates is four digits followed by two letters. Its regular expression consists of the digit part "[0-9]{4}" followed by the letter part "[A-Z]{2}". Figure 3 shows the complete regular expression.

Figure 3: Matches a typical American license plate number, such as 8836KV
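Joining the two parts named in the text gives a pattern that can be checked with java.util.regex. This is a sketch only; the class name PlateDemo is hypothetical.

```java
import java.util.regex.Pattern;

class PlateDemo {
    // Four digits followed by two uppercase letters.
    static final Pattern PLATE = Pattern.compile("[0-9]{4}[A-Z]{2}");

    static boolean isPlate(String s) {
        return PLATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlate("8836KV"));
        System.out.println(isPlate("883KV"));
    }
}
```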

1.5 The "not" symbol
The "^" symbol is known as the "not" symbol. When used inside square brackets, "^" indicates the characters you do not want to match. For example, the regular expression in Figure 4 matches all words except those that begin with the letter "X".

Figure 4: Matches all words except those beginning with "X"
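Figure 4 itself is missing, so the pattern "[^X][a-z]+" used below is only one plausible form of it: any first character other than "X", followed by lowercase letters. The sketch uses java.util.regex, and the names are illustrative.

```java
import java.util.regex.Pattern;

class NotDemo {
    // "[^X]" accepts any single character except an uppercase X.
    static final Pattern NOT_X_WORD = Pattern.compile("[^X][a-z]+");

    static boolean isMatch(String word) {
        return NOT_X_WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMatch("Apple"));
        System.out.println(isMatch("Xenon"));
    }
}
```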

1.6 Parentheses and the whitespace symbol
Suppose you want to extract the month from a birth date in the format "June 26, 1951". A regular expression that matches such dates is shown in Figure 5.

Figure 5: Matches all dates in the format Month DD, YYYY

The newly introduced "\s" symbol is the whitespace symbol; it matches any whitespace character, including the Tab character. Once the string matches, how do you extract the month? Simply put parentheses around the month to create a group, then use the ORO API (discussed in detail later in this article) to retrieve its value. The modified regular expression is shown in Figure 6.

Figure 6: Matches all dates in the format Month DD, YYYY and defines the month as the first group
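The grouping idea can be sketched as follows. The pattern is an assumed stand-in for Figure 6 (the figure is not available), and the group is read with java.util.regex's Matcher.group(1) rather than the ORO API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BirthMonthDemo {
    // Assumed pattern: (month name) whitespace, 1-2 digit day, comma, 4-digit year.
    static final Pattern DATE = Pattern.compile(
            "([a-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}", Pattern.CASE_INSENSITIVE);

    // Returns the month (group 1), or null when the date has another format.
    static String month(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(month("June 26, 1951"));
    }
}
```

Group 0 is always the whole match; numbered groups start at 1 and follow the order of the opening parentheses.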

1.7 Other symbols
For convenience, shorthand symbols are available for some commonly used regular expressions. They are listed in Table 2.
Table 2: Common symbols

For example, in the earlier Social Security number example, every occurrence of "[0-9]" can be replaced with "\d". The modified regular expression is shown in Figure 7.

Figure 7: Matches all Social Security numbers of the form 123-12-1234
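To illustrate the shorthand, the sketch below compares a "[0-9]" pattern with its "\d" counterpart in java.util.regex. Both patterns are assumptions modeled on the text, since the figures themselves are not available.

```java
import java.util.regex.Pattern;

class DigitShorthandDemo {
    static final Pattern LONG_FORM =
            Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");
    // Same pattern with "\d" in place of "[0-9]" ("\\d" in a Java literal).
    static final Pattern SHORT_FORM =
            Pattern.compile("\\d{3}\\-\\d{2}\\-\\d{4}");

    // True when both forms give the same verdict for the input.
    static boolean agree(String s) {
        return LONG_FORM.matcher(s).matches() == SHORT_FORM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(SHORT_FORM.matcher("123-12-1234").matches());
        System.out.println(agree("123-12-1234") && agree("12a-12-1234"));
    }
}
```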

2. The Jakarta-ORO library
Many open source regular expression libraries are available to Java programmers, and many of them support Perl 5 compatible regular expression syntax. Here I use the Jakarta-ORO regular expression library, which is one of the most comprehensive regular expression APIs and is compatible with Perl 5 regular expressions. It is also one of the best optimized APIs.
The Jakarta-ORO library was formerly called OROMatcher; Daniel Savarese generously donated it to the Jakarta Project. You can download it by following the instructions in the reference resources at the end of this article.
First I briefly introduce the objects you must create and access when using the Jakarta-ORO library, and then I describe how to use the Jakarta-ORO API.
▲ PatternCompiler object
First, create an instance of the Perl5Compiler class and assign it to a PatternCompiler interface object. Perl5Compiler is an implementation of the PatternCompiler interface; it lets you compile a regular expression into a Pattern object used for matching.
▲ Pattern object
To compile a regular expression into a Pattern object, call the compile() method of the compiler object and pass the regular expression as the argument. For example, passing "t[aeio]n" to compile() yields a Pattern object for that expression.
By default, the compiler creates a case-sensitive pattern. The pattern compiled this way therefore matches only "tin", "tan", "ten", and "ton"; it does not match "Tin" or "taN". To create a case-insensitive pattern, pass an additional parameter when calling the compiler, such as the Perl5Compiler.CASE_INSENSITIVE_MASK option used in section 3.2.
Once the Pattern object has been created, you can use it for pattern matching through a PatternMatcher object.
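The same compile step can be tried without the ORO jar. The sketch below swaps in the JDK's java.util.regex, where Pattern.compile plays the role of the compiler's compile() method and Pattern.CASE_INSENSITIVE is the counterpart of the extra case-insensitivity parameter; it is an analogy, not the Jakarta-ORO code from the article.

```java
import java.util.regex.Pattern;

class CompileDemo {
    // Default compilation is case sensitive.
    static final Pattern SENSITIVE = Pattern.compile("t[aeio]n");
    // An extra flag argument requests case-insensitive matching.
    static final Pattern INSENSITIVE =
            Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        for (String w : new String[] {"tin", "Tin", "taN"}) {
            System.out.println(w
                    + " sensitive=" + SENSITIVE.matcher(w).matches()
                    + " insensitive=" + INSENSITIVE.matcher(w).matches());
        }
    }
}
```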
▲ PatternMatcher object
A PatternMatcher object checks a string for matches against a Pattern object. Instantiate the Perl5Matcher class and assign the result to a PatternMatcher interface object. Perl5Matcher is an implementation of the PatternMatcher interface that performs pattern matching according to Perl 5 regular expression syntax.
With a PatternMatcher object you can perform matching through several methods. The first parameter of each method is the string to be matched against the regular expression:
· boolean matches(String input, Pattern pattern): use this when the input string must match the regular expression exactly. In other words, the regular expression must describe the entire input string.
· boolean matchesPrefix(String input, Pattern pattern): use this when the regular expression must match the beginning of the input string.
· boolean contains(String input, Pattern pattern): use this when the regular expression must match some part of the input string (that is, a substring).
In each of these three method calls you can also pass a PatternMatcherInput object instead of a String. Matching then resumes from the position where the previous match ended, which is very useful when a string may contain several substrings that match the given regular expression. With a PatternMatcherInput parameter in place of the String, the three methods have the following signatures:
· boolean matches(PatternMatcherInput input, Pattern pattern)
· boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
· boolean contains(PatternMatcherInput input, Pattern pattern)
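The three kinds of match listed above have close counterparts in the JDK's java.util.regex, which the following sketch uses in place of Jakarta-ORO: Matcher.matches() for a whole-string match, Matcher.lookingAt() for a prefix match, and Matcher.find() for a substring match. Calling find() repeatedly on one Matcher resumes after the previous match, much as the text describes for PatternMatcherInput. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModesDemo {
    static final Pattern P = Pattern.compile("t[aeio]n");

    // Whole input must fit the pattern (compare matches).
    static boolean whole(String s) { return P.matcher(s).matches(); }

    // Input must start with a match (compare matchesPrefix).
    static boolean prefix(String s) { return P.matcher(s).lookingAt(); }

    // A match may occur anywhere (compare contains).
    static boolean anywhere(String s) { return P.matcher(s).find(); }

    // Each find() continues where the previous match ended.
    static List<String> all(String s) {
        List<String> found = new ArrayList<>();
        Matcher m = P.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(whole("tank") + " " + prefix("tank") + " " + anywhere("a tank"));
        System.out.println(all("tan ten tun ton"));
    }
}
```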
3. Application examples
Let us look at some examples that use the Jakarta-ORO library.
3.1 Log file processing
Task: analyze a Web server log file to determine how much time each user spends on the site. The records follow the typical BEA WebLogic log format: each record starts with the client IP address, followed by "- -" and a timestamp enclosed in square brackets.
Examining such a log record shows that two items need to be extracted from the log file: the IP address and the page access time. You can use grouping symbols (parentheses) to extract the IP address and the timestamp from each record.
First, consider the IP address. An IP address consists of 4 bytes, each with a value from 0 to 255, separated by periods. Each byte therefore has at least one digit and at most three. Figure 8 shows the regular expression written for IP addresses.

Figure 8: Matches an IP address

The period characters in the IP address must be escaped (preceded by "\"), because in an IP address a period has its literal meaning rather than its special meaning in regular expression syntax. The special meaning of the period in regular expressions was introduced earlier in this article.
The time part of the log record is enclosed in square brackets. You can extract everything inside the brackets as follows: first search for the opening bracket ("["), then capture all characters that are not the closing bracket ("]"), scanning forward until the closing bracket is found. Figure 9 shows the regular expression for this part.

Figure 9: Matches at least one character, up to the closing "]"

Now enclose each of the two regular expressions in grouping symbols (parentheses) and combine them into a single expression, so that you can extract both the IP address and the time from a log record. Note that, to match the "- -" between them (without extracting it), the regular expression "\s-\s-\s" is added in the middle. The complete regular expression is shown in Figure 10.

Figure 10: Matches the IP address and the timestamp

Now that the regular expression is ready, you can write the Java code that uses the regular expression library.
To use the Jakarta-ORO library, first create the regular expression string and the log record string to be analyzed.
The regular expression used in the code is almost identical to the one in Figure 10, with one exception: in Java, every backslash ("\") must itself be escaped. Figure 10 is not in Java notation, so you must add a "\" before each "\" to avoid a compile error. Unfortunately, this escaping is error-prone, so be careful. You can first type the regular expression without escaping and then, working from left to right, replace each "\" with "\\". To double-check, print the string to the screen.
After initializing the strings, instantiate a PatternCompiler object and use it to compile the regular expression into a Pattern object.
Next, create a PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match.
Finally, use the MatchResult object returned by the PatternMatcher interface to print the matched groups. Because the logEntry string contains a match, the program prints the extracted IP address and timestamp.
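The steps above can be sketched end to end with the JDK's java.util.regex instead of Jakarta-ORO. The regular expression is assembled from the description in the text (an assumed stand-in for Figure 10, with every backslash doubled for Java), and the log line in main is a hypothetical record made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogEntryDemo {
    // Group 1: four dot-separated runs of 1-3 digits. Then " - - ".
    // Group 2: everything between "[" and the next "]".
    static final String REGEX =
            "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s-\\s-\\s\\[([^\\]]+)\\]";
    static final Pattern LOG = Pattern.compile(REGEX);

    // Returns {ip, timestamp}, or null when the entry contains no match.
    static String[] parse(String logEntry) {
        Matcher m = LOG.matcher(logEntry);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        // Printing the string shows the regex with single backslashes.
        System.out.println(REGEX);
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
                + "\"GET /IsAlive.htm HTTP/1.0\" 200 15";
        String[] parts = parse(logEntry);
        System.out.println("IP: " + parts[0]);
        System.out.println("Timestamp: " + parts[1]);
    }
}
```

Note that this pattern only checks the digit counts; it does not verify that each byte is at most 255.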
3.2 HTML processing example 1
The next task is to analyze all the attributes inside the FONT tags of an HTML page. A typical FONT tag in an HTML page carries attributes such as face="Arial, Serif" size="+2" color="red".
The program prints each attribute of the FONT tag as a name-value pair.
For this task I suggest using two regular expressions. The first, shown in Figure 11, extracts the attribute string face="Arial, Serif" size="+2" color="red" from the FONT tag.

Figure 11: Matches all attributes of the FONT tag

The second regular expression, shown in Figure 12, splits the attributes into individual name-value pairs.

Figure 12: Matches a single attribute and splits it into a name-value pair

For the example tag, the split yields three pairs: face with the value "Arial, Serif", size with "+2", and color with "red".
Now let us look at the Java code that completes this task. First create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling the regular expressions, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is not case sensitive.
Next, create a Perl5Matcher object to perform the matching.
Suppose there is a String variable named html that holds one line of an HTML file. If the html string contains a FONT tag, the matcher returns true. You can then obtain the first group, which contains all the attributes of the FONT tag, from the MatchResult object returned by the matcher.
Next, create a PatternMatcherInput object. This object lets matching continue from the position where the previous match ended, so it is well suited to extracting the name-value pairs of the FONT tag's attributes. Create the PatternMatcherInput object by passing it the string to be matched. Then use the matcher to extract each FONT attribute in turn, by repeatedly calling the contains() method of the PatternMatcher object with the PatternMatcherInput object (rather than a String) as the parameter. On each iteration the PatternMatcherInput object advances its internal pointer, so the next test starts just after the previous match.
For the example tag, the output consists of the name-value pairs for face, size, and color.
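A rough equivalent of the two-pass approach, written with java.util.regex rather than Jakarta-ORO, might look like this. Both patterns are assumed stand-ins for Figures 11 and 12, which are not available, and repeated Matcher.find() calls take the place of the PatternMatcherInput loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontAttrDemo {
    // Assumed stand-in for Figure 11: group 1 is everything between "<font" and ">".
    static final Pattern FONT_TAG = Pattern.compile(
            "<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    // Assumed stand-in for Figure 12: one name="value" pair per match.
    static final Pattern ATTRIBUTE = Pattern.compile(
            "([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attributes of the first FONT tag, in document order.
    static Map<String, String> attributes(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) { // each call resumes after the last match
                result.put(attr.group(1), attr.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        attributes(html).forEach((name, value) ->
                System.out.println(name + " -> " + value));
    }
}
```

Regular expressions handle a single well-formed tag like this one; attribute values without quotes or tags split across lines would need a more tolerant pattern or an HTML parser.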
3.3 HTML processing example 2
Let us look at another HTML processing example. This time, assume that the Web server has moved from widgets.acme.com to newserver.acme.com, so the links in some pages must be updated to point to the new server.
The regular expression that performs the search is shown in Figure 13.

Figure 13: Matches the link before modification

If the regular expression matches, the link matched by Figure 13 can be replaced with a link to the new server in which the # character is followed by $1. Perl regular expression syntax uses $1, $2, and so on to refer to groups that have been matched and captured. The content that the expression in Figure 13 matches and captures as a group is appended to the end of the new link.
Now, back to Java. As before, you must create a test string, create the objects needed to compile the regular expression into a Pattern object, and create a PatternMatcher object.
Next, use the static substitute() method of the Util class in the com.oroinc.text.regex package to perform the replacement and output the resulting string.
The Util.substitute() method takes five parameters. The first two are the PatternMatcher and Pattern objects created earlier. The third is a Substitution object, which determines how the replacement is carried out; this example uses a Perl5Substitution object, which performs Perl 5 style substitution. The fourth is the string on which the replacement is to be performed. The last parameter lets you specify whether to replace all substrings that match the pattern (Util.SUBSTITUTE_ALL) or only a given number of them.
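The JDK's java.util.regex supports the same $1 notation in replacement strings, so the substitution can be sketched without Jakarta-ORO. The pattern below is an assumed stand-in for Figure 13, and the page name interface.html and the sample anchor are hypothetical; Matcher.replaceAll corresponds to substituting every match.

```java
import java.util.regex.Pattern;

class LinkRewriteDemo {
    // Assumed stand-in for Figure 13: group 1 captures the text after "#".
    static final Pattern OLD_LINK = Pattern.compile(
            "<\\s*a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\"\\s*>",
            Pattern.CASE_INSENSITIVE);
    // "$1" is replaced by whatever group 1 captured.
    static final String NEW_LINK =
            "<a href=\"http://newserver.acme.com/interface.html#$1\">";

    static String rewrite(String html) {
        return OLD_LINK.matcher(html).replaceAll(NEW_LINK);
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://widgets.acme.com/interface.html#How_To_Buy\">Buy</a>";
        System.out.println(rewrite(html));
    }
}
```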
