Java Regular Expressions in Detail

2010-10-09  Source: original to this site  Category: Java  Views: 203

If you have ever used Perl or any other language with built-in regular expression support, you already know how easy it is to process text with regular expression pattern matching. If the term is unfamiliar, a "regular expression" is a string of characters that defines a pattern used to search for matching strings.
Many languages, including Perl, PHP, Python, JavaScript, and JScript, support regular expressions for text processing, and some text editors use them for advanced search-and-replace. What about Java? As of this writing, a Java Specification Request (JSR) that includes a regular expression library for text processing has been approved, and you can expect to see it in the next version of the JDK.
But what if you need regular expressions now? You can download the open source Jakarta-ORO library from Apache.org. This article first gives a brief introduction to regular expressions and then uses the Jakarta-ORO API as an example of how to use them.
1. Regular expression basics
Let us start simple. Suppose you want to search for strings that contain the characters "cat"; the regular expression for the search is simply "cat". If the search is not case sensitive, the words "catalog", "Catherine", and "sophisticated" all match.
1.1 The dot symbol
Suppose you are playing Scrabble and want to find all three-letter words that begin with the letter "t" and end with the letter "n". Suppose also that you have an English dictionary whose entire contents you can search with a regular expression. To construct the regular expression you can use a wildcard, the dot symbol ".". The full expression is "t.n"; it matches "tan", "ten", "tin", and "ton", but it also matches "t#n", "tpn", and even "t n", along with many other meaningless combinations. This is because the dot matches any single character, including a space or a Tab character (and, in engines or modes that allow it, a line break).
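As a minimal sketch of the dot wildcard, the following uses the JDK's java.util.regex package rather than Jakarta-ORO (an assumption made so that the example runs without an extra library); the class and method names are illustrative only. Note that in java.util.regex the dot does not match a line break unless Pattern.DOTALL is set.

```java
import java.util.regex.Pattern;

class DotDemo {
    // "t.n": the letter t, any single character, then the letter n.
    static final Pattern T_ANY_N = Pattern.compile("t.n");

    // True when the entire word matches the pattern.
    static boolean isMatch(String word) {
        return T_ANY_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"tan", "ten", "t#n", "t n", "toon"};
        for (String word : words) {
            System.out.println("\"" + word + "\" -> " + isMatch(word));
        }
    }
}
```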
1.2 Square brackets
To solve the problem that the dot matches too broadly, you can list the characters you consider meaningful inside square brackets ("[]"). Only the characters listed inside the brackets then take part in the match. In other words, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". It does not match "toon", because square brackets match exactly one character.
1.3 The "or" symbol
If, in addition to all the words matched above, you also want to match "toon", you can use the "|" operator, whose basic meaning is "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". Square brackets cannot be used for this extension, because they match only a single character; parentheses "()" are required here. Parentheses can also be used for grouping, as described later.
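The difference between a bracketed character list and an alternation can be sketched as follows. This is an illustration using java.util.regex (not the Jakarta-ORO API the article goes on to use), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class BracketsVsAlternation {
    // Square brackets: exactly one of a, e, i, o between t and n.
    static final Pattern BRACKETS = Pattern.compile("t[aeio]n");
    // Alternation inside parentheses: one branch may be longer ("oo").
    static final Pattern ALTERNATION = Pattern.compile("t(a|e|i|o|oo)n");

    static boolean matchesBrackets(String word) {
        return BRACKETS.matcher(word).matches();
    }

    static boolean matchesAlternation(String word) {
        return ALTERNATION.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"tan", "ton", "toon", "tun"}) {
            System.out.println(word + ": brackets=" + matchesBrackets(word)
                    + ", alternation=" + matchesAlternation(word));
        }
    }
}
```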
1.4 Quantifier symbols
Table 1 lists the quantifier symbols, which specify how many times the symbol immediately to their left may occur.

Suppose we want to search a text file for U.S. Social Security numbers. The format of these numbers is 999-99-9999. The regular expression used to match it is shown in Figure 1. In regular expressions the hyphen ("-") has a special meaning: it denotes a range, such as 0 to 9 (strictly speaking, only inside square brackets). Therefore, to match the literal hyphens in a Social Security number, put the escape character "\" in front of each one.

Figure 1: Matches all Social Security numbers of the form 123-12-1234

Suppose that in the search the hyphens may or may not be present, that is, 999-99-9999 and 999999999 are both valid formats. In that case, add the quantifier "?" after each hyphen, as shown in Figure 2.

Figure 2: Matches all Social Security numbers of the form 123-12-1234 or 123121234

Let us look at another example. One format for American car license plates is four digits followed by two letters. Its regular expression consists of the digit part "[0-9]{4}" followed by the letter part "[A-Z]{2}". Figure 3 shows the complete regular expression.

Figure 3: Matches a typical American license plate number, such as 8836KV
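The quantifier examples in this section can be tried out with a short sketch. The patterns below are reconstructed from the descriptions in the text (digit counts in braces, an escaped hyphen, "?" for an optional hyphen), so treat them as assumptions rather than the exact contents of the figures; the code uses java.util.regex, and all names are illustrative.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hyphens required: 3 digits, 2 digits, 4 digits.
    static final Pattern SSN_STRICT =
            Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");
    // "?" after each escaped hyphen makes that hyphen optional.
    static final Pattern SSN_LOOSE =
            Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");
    // Four digits followed by two uppercase letters.
    static final Pattern PLATE = Pattern.compile("[0-9]{4}[A-Z]{2}");

    static boolean isStrictSsn(String s) {
        return SSN_STRICT.matcher(s).matches();
    }

    static boolean isLooseSsn(String s) {
        return SSN_LOOSE.matcher(s).matches();
    }

    static boolean isPlate(String s) {
        return PLATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStrictSsn("123-12-1234"));
        System.out.println(isLooseSsn("123121234"));
        System.out.println(isPlate("8836KV"));
    }
}
```

Note that the doubled backslashes are required only because the patterns are written as Java string literals.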

1.5 The "not" symbol
The "^" symbol is known as the "not" symbol. When used inside square brackets, "^" marks the characters that you do not want to match. For example, the regular expression in Figure 4 matches all words except those that begin with the letter "X".

Figure 4: Matches all words except those that begin with "X"
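A negated character class can be sketched like this. The pattern "[^X][a-z]+" is an assumption chosen to fit the description (a first character that is anything but "X", followed by lowercase letters); it is not necessarily the expression from the figure. The example uses java.util.regex, and the names are illustrative.

```java
import java.util.regex.Pattern;

class NegationDemo {
    // [^X] accepts any single character except an uppercase X.
    static final Pattern NOT_X_WORD = Pattern.compile("[^X][a-z]+");

    static boolean isMatch(String word) {
        return NOT_X_WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Apple", "Xenon", "xenon"}) {
            System.out.println(word + " -> " + isMatch(word));
        }
    }
}
```

Because matching is case sensitive by default, the lowercase "xenon" is accepted while "Xenon" is rejected.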

1.6 Parentheses and whitespace symbols
Suppose you want to extract the month from a birth date in the format "June 26, 1951". The regular expression that matches such a date is shown in Figure 5.

Figure 5: Matches all dates in the format Month DD, YYYY

The newly introduced "\s" symbol is the whitespace symbol; it matches any whitespace character, including the Tab character. If the string matches, how do you then extract the month? Simply put parentheses around the month to create a group, and then use the ORO API (discussed in detail later in this article) to retrieve its value. The modified regular expression is shown in Figure 6.

Figure 6: Matches all dates in the format Month DD, YYYY, with the month defined as the first group
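Extracting the month through a group might look like the sketch below. It uses java.util.regex instead of the ORO API, and the pattern is an assumption built from the description (a word for the month, whitespace, a one- or two-digit day, a comma, and a four-digit year); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MonthExtractor {
    // Group 1 (the parentheses) captures the month name.
    private static final Pattern DATE = Pattern.compile(
            "([a-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}", Pattern.CASE_INSENSITIVE);

    // Returns the month, or null when the text is not in Month DD, YYYY form.
    static String extractMonth(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractMonth("June 26, 1951"));
    }
}
```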

1.7 Other symbols
For convenience, you can use shortcut symbols for some commonly used patterns, as listed in Table 2.
Table 2: Common shortcut symbols

For example, in the earlier Social Security number example, we can use "\d" wherever "[0-9]" appears. The modified regular expression is shown in Figure 7.

Figure 7: Matches all Social Security numbers of the form 123-12-1234

2. The Jakarta-ORO library
Many open source regular expression libraries are available to Java programmers, and many of them support Perl 5-compatible regular expression syntax. Here I use the Jakarta-ORO regular expression library, which is one of the most comprehensive regular expression APIs and is compatible with Perl 5 regular expressions. It is also one of the best-optimized APIs.
The Jakarta-ORO library was formerly called OROMatcher; Daniel Savarese generously donated it to the Jakarta Project. You can download it by following the instructions in the reference resources.
First I briefly describe the objects you must create and access when you use the Jakarta-ORO library, and then I show how to use the Jakarta-ORO API.
▲ PatternCompiler object
First, create an instance of the Perl5Compiler class and assign it to a variable of the PatternCompiler interface type. Perl5Compiler is an implementation of the PatternCompiler interface; it lets you compile a regular expression into a Pattern object that is used for matching.
▲ Pattern object
To compile a regular expression into a Pattern object, call the compile() method of the compiler object and pass the regular expression as an argument. For example, passing "t[aeio]n" to compile() produces a Pattern object for that expression.
By default the compiler creates a case-sensitive pattern. The pattern compiled this way therefore matches only "tin", "tan", "ten", and "ton", and does not match "Tin" or "taN". To create a case-insensitive pattern, pass an additional parameter when you invoke the compiler.
Once the Pattern object has been created, you can use it for pattern matching through a PatternMatcher object.
▲ PatternMatcher object
A PatternMatcher object checks for matches based on a Pattern object and a string. Instantiate the Perl5Matcher class and assign the result to a variable of the PatternMatcher interface type. Perl5Matcher is an implementation of the PatternMatcher interface that performs pattern matching according to Perl 5 regular expression syntax.
A PatternMatcher object offers several matching methods; the first parameter of each is the string to be matched against the regular expression:
· boolean matches(String input, Pattern pattern): use when the input string and the regular expression must match exactly. In other words, the regular expression must describe the entire input string.
· boolean matchesPrefix(String input, Pattern pattern): use when the regular expression must match the beginning of the input string.
· boolean contains(String input, Pattern pattern): use when the regular expression must match part of the input string (that is, the match must be a substring).
In all three methods you can also pass a PatternMatcherInput object instead of a String; matching then starts from the position where the previous match ended. This is very useful when a string may contain several substrings that match the given regular expression. With PatternMatcherInput in place of String, the three methods have the following signatures:
· boolean matches(PatternMatcherInput input, Pattern pattern)
· boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
· boolean contains(PatternMatcherInput input, Pattern pattern)
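For readers without the ORO library at hand, the three kinds of matching can be approximated with the JDK's java.util.regex package. The mapping used here (matches() for a whole-string match, lookingAt() for a prefix match, find() for a substring match, and repeated find() calls in place of PatternMatcherInput) is my own rough correspondence, not something the ORO documentation states; the class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModes {
    static final Pattern TIN = Pattern.compile("t[aeio]n");

    // Whole input must match (compare ORO's matches()).
    static boolean matchesWhole(String input) {
        return TIN.matcher(input).matches();
    }

    // Input must start with a match (compare matchesPrefix()).
    static boolean matchesPrefix(String input) {
        return TIN.matcher(input).lookingAt();
    }

    // A match may occur anywhere in the input (compare contains()).
    static boolean contains(String input) {
        return TIN.matcher(input).find();
    }

    // Each find() resumes where the previous match ended.
    static int countMatches(String input) {
        Matcher m = TIN.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("tinsel"));
        System.out.println(matchesPrefix("tinsel"));
        System.out.println(contains("satin"));
        System.out.println(countMatches("tan ten tin ton"));
    }
}
```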
3. Application examples
Let us look at some application examples of the Jakarta-ORO library.
3.1 Log file processing
Task: analyze a Web server log file to determine how much time each user spends on the site. The records come from a typical BEA WebLogic log file.
Analyzing a log record shows that two items need to be extracted from it: the IP address and the page access time. You can use grouping symbols (parentheses) to extract the IP address and the time stamp from the log record.
Let us look at the IP address first. An IP address consists of 4 bytes, each with a value between 0 and 255, separated by periods. Each byte of the IP address therefore has at least one and at most three digits. Figure 8 shows the regular expression written for the IP address.

Figure 8: Matches an IP address

The periods in the IP address must be escaped (preceded by "\"), because here the period has its literal meaning rather than the special meaning it has in regular expression syntax. The special meaning of the period in regular expressions was described earlier in this article.
The time portion of the log record is enclosed in a pair of square brackets. You can extract everything inside the brackets with the following approach: first find the opening bracket character ("["), then take every character that is not the closing bracket character ("]"), moving forward until the closing bracket is found. Figure 9 shows the regular expression for this part.

Figure 9: Matches at least one character, up to the closing "]"

Now combine the two regular expressions, each wrapped in grouping symbols (parentheses), into a single expression, so that you can extract both the IP address and the time from the log record. Note that "\s-\s-\s" is added in the middle of the regular expression to match the "- -" (without extracting it). The complete regular expression is shown in Figure 10.

Figure 10: Matches the IP address and the time stamp

Now that the regular expression is ready, you can write the Java code that uses the regular expression library.
To use the Jakarta-ORO library, first create the regular expression string and the log record string to be analyzed.
The regular expression used in the code is almost identical to the one in Figure 10, with one exception: in a Java string literal, every backslash ("\") must itself be escaped. Figure 10 is not written as a Java string, so you must put another "\" in front of each "\" to avoid a compilation error. Unfortunately this escaping is error-prone, so be careful. One approach is to type the regular expression without escaping first, and then go from left to right replacing each "\" with "\\". To double-check the result, you can print the string to the screen.
After initializing the strings, instantiate a PatternCompiler object and use it to compile the regular expression into a Pattern object.
Next, create a PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match.
Finally, use the MatchResult object returned by the PatternMatcher object to output the matched groups. Because the logEntry string contains a match, the program prints the two extracted groups: the IP address and the time stamp.
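The whole log-processing flow can be sketched with java.util.regex in place of Jakarta-ORO (a substitution made so that the code runs on a plain JDK). The pattern is assembled from the description above (four groups of one to three digits separated by escaped periods, then "\s-\s-\s", then a bracketed time stamp) and is therefore an assumption about the figures; the sample log line in main() is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogEntryParser {
    // Group 1: IP address. Group 2: text between "[" and "]".
    // Every regex backslash is doubled because this is a Java string literal.
    private static final Pattern ENTRY = Pattern.compile(
            "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})"
            + "\\s-\\s-\\s"
            + "\\[([^\\]]+)\\]");

    // Returns {ip, timestamp}, or null when the entry does not match.
    static String[] parse(String logEntry) {
        Matcher m = ENTRY.matcher(logEntry);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        // Hypothetical log line, made up for illustration.
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
                + "\"GET /IsAlive.htm HTTP/1.0\" 200 15";
        String[] parts = parse(logEntry);
        if (parts != null) {
            System.out.println("IP: " + parts[0]);
            System.out.println("Timestamp: " + parts[1]);
        }
    }
}
```

Like the expression described in the text, this pattern checks only the digit count of each byte, not that its value is at most 255.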
3.2 HTML processing example 1
The next task is to analyze all the attributes inside the FONT tags of an HTML page and to output the attributes of each FONT tag.
For this case I suggest using two regular expressions. The first, shown in Figure 11, extracts the attribute string from the FONT tag, for example: face="Arial, Serif" size="+2" color="red".

Figure 11: Matches all attributes of a FONT tag

The second regular expression, shown in Figure 12, splits the attribute string into individual name-value pairs.

Figure 12: Matches a single attribute and splits it into a name-value pair

The result of the split is one name-value pair for each attribute.
Now let us look at the Java code for this task. First create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is not case sensitive.
Next, create a Perl5Matcher object to perform the matching.
Suppose there is a String variable named html that holds one line of the HTML document. If the html string contains a FONT tag, the matcher returns true. You can then take the first group from the MatchResult object returned by the matcher; it contains all the attributes of the FONT tag.
Next, create a PatternMatcherInput object. This object lets each match start from the position where the previous match ended, so it is well suited to extracting the name-value pairs of the FONT tag attributes. Create the PatternMatcherInput object by passing it the string to be matched. Then use the matcher to extract each attribute of the FONT tag: call the contains() method of the PatternMatcher object repeatedly, passing the PatternMatcherInput object (rather than a String) as the parameter. On each iteration the PatternMatcherInput object advances its internal pointer, so the next match attempt starts just after the previous match.
The output of this example lists the attributes of each FONT tag.
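The two-step approach (first isolate the attribute string, then walk through it pair by pair) can be sketched as follows. The code uses java.util.regex rather than the ORO classes named above, with repeated find() calls playing the role of PatternMatcherInput; both patterns are assumptions written to fit the description, and the class name is hypothetical. For brevity the sketch handles only the first FONT tag in the input.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontAttributes {
    // Step 1: group 1 captures everything between "<font" and ">".
    private static final Pattern FONT_TAG = Pattern.compile(
            "<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    // Step 2: group 1 is the attribute name, group 2 its quoted value.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attributes of the first FONT tag, in document order.
    static Map<String, String> parse(String html) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            // Each find() continues after the previous name-value pair.
            while (attr.find()) {
                attributes.put(attr.group(1), attr.group(2));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        parse(html).forEach((name, value) ->
                System.out.println(name + " -> " + value));
    }
}
```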
3.3 HTML processing example 2
Let us look at another HTML processing example. This time, assume that the Web server has moved from widgets.acme.com to newserver.acme.com, and you now have to update the links on some pages.
The regular expression that performs the search is shown in Figure 13.

Figure 13: Matches the link before the change

If the regular expression matches, you can replace the link matched by Figure 13 with a link to the new server.
Note the $1 that follows the # character in the replacement. Perl regular expression syntax uses $1, $2, and so on to refer to groups that have been matched and extracted. The expression in Figure 13 captures the part of the link that follows the # as a group.
Now back to Java. As before, you create the test string, create the regular expression string and compile it into a Pattern object, and create a PatternMatcher object.
Next, use the static substitute() method of the Util class in the com.oroinc.text.regex package to perform the replacement, and output the resulting string.
Util.substitute() takes five parameters. The first two are the PatternMatcher and Pattern objects created earlier. The third is a Substitution object, which determines how the replacement is carried out; this example uses a Perl5Substitution object, which performs Perl 5-style substitution. The fourth is the string on which the replacement is to be performed. The last lets you specify whether to replace every occurrence of the pattern (Util.SUBSTITUTE_ALL) or only a given number of occurrences.
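A comparable replacement can be sketched with java.util.regex, whose Matcher.replaceAll() also understands $1 in the replacement string. This is a stand-in for Util.substitute(), not the ORO call itself. The page path interface.html, the pattern, and the class name are hypothetical; only the two host names come from the text.

```java
import java.util.regex.Pattern;

class LinkRewriter {
    // Group 1 captures the fragment that follows "#" in the old link.
    private static final Pattern OLD_LINK = Pattern.compile(
            "<\\s*a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/"
            + "interface\\.html#([^\"]+)\">",
            Pattern.CASE_INSENSITIVE);

    // $1 is replaced by whatever group 1 captured.
    private static final String NEW_LINK =
            "<a href=\"http://newserver.acme.com/interface.html#$1\">";

    // Rewrites every matching link; other text is left untouched.
    static String rewrite(String html) {
        return OLD_LINK.matcher(html).replaceAll(NEW_LINK);
    }

    public static void main(String[] args) {
        String link =
                "<a href=\"http://widgets.acme.com/interface.html#How_To_Buy\">";
        System.out.println(rewrite(link));
    }
}
```

Calling replaceFirst() instead of replaceAll() would replace only the first occurrence, roughly comparable to limiting the number of substitutions.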
【Conclusion】 In this article I have shown you the power of regular expressions. Used correctly, regular expressions can play a significant role in extracting strings and modifying text. I have also shown how to use regular expressions in Java programs through the Jakarta-ORO library. Whether you ultimately stick with the old-fashioned string handling approach (StringTokenizer, charAt, and substring) or adopt regular expressions is for you to decide.
