Constructing a Word-Segmentation Dictionary (Part 1)

2010-10-05  Source: original to this site  Category: Internet  Views: 195

For anyone new to word segmentation, building the segmentation dictionary is not something to underestimate, because the dictionary directly affects the algorithm's performance and running time. A well-constructed dictionary noticeably improves segmentation performance, and the more complex segmentation algorithms depend directly on how the dictionary is structured (the dictionary is the foundation of segmentation). This series describes several ways of constructing the dictionary, spread over several parts.
In this article I use the most basic way of constructing the dictionary: indexing by pinyin. (I also think it is the most direct method.)
Below I share the specific steps, based on my own application.
1. One file builds a LinkedHashMap pinyin table that maps each pinyin syllable to a corresponding index, as follows:
hashMap.put("a", 0);
hashMap.put("ai", 1);
hashMap.put("an", 2);
hashMap.put("ang", 3);
hashMap.put("ao", 4);
hashMap.put("ba", 5);
hashMap.put("bai", 6);
hashMap.put("ban", 7);
The syllable "a" sits at index 0 of the table, "ai" at index 1, and so on.
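The table does not have to be typed out one put() call at a time. Here is a minimal sketch, assuming the syllables are kept in an ordered array. The class name PinyinIndexTable and the method buildIndex are made up for this illustration, and only the eight syllables listed above are included:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PinyinIndexTable {
    // Only the syllables shown in the article; a full table has many more entries.
    static final String[] SYLLABLES = {"a", "ai", "an", "ang", "ao", "ba", "bai", "ban"};

    // Assigns each syllable its position in the array as its index.
    static Map<String, Integer> buildIndex(String[] syllables) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (int i = 0; i < syllables.length; i++) {
            index.put(syllables[i], i);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Integer> hashMap = buildIndex(SYLLABLES);
        System.out.println(hashMap.get("a"));   // 0
        System.out.println(hashMap.get("ban")); // 7
    }
}
```

Keeping the syllables in one array means the index of a syllable can never drift out of step with its position in the list.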
2. A second file converts Chinese characters into their pinyin. For example, the input "你好" ("hello") returns "nihao". It relies on the getCnAscii() method, which takes a character's code in the Chinese national standard encoding (GBK in the code below) and returns the corresponding int value. As follows:
// needs java.io.UnsupportedEncodingException
public static int getCnAscii(char cn) {
    byte[] bytes = null;
    try {
        bytes = (String.valueOf(cn)).getBytes("gbk");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    if (bytes == null || bytes.length > 2 || bytes.length <= 0) {
        return 0; // encoding failed or unexpected length
    }
    if (bytes.length == 1) {
        return bytes[0]; // single-byte (ASCII) character
    }
    if (bytes.length == 2) {
        // Java bytes are signed, so add 256 to recover the unsigned value
        int hightByte = 256 + bytes[0];
        int lowByte = 256 + bytes[1];
        int ascii = (256 * hightByte + lowByte) - 256 * 256;
        // System.out.println("ASCII =" + ascii);
        return ascii;
    }
    return 0;
}
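To make the two-byte arithmetic concrete, here is a small worked example. It is only a sketch: the class GbkCodeDemo and the helper codeFromBytes are hypothetical names, and the helper takes the two bytes directly instead of calling getBytes("gbk"), so it does not depend on the GBK charset being installed.

```java
final class GbkCodeDemo {
    // Same arithmetic as the two-byte branch of getCnAscii(), applied to raw bytes.
    static int codeFromBytes(byte high, byte low) {
        int highByte = 256 + high; // bytes of 0x80 and above arrive as negative values
        int lowByte = 256 + low;
        return (256 * highByte + lowByte) - 256 * 256;
    }

    public static void main(String[] args) {
        // 0xB0 0xA1 is the GBK encoding of U+554A, a character pronounced "a".
        // 256 * 176 + 161 = 45217, and 45217 - 65536 = -20319.
        System.out.println(codeFromBytes((byte) 0xB0, (byte) 0xA1)); // -20319
    }
}
```

The result, -20319, is the value that the pinyin table associates with "a".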

The initialize() method then stores the correspondence between pinyin syllables and these values in a hash map.
(LinkedHashMap<String, Integer> spellMap = new LinkedHashMap<String, Integer>(400);)
spellPut("a", -20319);
spellPut("ai", -20317);
spellPut("an", -20304);
spellPut("ang", -20295);
spellPut("ao", -20292);
spellPut("ba", -20283);
spellPut("bai", -20265);
spellPut("ban", -20257);
spellPut("bang", -20242);
spellPut("bao", -20230);
spellPut("bei", -20051);
spellPut("ben", -20036);
spellPut("beng", -20032);
... ...

So to convert a character to pinyin, first call getCnAscii() to obtain its int value, then use the hash map to find the matching pronunciation.
Note that a character with more than one pronunciation is filed under the first of its pronunciations. Characters whose pinyin cannot be found, such as some complex (traditional-form) characters, are put into a separate category. (There are not many such characters, so this has no effect on performance.)
3. Traverse the txt dictionary file and, according to the pronunciation of the first character of each entry, file the entries into different classes, one class per pronunciation.
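The article does not show how the value is matched against the table. One plausible reading, stated here as an assumption, is that each table value is the first code of a syllable's range, so the lookup is "the largest table value that does not exceed the character's code". The sketch below expresses that with a TreeMap and floorEntry(), which is a substitute for the LinkedHashMap used in the article; the class SpellLookupSketch and the method spellOf are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

final class SpellLookupSketch {
    // Keyed by code so that floorEntry() can find the enclosing range.
    static final TreeMap<Integer, String> RANGE_START = new TreeMap<>();
    static {
        // Values copied from the first six spellPut() calls in the article.
        RANGE_START.put(-20319, "a");
        RANGE_START.put(-20317, "ai");
        RANGE_START.put(-20304, "an");
        RANGE_START.put(-20295, "ang");
        RANGE_START.put(-20292, "ao");
        RANGE_START.put(-20283, "ba");
    }

    // Returns the syllable whose range contains the code, or null when the code
    // lies below the first range. With this truncated table, every code from
    // -20283 upward maps to "ba"; a full table would bound that range too.
    static String spellOf(int code) {
        Map.Entry<Integer, String> entry = RANGE_START.floorEntry(code);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        System.out.println(spellOf(-20319)); // a
        System.out.println(spellOf(-20300)); // an
    }
}
```

A null result corresponds to the separate category for characters whose pinyin cannot be found.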
public void makeDictionary() throws IOException {
    // Open a stream on the dictionary file
    BufferedReader bf = new BufferedReader(new FileReader("stopword.txt"));
    String dicItem = ""; // the entry on each line
    int currentZiYinIndex = 0; // index of the current pronunciation
    while ((dicItem = bf.readLine()) != null) {
        String firstCharacter = dicItem.substring(0, 1); // first character of the entry
        String ziYin = CnToSpell.getFullSpell(firstCharacter); // pronunciation of the first character
        if (hashMap.get(ziYin) != null) {
            currentZiYinIndex = hashMap.get(ziYin).intValue(); // index for that pronunciation
            // Entries in a bucket are joined into one space-separated string
            if (dicStr[currentZiYinIndex] == null) {
                dicStr[currentZiYinIndex] = dicItem + " ";
            } else {
                dicStr[currentZiYinIndex] += dicItem + " ";
            }
        } else { // otherwise, file the entry in the bucket for unrecognized characters
            if (dicStr[395] == null) {
                dicStr[395] = dicItem + " ";
            } else {
                dicStr[395] += dicItem + " ";
            }
        }
    }
    bf.close();
}
Similarly, to look up an entry, first find the pinyin of its first character, then use that pinyin to narrow the search range, which speeds up the query. The code is as follows:
public boolean lookDictionary(String words) {
    boolean flag = false; // marks whether the entry was found
    String firstWord = "";
    if (words.length() > 1) {
        firstWord = words.substring(0, 1);
    } else if (words.length() == 1) {
        firstWord = String.valueOf(words.charAt(0));
    }
    // System.out.println(firstWord);
    String ziYin = CnToSpell.getFullSpell(firstWord); // pronunciation of the first character
    int index = 0; // bucket index
    if (hashMap.get(ziYin) != null) {
        index = hashMap.get(ziYin).intValue(); // get the index and find the matching bucket
    } else {
        index = 395; // bucket for unrecognized characters
    }
    if (dicStr[index] == null)
        return false;
    String singleWord[] = dicStr[index].split(" "); // split the long string into words on spaces
    int numOfSpace = 0; // number of spaces, which gives the number of words in the array
    for (int i = 0; i < dicStr[index].length(); i++) { // count the spaces
        if (dicStr[index].charAt(i) == ' ') {
            numOfSpace++;
        }
    }
    // Search the array word by word; set the flag to true if found
    for (int ind = 0; ind < numOfSpace; ind++) {
        if (singleWord[ind].equals(words)) {
            flag = true;
        }
    }
    return flag;
}
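Storing each bucket as one space-separated string means every lookup splits the string again and scans it linearly. As an alternative, and not the approach used above, the sketch below keeps one Set per pronunciation. Everything in it is an assumption made for illustration: the class name BucketDictionary, the "?" key that plays the role of bucket 395, and the Function parameter that stands in for CnToSpell.getFullSpell (replaced in the demo by a two-character toy table).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class BucketDictionary {
    private static final String UNKNOWN = "?"; // plays the role of bucket 395
    private final Map<String, Set<String>> buckets = new HashMap<>();
    private final Function<String, String> spell; // stand-in for CnToSpell.getFullSpell

    BucketDictionary(Function<String, String> spell) {
        this.spell = spell;
    }

    // Bucket key of an entry: the pronunciation of its first character.
    private String bucketKey(String word) {
        String ziYin = spell.apply(word.substring(0, 1));
        return ziYin == null ? UNKNOWN : ziYin;
    }

    void add(String word) {
        if (word.isEmpty()) {
            return; // skip blank lines in the dictionary file
        }
        buckets.computeIfAbsent(bucketKey(word), k -> new HashSet<>()).add(word);
    }

    boolean contains(String word) {
        if (word.isEmpty()) {
            return false;
        }
        Set<String> bucket = buckets.get(bucketKey(word));
        return bucket != null && bucket.contains(word);
    }

    public static void main(String[] args) {
        // Toy pronunciation table: U+554A -> "a", U+7231 -> "ai"; all else is unknown.
        Map<String, String> toySpell = Map.of("\u554a", "a", "\u7231", "ai");
        BucketDictionary dict = new BucketDictionary(toySpell::get);
        for (String w : List.of("\u554a", "\u7231\u4eba", "\u7231\u60c5")) {
            dict.add(w);
        }
        System.out.println(dict.contains("\u7231\u4eba")); // true
        System.out.println(dict.contains("\u7231\u5fc3")); // false
    }
}
```

The pinyin partitioning is kept as in the article; the set only changes the search inside a bucket from a linear scan to a hash lookup.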
Of course, there are other mainstream mechanisms for structuring a segmentation dictionary; I will share them in the following articles.
