Reposted from: http://edu.codepub.com/2009/0910/15270.php

Research on Text Clustering

The Internet has developed into the world's largest information repository and the main channel for the global dissemination of information. With the massive popularity of the Internet and the rising level of enterprise informatization, online resources are exploding. According to the latest statistical report on China's Internet development published by the China Internet Network Information Center (CNNIC) in January 2007, 70.2% of online information is presented in text form. For such semi-structured or unstructured data, how to extract specific information and knowledge has become a pressing problem. In recent years, research on text mining, information filtering, and information retrieval has surged as never before.

As an unsupervised machine learning method, clustering can organize a large number of text documents into a small number of meaningful clusters and provide a navigation or browsing mechanism.

The main applications of text clustering include:

(1) Text clustering can serve as a preprocessing step for natural language processing applications such as multi-document automatic summarization. A typical example is Newsblaster [1], the multi-document automatic summarization system developed at Columbia University. The system clusters news stories by topic, then eliminates redundancy among same-topic documents and applies information fusion and text generation to produce a concise summary for each document cluster.

(2) Clustering the results returned by a search engine lets users quickly locate the information they need. A typical system is Infonetware Real Term Search, which offers powerful topic classification of search results. In addition, Carrot2 version 2.0, the Java-based open-source search results clustering engine developed by Carrot Search, serves the same purpose: it automatically groups natural-language search results into corresponding semantic categories and provides hierarchy-, synonym-, and label-based filtering.

(3) Improving the results of text classification, as in the work of Y. C. Fang and others at Ohio State University [2].

(4) Automatic organization of document collections, e.g. Scatter/Gather [3], a document browsing system based on clustering.

2 The text clustering process

Text clustering rests primarily on the cluster hypothesis: documents in the same cluster are highly similar to one another, while documents in different clusters are less similar. As an unsupervised machine learning method, clustering requires no training process and no manually pre-labeled document categories, so it offers high flexibility and automation, and has become an important means of organizing, summarizing, and navigating text. The specific clustering process is shown in Figure 1.

Figure 1 The text clustering process

2.1 Preprocessing of text information

The first problem in text clustering is to express the textual content in a mathematical form that can be analyzed and processed, that is, to build a model of the text in which certain features (such as terms or descriptors) represent the target text. To build the features of a text, the common steps are: preprocessing the text (part-of-speech tagging, semantic tagging), building a statistical dictionary, and segmenting the text into terms, thereby completing the word segmentation of the text.

2.2 Building the text feature model

There are several kinds of text feature models; the commonly used ones are the Boolean logic model, the vector space model, the probabilistic model, and hybrid models. Among them, the vector space model (VSM) has been used most often in recent years and performs well [4]. Gerard Salton proposed the VSM in 1969 as a statistical model for representing documents. Its main idea is to map each document to a point in a vector space spanned by a set of normalized orthogonal term vectors. Every document in a known class, and every unknown document, can then be represented by a term vector (T1, W1, T2, W2, ..., Tn, Wn) in this space, where Ti is a feature term and Wi is its weight [5]. An evaluation function usually needs to be constructed to assign term weights; the sole criterion of the calculation is to maximize the differences between different documents. The biggest advantage of the vector space representation is that it turns unstructured and semi-structured text into vector form, making all kinds of mathematical processing possible.
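As a minimal sketch of the VSM (plain term frequency stands in for the weight Wi here; the paper leaves the actual evaluation function open, so this weighting is an illustrative assumption):

```python
def build_vocabulary(docs):
    """Collect the sorted set of all terms: the axes T1..Tn of the space."""
    return sorted({term for doc in docs for term in doc})

def doc_to_vector(doc, vocab):
    """Map a tokenized document to its weight vector (W1, ..., Wn),
    using raw term frequency as the weight."""
    return [doc.count(term) for term in vocab]

docs = [["cluster", "text", "text"], ["vector", "space", "text"]]
vocab = build_vocabulary(docs)
vectors = [doc_to_vector(d, vocab) for d in docs]
```

Each document thus becomes a point in the same n-dimensional space, and any vector arithmetic (distances, similarities) applies directly.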

2.3 Reducing the text feature set

The VSM expresses text content in a mathematical form that can be analyzed and processed, but the resulting document feature vectors have an astonishingly high dimension. Therefore, before clustering, the text feature set should be reduced. The usual approach is to rank the feature terms by weight and select a predetermined number of the best features as the resulting feature subset. The number of features selected and the evaluation function used are both decided according to the specific problem.
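The usual approach above can be sketched as follows (the weights and the cutoff k are illustrative, not values from the paper):

```python
def select_top_features(term_weights, k):
    """Rank terms by weight and keep the k best: the reduced feature subset."""
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]

# Hypothetical weights from some evaluation function.
weights = {"text": 9.0, "the": 1.0, "cluster": 7.5, "vector": 6.0}
top2 = select_top_features(weights, 2)
```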

Another method for reducing the dimensionality of text feature vectors is the sparse vector representation. Although the feature set of the whole collection has a very high dimension, for a single document most elements of the vector are zero, which means the vector representation of a single document is necessarily sparse. To save memory and speed up clustering, the sparse representation can be used. Suppose the number of feature terms is fixed at n; then the traditional representation (T1, W1, T2, W2, ..., Tn, Wn) becomes the sparse representation (D1, W1, D2, W2, ..., Dp, Wp, n) (Wi ≠ 0), where Di is a feature term with nonzero weight, Wi is its weight, and n is the vector dimension. This greatly reduces the memory footprint and improves clustering efficiency, but because the feature vectors of different documents are no longer dimensionally aligned, it increases the difficulty of the mathematical processing to some extent.
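The sparse representation described above can be sketched as follows, storing the nonzero entries as (index, weight) pairs together with the dimension n:

```python
def to_sparse(dense):
    """(W1, ..., Wn) -> ([(D1, W1), ..., (Dp, Wp)], n): keep nonzero entries."""
    pairs = [(i, w) for i, w in enumerate(dense) if w != 0]
    return pairs, len(dense)

def to_dense(pairs, n):
    """Rebuild the full vector from the sparse form."""
    dense = [0] * n
    for i, w in pairs:
        dense[i] = w
    return dense
```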

2.4 Text Clustering

Once the text is expressed in a mathematical form that can be analyzed and processed, the next job is to carry out the clustering on the basis of this form. There are two kinds of clustering methods: probability-based [6] and distance-based [7]. Probability-based methods rely on Bayesian probability theory and describe the clustering result with probability distributions. Distance-based methods represent each document by its feature vector, treat it as a point in the vector space, and cluster by computing the distances between points.

At present, the more mature distance-based text clustering approaches fall into two types: hierarchical agglomerative methods and flat partitioning methods.

For a given document collection D = (d1, d2, ..., di, ..., dn), the hierarchical agglomerative procedure is as follows:

(1) Treat each document di in D as a single-member cluster ci = (di); these clusters form a clustering of D: C = (c1, c2, ..., ci, ..., cn);

(2) Compute the similarity sim(ci, cj) between every pair of clusters (ci, cj) in C;

(3) Select the pair (ci, cj) with the greatest similarity sim(ci, cj) and merge ci and cj into a new cluster ck = ci ∪ cj, giving a new clustering of D: C = (c1, c2, ..., cn-1);

(4) Repeat steps (2) and (3) until only one cluster remains in C. This process constructs a tree that records the hierarchy of clusters and the similarities within and between all clusters. For a given document set

D = (d1, d2, ..., di, ..., dn), the flat partitioning procedure is as follows:

(1) Determine the number k of clusters to generate;

(2) Generate k cluster centers according to some principle as the cluster seeds S = (s1, s2, ..., si, ..., sk);

(3) For each document di in D, compute its similarity sim(di, sj) to each seed sj;

(4) Assign di to the cluster cj centered at the seed sj with the greatest similarity, obtaining a clustering C = (c1, c2, ..., ck) of D;

(5) Repeat steps (2)-(4) several times until the clustering result is stable. Each of the two types has advantages and disadvantages. Hierarchical agglomeration generates a hierarchy of nested clusters with high accuracy, but each merge requires comparing the similarities between all clusters and selecting the best pair, so it runs slowly and is unsuitable for large document collections. Flat partitioning is comparatively fast, but the value of k must be determined in advance, and the quality of the selected seeds has a significant impact on the clustering result.
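The two procedures above can be sketched in Python as a minimal illustration, not the paper's implementation: cosine similarity, single-linkage inter-cluster similarity for the agglomerative method, and first-k seeding for the partitioning method are all assumed choices that the text leaves open.

```python
def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def agglomerate(vectors, target=1):
    """Hierarchical agglomeration, steps (1)-(4). Inter-cluster similarity is
    the best pairwise similarity (single linkage, an assumed choice)."""
    clusters = [[i] for i in range(len(vectors))]          # step (1)
    while len(clusters) > target:
        best, pair = -1.0, None
        for a in range(len(clusters)):                     # step (2)
            for b in range(a + 1, len(clusters)):
                s = max(cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair                                        # step (3): merge
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters                                        # step (4)

def flat_partition(vectors, k, rounds=10):
    """Flat partitioning, steps (1)-(5): the first k documents serve as seeds
    (an arbitrary seeding principle), with centroid recomputation per round."""
    seeds = [list(v) for v in vectors[:k]]                 # step (2)
    assignment = [0] * len(vectors)
    for _ in range(rounds):                                # step (5)
        for i, v in enumerate(vectors):                    # steps (3)-(4)
            assignment[i] = max(range(k), key=lambda j: cosine(v, seeds[j]))
        for j in range(k):                                 # recompute seeds
            members = [vectors[i] for i in range(len(vectors))
                       if assignment[i] == j]
            if members:
                seeds[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

Note the structural contrast: `agglomerate` scans all cluster pairs on every merge, which is what makes it slow on large collections, while `flat_partition` only compares each document against k seeds per round.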

Considering the advantages and disadvantages of these two types of clustering, this paper proposes an improved text clustering method based on the vector space model: the LP algorithm. The specific procedure is as follows:

For a given document collection D = (d1, d2, ..., di, ..., dn):

(1) Treat each document di in D as a single-member cluster ci = (di);

(2) Choose one single-member cluster ci as the starting point of clustering;

(3) Among the remaining unclustered samples, find a dj whose distance to ci satisfies the condition: either the point nearest to ci, i.e. the dj with the greatest similarity sim(ci, dj), or any dj whose distance to ci is within the threshold d, i.e. sim(ci, dj) ≥ d. Merging dj into ci forms a new cluster ck = ci ∪ dj;

(4) Repeat step (3) until the distance between ci and its nearest unclustered point dk exceeds the threshold d; one class has then been gathered;

(5) Select another unclustered single-member cluster and repeat steps (3) and (4) to start a new round of clustering, until all single-member clusters ci have been assigned to a cluster.
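The LP procedure above can be sketched as follows; the choice of starting point and the nearest-member reading of the cluster-to-point distance are illustrative assumptions:

```python
def euclidean(u, v):
    """Euclidean distance between two weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def lp_cluster(vectors, d):
    """LP algorithm sketch: grow one cluster at a time, absorbing the nearest
    unclustered point while it lies within the distance threshold d."""
    unclustered = set(range(len(vectors)))
    clusters = []
    while unclustered:                                   # step (5): new rounds
        start = min(unclustered)                         # step (2)
        cluster = [start]
        unclustered.discard(start)
        while unclustered:                               # steps (3)-(4)
            nearest = min(unclustered, key=lambda j: min(
                euclidean(vectors[i], vectors[j]) for i in cluster))
            dist = min(euclidean(vectors[i], vectors[nearest]) for i in cluster)
            if dist > d:
                break                                    # one class finished
            cluster.append(nearest)
            unclustered.discard(nearest)
        clusters.append(cluster)
    return clusters
```

Unlike the agglomerative method, each iteration only searches the unclustered points against the current cluster, never all cluster pairs, which is the source of the speed advantage claimed below.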

The LP algorithm does not compare the similarities between all clusters, so it runs fast and is more practical for large document collections. Moreover, the clustering process does not require the value of k to be determined in advance, which reduces the dependence on domain knowledge and improves flexibility.

3 Experimental Design

This paper uses data from Sogou Labs, the research laboratory of Sohu R&D, whose Internet corpus and link relationship library SOGOU-T provide a large-scale correspondence table of Internet links for verifying the effectiveness and feasibility of various link analysis algorithms. The corpus data are divided into 10 categories (C000007 automobile, C000008 finance, C000010 IT, C000013 health, C000014 sports, C000016 tourism, C000020 education, C000022 recruitment, C000023 culture, C000024 military). The corpus is available for download in three versions: the mini edition, the reduced edition, and the full edition. This paper uses the first two for the experiments. The corpus is organized as follows: a separate folder is set up for each of the 10 categories, and within each folder every document is stored as a .txt file.

The experimental procedure is as follows:

(1) Concatenate all the .txt files under each folder into one large complete file, while preserving each .txt file's category label (in this experiment, the last two digits of the category are kept: 07, 08, ...).

(2) Use ICTCLAS, the Chinese lexical analysis system released by the Institute of Computing Technology, Chinese Academy of Sciences, as an open platform for Chinese natural language processing. With ICTCLAS_Win, perform word segmentation and part-of-speech tagging on the file produced in (1).

(3) Count the frequency of each segmented and tagged word.

(4) Order the segmented words by weight (term frequency) and keep as feature terms those whose weight exceeds a limit (threshold) value. (In this experiment, words with frequency above 100 are retained as feature terms.) At the same time, in view of the characteristics of Chinese, two situations are designed in the experiment to compare the clustering effect for different parts of speech:

1) all parts of speech participate in clustering;

2) only words tagged as nouns are retained.
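Steps (3)-(4) and the two part-of-speech situations can be sketched as follows; the (word, tag) input format and the 'n'-prefixed noun tags are illustrative assumptions modeled on ICTCLAS-style output, and the frequency cutoff is a parameter:

```python
from collections import Counter

def select_features(tagged_words, min_freq=100, nouns_only=False):
    """Count word frequencies and keep words whose frequency exceeds min_freq.
    tagged_words: (word, pos) pairs; pos tags starting with 'n' are treated
    as nouns (an assumption about the tag set)."""
    if nouns_only:
        tagged_words = [(w, p) for w, p in tagged_words if p.startswith("n")]
    freq = Counter(w for w, _ in tagged_words)              # step (3)
    return {w: c for w, c in freq.items() if c > min_freq}  # step (4)
```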

(5) Take the words selected in (4) as the basis of the vector space, construct the vectors, and determine the dimension of the vector space and other parameters.

(6) Represent each corpus file (.txt file) as a space vector. Two kinds of representation are used in the experiment:

1) the traditional vector space representation: (T1, W1, T2, W2, ..., Tn, Wn);

2) the sparse vector space representation: (D1, W1, D2, W2, ..., Dp, Wp, n).

(7) Clustering: the clustering process is both the focus and the main objective of the experiment.

1) Before clustering begins, the text space vectors built in (6) are normalized. In pattern recognition, vector normalization is a very important step; its purpose is to map the statistical distribution of values uniformly into the interval [0, 1], so that the clustering process is equally sensitive to every space vector.

For the traditional space vector (T1, W1, T2, W2, ..., Tn, Wn), each weight is normalized as

Wi' = Wi / (W1^2 + W2^2 + ... + Wn^2)^(1/2), i = 1, 2, ..., n,

where Wi is the original weight of term Ti and Wi' is its normalized weight; the normalized vector has unit length and every Wi' lies in [0, 1].

2) In the experiment, the Euclidean distance is used as the distance between any two text vectors.

For the sparse vector space, the distance is calculated in a way similar to the traditional space vector: the squared differences of matching terms are summed and the arithmetic square root is taken.
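Steps 1) and 2) above can be sketched as follows; representing the sparse vectors as index-to-weight dicts is an implementation choice, not the paper's:

```python
def normalize(weights):
    """L2-normalize a weight vector; every normalized component lies in [0, 1]."""
    norm = sum(w * w for w in weights) ** 0.5
    return [w / norm for w in weights] if norm else list(weights)

def euclidean_dense(u, v):
    """Euclidean distance between two traditional space vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def euclidean_sparse(u, v, n):
    """Euclidean distance between two sparse vectors given as {index: weight}
    dicts; n is the full dimension carried by the sparse representation (not
    needed for the distance itself). Matching terms are differenced, and a
    term present in only one vector contributes its own squared weight."""
    idx = set(u) | set(v)
    return sum((u.get(i, 0.0) - v.get(i, 0.0)) ** 2 for i in idx) ** 0.5
```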

3) The LP algorithm requires a predetermined threshold. The threshold strategy in the experiment is: set an initial threshold (the threshold for single-member clusters, adjusted repeatedly according to the experimental results); when two clusters merge into one, the threshold of the new cluster is obtained by the merging algorithm from the clustering features of the merged clusters.

Suppose two clusters merge, with feature vectors X = (T1, x1, T2, x2, ..., Tn, xn) and Y = (T1, y1, T2, y2, ..., Tn, yn); the feature vector of the new cluster is then computed from X and Y.

Merge theorem: when two clusters merge, the threshold of the combined cluster is expressed in terms of the two original thresholds and dist, where dist denotes the distance between the two feature vectors.
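A hedged sketch of the merging step: the elementwise-mean feature vector and the threshold rule max(dX, dY) + dist(X, Y)/2 used below are illustrative assumptions for concreteness, not the paper's merge theorem.

```python
def merge_clusters(x, y, dx, dy):
    """Merge two clusters with feature vectors x, y and thresholds dx, dy.
    Assumed choices: the merged feature vector is the elementwise mean of
    x and y, and the merged threshold grows by half the distance between
    the two feature vectors."""
    dist = sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    merged = [(a + b) / 2 for a, b in zip(x, y)]
    threshold = max(dx, dy) + dist / 2
    return merged, threshold
```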

4 Data Analysis

To compare the merits of the three clustering methods involved in this experiment, comparative experiments were carried out under the following conditions:

A: all parts of speech are used to construct the space vectors;

B: only nouns are used to construct the space vectors;

C: the traditional vector space representation is used;

D: the sparse vector space representation is used.

Mini edition (SogouC.mini.20061102): 100 documents in total, 10 per category.

Reduced edition (SogouC.reduced.20061102): 10,020 documents in total, 1,002 per category.

Table 1 shows the experimental results, where t (time) is the clustering time in ms and a (accuracy) is the clustering accuracy. The time spent on clustering depends on the specific execution conditions, so there is some variation. The data in the table exclude mutated data points (i.e. bad data) and are averages over many experimental runs.

Table 1 Experimental clustering results

Analysis of the experimental results can be summarized in the following 5 points:

(1) For the reduced edition, all three methods cluster better than for the mini edition. This is because the reduced edition has a larger base data volume, so individual mutated data points have relatively little effect on the clustering.

(2) Using the sparse vector representation reduces clustering time consumption by about 4/5, showing that for high-dimensional vectors the sparse representation can effectively save memory and speed up clustering.

(3) Compared with hierarchical clustering, the LP algorithm's time consumption drops by about 30%. Therefore, when the data volume is large and the real-time requirements are high, the LP algorithm shows its advantage in effectively reducing time consumption.

(4) Compared with the flat partitioning method, the LP algorithm's clustering accuracy increases by 11% to 13%, reaching 77%~83%, keeping the accuracy of clustering within an acceptable range.

(5) In this experiment, the clustering accuracy of the LP algorithm is slightly lower than that of the hierarchical method. I believe the main reason is this: the main idea of hierarchical agglomeration is global optimality, so the two members merged into a cluster at each step are always the pair with the greatest similarity, whereas in the LP algorithm the only measure deciding whether two members belong to the same class is the threshold d. The choice of threshold has a very large effect on the experimental results. Therefore, how to select the initial threshold, and how to dynamically adjust it during clustering, is the next major piece of work.

5 Conclusion

Document clustering plays an important role in text pattern recognition, which is where the value of this paper lies. This paper analyzes the two more mature distance-based text clustering methods, hierarchical agglomeration and flat partitioning, and proposes an improved method, the LP algorithm. Judging from the experimental results, the LP algorithm is faster and more flexible. In follow-up work, the algorithm will be further tested and repeatedly revised and extended on this basis, in order to achieve better clustering and practical effect.