Computing string similarity

2010-08-17  来源:本站原创  分类:CPP  人气:208 

Reprinted from http://blog.csdn.net/liwenjia1981/archive/2010/07/13/5731040.aspx
Programming of the United States 3.3

After reading the title, no clue

Thinking in Solving the book very well, the distance between the first two strings must be a known number, must be less than the sum of the two strings.

By deletion of the two strings have become the empty string.

Recursive method as shown in the book, the code knocked out, there is little difference

view plaincopy to clipboardprint?

#include <stdlib.h>
#include <stdio.h>
#include <string.h>  

void calDistance(char *a, int aBegin, int aEnd, char * b, int bBegin, int bEnd, int& dis);
int min(int a, int b, int c);  

int main()
{
    char a[50];
    char b[50];
    int dis = 0;
    printf("First String :\n");
    scanf("%s", a);
    printf("Second String : \n");
    scanf("%s", b);
    int aLen = strlen(a);
    int bLen = strlen(b);
    calDistance(a,0,aLen-1, b, 0, bLen-1, dis);
    printf("The two string distance is :%d\n", dis);
    return 0;
}  

int min(int a, int b, int c)
{
    if(a < b && a < c)
        return a;
    if(b <a && b < c)
        return b;
    return c;
}  

void calDistance(char *a, int aBegin, int aEnd, char * b, int bBegin, int bEnd, int& dis)
{
    if(aBegin > aEnd)
    {
        if(bBegin > bEnd)
            return ;
        else
        {
            dis = dis + bEnd - bBegin + 1;
            return ;
        }
    }
    else if(bBegin > bEnd)
    {
        dis = dis + aEnd - aBegin + 1;
        return ;
    }  

    if(a[aBegin] == b[bBegin])
        calDistance(a, aBegin + 1, aEnd, b, bBegin + 1, bEnd, dis);
    else
    {
        int num1, num2, num3;
        num1 = num2 = num3 = dis + 1;
        calDistance(a, aBegin + 1, aEnd, b, bBegin, bEnd, num1); // To give  b[bBegin] Before  a[aBegin]
        calDistance(a, aBegin, aEnd, b, bBegin + 1, bEnd, num2); // To give  a[aBegin] Before  b[bBegin]
        calDistance(a, aBegin + 1, aEnd, b, bBegin + 1, bEnd, num3); // Will  a[aBegin] Replacement for  b[bBegin], Or will  b[bBegin] Replacement for  a[aBegin]
        dis = min(num1, num2, num3);
    }
}

Because of the existence of repeated sub-code problem, optimization, calculated using the sub-structure of a temporary problem, once again calling for direct search results.

Related to repeated sub-problems associated with dynamic programming ... the concrete realization of the memorandum, ah need to reference the ...

Reference article:

http://tech.ddvip.com/2009-05/1243159182120785_3.html

DP Solution of the first code switch

Always take full advantage of the dynamic programming algorithm overlapping sub-problems, each sub-problem only through a solution, the solution is stored in a necessary, you can view the table, and each look-up table of time constants.

Dynamic programming algorithm with bottom-up.

view plaincopy to clipboardprint?

/*
* A loop method using dynamic programming.
* Calculate from bottom to top.
*/
int calculateStringDistance(string strA, string strB)
{
     int lenA = (int)strA.length();
     int lenB = (int)strB.length();
     int c[lenA+1][lenB+1]; // Record the distance of all begin points of each string  

      // i: begin point of strA
     // j: begin point of strB
     for(int i = 0; i < lenA; i++) c[i][lenB] = lenA - i;
     for(int j = 0; j < lenB; j++) c[lenA][j] = lenB - j;
     c[lenA][lenB] = 0;  

     for(int i = lenA-1; i >= 0; i--)
         for(int j = lenB-1; j >= 0; j--)
         {
             if(strB[j] == strA[i])
                 c[i][j] = c[i+1][j+1];
             else
                 c[i][j] = minValue(c[i][j+1], c[i+1][j], c[i+1][j+1]) + 1;
         }  

     return c[0][0];
}

Sub memorandum practice:

Memorandum is a variant of dynamic programming, both with the usual dynamic programming efficiency, but also uses a top-down strategy.

Added a note of the recursive algorithm for each sub-problem solution in the form of a table entry in the record. Beginning of each table entry initially contains a special value to indicate that the table entry to be filled. When the recursive algorithm is first encountered in the implementation of a sub-problem, the solution to calculate and fill in the table. Each subsequent encounter of the sub-problem, as long as the view and return to the previous value can be filled.

The following is a recursive algorithm to do the original version of the memorandum, and by a Boolean variable to control whether memoize memorandum, and a Boolean variable to control whether to print debug call process. Interested in reading are available through the control of the two Boolean variables to compare versions of the memorandum and non-complexity version of the memorandum.

view plaincopy to clipboardprint?

#include <iostream>
#define M 100  

using namespace std;  

const bool debug = false; // Whether to print debug info
const bool memoize = true; // Whether to use memoization
unsigned int cnt = 0; // Line number for the debug info  

int memoizedDistance[M][M]; // Matrix for memoiztion
int minValue(int a, int b, int c)
{
     if(a < b && a < c) return a;
     else if(b < a && b < c) return b;
     else return c;
}  

/*
  * A recursive method which can be decorated by memoization.
  * Calculate from top to bottom.
  */
int calculateStringDistance(string strA, int pABegin, int pAEnd, string strB, int pBBegin, int pBEnd)
{
     if(memoize && memoizedDistance[pABegin][pBBegin] >= 0)
         return memoizedDistance[pABegin][pBBegin];  

     if(pABegin > pAEnd)
     {
         if(pBBegin > pBEnd)
         {
             if(memoize)
                 memoizedDistance[pABegin][pBBegin] = 0;
             if(debug)
                 cout << cnt++ << ": m(" << pABegin << "," << pBBegin << ")=0" << endl;
             return 0;
         }
         else
         {
             int temp = pBEnd - pBBegin + 1;
             if(memoize)
                 memoizedDistance[pABegin][pBBegin] = temp;
             if(debug)
                 cout << cnt++ << ": m(" << pABegin << "," << pBBegin << ")=" << temp << endl;
             return temp;
         }
     }  

     if(pBBegin > pBEnd)
     {
         if(pABegin > pAEnd)
         {
             if(memoize)
                 memoizedDistance[pABegin][pBBegin] = 0;
             if(debug)
                 cout << cnt++ << ": m(" << pABegin << "," << pBBegin << ")=0" << endl;
             return 0;
         }
         else
         {
             int temp = pAEnd - pABegin + 1;
             if(memoize)
                 memoizedDistance[pABegin][pBBegin] = temp;
             if(debug)
                 cout << cnt++ << ": m(" << pABegin << "," << pBBegin << ")=" << temp << endl;
             return temp;
         }
     }  

     if(strA[pABegin] == strB[pBBegin])
     {
         int temp = calculateStringDistance(strA, pABegin+1, pAEnd, strB, pBBegin+1, pBEnd);
         if(memoize)
             memoizedDistance[pABegin][pBBegin] = temp;
          if(debug)
             cout << cnt++ << ": m(" << pABegin << "," << pBBegin << ")=" << temp << endl;
         return temp;
     }
     else
     {
         int t1 = calculateStringDistance(strA, pABegin, pAEnd, strB, pBBegin+1, pBEnd);
         int t2 = calculateStringDistance(strA, pABegin+1, pAEnd, strB, pBBegin, pBEnd);
         int t3 = calculateStringDistance(strA, pABegin+1, pAEnd, strB, pBBegin+1, pBEnd);
         int temp = minValue(t1, t2, t3) + 1;
         if(memoize)
             memoizedDistance[pABegin][pBBegin] = temp;
         if(debug)
             cout << cnt++ << ": m(" << pABegin << "," << pBBegin << ")=" << temp << endl;
         return temp;
     }
}  

int main()
{
     if(memoize)
     {
         // initialize the matrix : memoizedDistance[][]
         for(int i = 0; i < M; i++)
             for(int j = 0; j < M; j++)
                 memoizedDistance[i][j] = -1; // -1 means unfilled cell yet
     }  

     string strA = "abcdfef";
     string strB = "a";  

     cout << endl << "Similarity = "
             << 1.0 / (1 + calculateStringDistance(strA, 0, (int)strA.length()-1, strB, 0, (int)strB.length()-1))
             << endl;  

     return 0;
 }

Summary:

Can be calculated, if the dynamic programming or do not have the memorandum, the worst case complexity is about: lenA! * LenB! . Using dynamic programming complexity is O ((lenA +1) * (lenB +1)). Recursive way and do memorandum for the worst case complexity O ((lenA +1) * (lenB +1)).

In practice, if all sub-problems to be calculated at least once, then a bottom-up dynamic programming algorithm is usually better than a top-down method is better to do a memorandum of constant factor, because the former without the cost of recursive and maintenance costs are also smaller tables.

In addition, in some problems, you can also use the dynamic programming algorithm in the table access mode to further reduce the time or space requirements. Or, if the sub problem space in some sub-There is no need to solve the problem, do memorandum approaches have only Xie affirmative requirement that sub-problem solutions advantage for this issue is.

http://blog.csdn.net/jeiwt/archive/2009/12/10/4981876.aspx

-------------------------------------------------- -------------------------------------------------- ----

Two strings have a lot of similarity calculation, are compiled from the algorithm based on the similarity between their own under the summary.

Briefly under the Levenshtein Distance (LD): LD may be to measure the similarity of two strings. The distance between them is that a string into a string in the process of add, delete, modify the value.

For example:

If str1 = "test", str2 = "test", then LD (str1, str2) = 0. Without conversion.
If str1 = "test", str2 = "tent", then LD (str1, str2) = 1. str1 the "s" conversion "n", conversion of a character, so is 1.
If the distance between them greater, indicating they are more different.

Levenshtein distance is the first Russian scientist Vladimir Levenshtein in 1965, invented, with his name. Not spelling, you can call it edit distance (edit distance).

Levenshtein distance can be used to:

Spell checking (spell check)
Speech recognition (statement recognition)
DNA analysis (DNA analysis)
Plagiarism detection (plagiarism detection)
LD with m * n matrix stored distance value. Algorithm about the process:

The length of str1 or str2 is 0 return another string length.
Initialize (n +1) * (m +1) matrix d, and let the first row and column values from 0 to growth.
Scan the two strings (n * m level), if: str1 [i] == str2 [j], with the temp record it as 0. Otherwise recorded as a temp. Then in the matrix d [i] [j] confers d [i-1] [j] +1, d [i] [j-1] +1, d [i-1] [j-1] + temp c are the minimum.
After scanning, returns the last value matrix is d [n] [m]
The final return of their distance. According to the similarity of how it obtained from? Because their maximum distance is the maximum length of two strings. The string is not very sensitive. Similarity of formula I are set to 1 - their distance / maximum string length.

Source:

package  com.chenlb.algorithm;

/**
 *  Edit distance of two string similarity
 *
 *  @author  chenlb 2008-6-24  In the afternoon  06:41:55
  */
public   class  Similarity {

     private   int  min( int  one,  int  two,  int  three) {
         int  min  =  one;
         if (two  <  min) {
            min  =  two;
        }
         if (three  <  min) {
            min  =  three;
        }
         return  min;
    }

     public   int  ld(String str1, String str2) {
         int  d[][];     //  Matrix
         int  n  =  str1.length();
         int  m  =  str2.length();
         int  i;     //  Traverse str1
         int  j;     //  Traverse the str2
         char  ch1;     // str1 By
         char  ch2;     // str2 By
         int  temp;     //  Records of the same character, in a matrix location value increments  , It is not 0  1
         if (n  ==   0 ) {
             return  m;
        }
         if (m  ==   0 ) {
             return  n;
        }
        d  =   new   int [n + 1 ][m + 1 ];
         for (i = 0 ; i <= n; i ++ ) {     //  Initialize the first column
            d[i][ 0 ]  =  i;
        }
         for (j = 0 ; j <= m; j ++ ) {     //  Initialize the first row
            d[ 0 ][j]  =  j;
        }
         for (i = 1 ; i <= n; i ++ ) {     //  Traverse  str1
            ch1  =  str1.charAt(i - 1 );
             //  Go to match  str2
             for (j = 1 ; j <= m; j ++ ) {
                ch2  =  str2.charAt(j - 1 );
                 if (ch1  ==  ch2) {
                    temp  =   0 ;
                }  else  {
                    temp  =   1 ;
                }
                 //  Top left + 1,  +1,  Corner + take minimum temp
                d[i][j]  =  min(d[i - 1 ][j] + 1 , d[i][j - 1 ] + 1 , d[i - 1 ][j - 1 ] + temp);
            }
        }
         return  d[n][m];
    }

     public   double  sim(String str1, String str2) {
         int  ld  =  ld(str1, str2);
         return   1   -  ( double ) ld  /  Math.max(str1.length(), str2.length());
    }

     public   static   void  main(String[] args) {
        Similarity s  =   new  Similarity();
        String str1  =   " chenlb.blogjava.net " ;
        String str2  =   " chenlb.javaeye.com " ;
        System.out.println( " ld= " + s.ld(str1, str2));
        System.out.println( " sim= " + s.sim(str1, str2));
    }
}

I do not know the formula sim method is reasonable and think poorly thought, ^ _ ^

相关文章
  • Computing string similarity 2010-08-17

    Reprinted from http://blog.csdn.net/liwenjia1981/archive/2010/07/13/5731040.aspx Programming of the United States 3.3 After reading the title, no clue Thinking in Solving the book very well, the distance between the first two strings must be a known

  • String similarity algorithm (Levenshtein Distance algorithm) 2010-12-26

    Programming Software Engineering II Summary Title: a string by adding a character, delete a character, replace a character string by another, assumptions, we convert from a string A string of B, in front of three kinds of operations The minimum numbe

  • Levenshtein string similarity algorithm 2010-10-11

    Edit distance algorithm was first proposed by Russian scientists Levenshtein, it is also known as Levenshtein Distance. A string can be by adding a character, delete a character, replace a character string by another, assumptions, we convert from a s

  • Php string similarity comparison 2011-04-06

    In addition to using cookies, IP restrictions on technology, we can use PHP's own band similar_text function to determine the similarity of the user posting the content. similar_text () function to calculate the two strings match the number of charac

  • Public String substring with maximum similarity String Process (2) 2010-04-06

    Most common sub-string: 2009-11-27 10:421. Levenshtein Distance The algorithm is also known as "edit distance", used to calculate the similarity of two strings. The principle is very simple, it is returned to the first string (delete, insert, re

  • String and the string of maximum similarity 2010-09-09

    Most common sub-string: The algorithm is also known as "edit distance", used to calculate the similarity of two strings. The principle is very simple, it is returned to the first string (delete, insert, replace) into a second string edits. Times

  • [String handling 2] string edit distance algorithm 2010-03-15

    We look at a practical application. Many modern search technology to provide high quality, efficient service as the target. For example: baidu, google, sousou and other well-known full-text search system. When we enter a wrong query = "Jave" whe

  • [String handling 6] LCS longest common subsequence 2010-03-23

    LCS: also known as the longest common subsequence. One sub-sequence (subsequence) of the concept is different from the string substring. It is a continuous but not necessarily in sequence from the string X characters in the sequence. For example: str

  • PHP string handling functions Daquan 2010-09-03

    AddSlashes: string to slash. bin2hex: binary conversion hexadecimal. Chop: remove the blank row. Chr: return value of the character sequence. chunk_split: the string into short paragraphs. convert_cyr_string: convert strings into other strings Guslav

  • PHP string functions study notes 2011-01-04

    addcslashes() Add the specified character before the backslash addslashes() Pre-defined character in the specified add a backslash before bin2hex() ASCII character string to convert to hexadecimal values chop() rtrim() Alias chr() Returns the value f

  • Arch-03-15-more pictures of the similarity 2011-07-14

    More pictures of the similarity Obviously strange and difficult, the first collection point algorithm make up classes. (1) computing image similarity - "Python can also be" one of the http://blog.csdn.net/lanphaday/article/details/2325027 (2) di

  • PHP character processing function set 2010-12-09

    AddSlashes: string to slash. bin2hex: convert binary hex. Chop: remove the blank row. Chr: Returns the ordinal value of character. chunk_split: the string into small pieces. convert_cyr_string: convert a string into another string Cyrillic. crypt: a

  • Jaro–Winkler distance 2013-06-29

    From Wikipedia, the free encyclopedia (Redirected from Jaro-Winkler distance) This article is about the measure. For other uses, see Jaro. This article may contain original research. Please improve it by verifying the claims made and adding inline ci

  • Lucene on the realization of Similarity custom sorting 2010-03-29

    Opening Remarks: As a human resources Web site search function, not only need to test the search performance and efficiency of filtration, and needs attention to user experience is mainly reflected in user satisfaction with search results. We all kno

  • String is the sum of 100,000 What happens?? 2010-03-29

    Recent inspection of a company's platform, BUG, BUG specific symptoms are: Platform users to upload an attachment, the author need to wait for 30-50 minutes, and sometimes simply do not pass up. Solve the BUG: 1, I would first check the platform log

  • JAVA in the floating-point computing 2010-03-29

    Of the problem: If we compile this program will be run the following to see? public class Test ( public static void main (String args []) ( System.out.println (0.05 +0.01); System.out. println (1.0-0.42); System.out.println (4.015 * 100); System.out.

  • StringTokenizer: String separated by parsing type 2010-05-01

    StringTokenizer: String separated by parsing type is: java.util package. 1, constructor. 1. StringTokenizer (String str): construct a StringTokenizer to parse the object str. java default separator is "space", "tab ( '\ t')"," new

  • Hadoop - Mass file distributed computing program 2010-03-06

    http://blog.csdn.net/calvinxiu/archive/2007/02/09/1506112.aspx Hadoop is a Java implementation of the Google MapReduce. MapReduce is a simplified distributed programming model that allows automatically distributed to a large cluster composed of ordin

  • Java, byte hexadecimal string with 16 interchangeable 2009-11-28

    Java, said the occupation of byte binary 8-bit, and we know that each of 16 hexadecimal characters need to use 4-bit binary bits to represent the (23 + 22 + 21 + 20 = 15), so we can convert each byte two corresponding 16 hexadecimal characters, that

  • [Change] c # grid computing was a DEM map of the area below a certain height, and display 2010-03-16

    IRaster pRaster = pRasLyr.Raster; // Gets the known raster layer Raster object pRasLyr IGeoDataset tempGeodata = pRaster as IGeoDataset; IMapAlgebraOp rsalgebra = new RasterMapAlgebraOpClass(); // Sets the raster operation space IRasterAnalysisEnviro