Java中相似string的比较

我想比较几个string,find最相似的string。 我想知道是否有任何图书馆,方法或最佳做法,将返回哪些string更类似于其他string。 例如:

  • “狐狸跳” – >“狐狸跳”
  • “狐狸跳” – >“狐狸”

这个比较将返回第一个比第二个更类似。

我想我需要一些方法,如:

double similarityIndex(String s1, String s2) 

有什么地方有这样的事情吗?

编辑:我为什么要这样做? 我正在编写一个脚本,将MS Project文件的输出与处理任务的某个遗留系统的输出进行比较。 由于遗留系统的字段宽度非常有限,因此在添加值时会缩短说明的范围。 我想要一些半自动的方式来查找MS Project中的哪些条目与系统上的条目类似,以便我可以获取生成的密钥。 它有缺点,因为它仍然需要手动检查,但它会节省很多工作

是的,有许多有据可查的algorithm,如:

  • 余弦相似
  • Jaccard相似
  • 骰子的系数
  • 匹配相似性
  • 重叠相似性
  • 等等

或者, 你可以检查这个

检查这些项目:

计算两个string在0%-100%之间相似度的常用方法,正如许多库中所使用的,是测量多less(以%计),您必须更改较长的string以将其变成较短的string:

 /** * Calculates the similarity (a number within 0 and 1) between two strings. */ public static double similarity(String s1, String s2) { String longer = s1, shorter = s2; if (s1.length() < s2.length()) { // longer should always have greater length longer = s2; shorter = s1; } int longerLength = longer.length(); if (longerLength == 0) { return 1.0; /* both strings are zero length */ } return (longerLength - editDistance(longer, shorter)) / (double) longerLength; } // you can use StringUtils.getLevenshteinDistance() as the editDistance() function // full copy-paste working code is below 

计算editDistance()

上面的editDistance()函数可以计算两个string之间的编辑距离 。 这一步有几个实现 ,每个都可以更好地适应特定的情况。 最常见的是Levenshtein距离algorithm ,我们将在下面的示例中使用它(对于非常大的string,其他algorithm可能执行得更好)。

这里有两个选项来计算编辑距离:

  • 您可以使用Apache Commons Lang的Levenshtein距离实现: StringUtils.getLevenshteinDistance(CharSequence s, CharSequence t)
  • 在你自己的实现。 下面你会find一个示例实现。

工作示例:

在这里看到在线演示。

 public class StringSimilarity { /** * Calculates the similarity (a number within 0 and 1) between two strings. */ public static double similarity(String s1, String s2) { String longer = s1, shorter = s2; if (s1.length() < s2.length()) { // longer should always have greater length longer = s2; shorter = s1; } int longerLength = longer.length(); if (longerLength == 0) { return 1.0; /* both strings are zero length */ } /* // If you have StringUtils, you can use it to calculate the edit distance: return (longerLength - StringUtils.getLevenshteinDistance(longer, shorter)) / (double) longerLength; */ return (longerLength - editDistance(longer, shorter)) / (double) longerLength; } // Example implementation of the Levenshtein Edit Distance // See http://rosettacode.org/wiki/Levenshtein_distance#Java public static int editDistance(String s1, String s2) { s1 = s1.toLowerCase(); s2 = s2.toLowerCase(); int[] costs = new int[s2.length() + 1]; for (int i = 0; i <= s1.length(); i++) { int lastValue = i; for (int j = 0; j <= s2.length(); j++) { if (i == 0) costs[j] = j; else { if (j > 0) { int newValue = costs[j - 1]; if (s1.charAt(i - 1) != s2.charAt(j - 1)) newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1; costs[j - 1] = lastValue; lastValue = newValue; } } } if (i > 0) costs[s2.length()] = lastValue; } return costs[s2.length()]; } public static void printSimilarity(String s, String t) { System.out.println(String.format( "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t)); } public static void main(String[] args) { printSimilarity("", ""); printSimilarity("1234567890", "1"); printSimilarity("1234567890", "123"); printSimilarity("1234567890", "1234567"); printSimilarity("1234567890", "1234567890"); printSimilarity("1234567890", "1234567980"); printSimilarity("47/2010", "472010"); printSimilarity("47/2010", "472011"); printSimilarity("47/2010", "AB.CDEF"); printSimilarity("47/2010", "4B.CDEFG"); printSimilarity("47/2010", "AB.CDEFG"); printSimilarity("The quick fox jumped", "The fox jumped"); printSimilarity("The quick fox jumped", "The fox"); printSimilarity("kitten", "sitting"); } } 

输出:

 1.000 is the similarity between "" and "" 0.100 is the similarity between "1234567890" and "1" 0.300 is the similarity between "1234567890" and "123" 0.700 is the similarity between "1234567890" and "1234567" 1.000 is the similarity between "1234567890" and "1234567890" 0.800 is the similarity between "1234567890" and "1234567980" 0.857 is the similarity between "47/2010" and "472010" 0.714 is the similarity between "47/2010" and "472011" 0.000 is the similarity between "47/2010" and "AB.CDEF" 0.125 is the similarity between "47/2010" and "4B.CDEFG" 0.000 is the similarity between "47/2010" and "AB.CDEFG" 0.700 is the similarity between "The quick fox jumped" and "The fox jumped" 0.350 is the similarity between "The quick fox jumped" and "The fox" 0.571 is the similarity between "kitten" and "sitting" 

您可以使用Levenshtein距离来计算两个string之间的差异。 http://en.wikipedia.org/wiki/Levenshtein_distance

我将Levenshtein距离algorithm翻译成JavaScript:

 String.prototype.LevenshteinDistance = function (s2) { var array = new Array(this.length + 1); for (var i = 0; i < this.length + 1; i++) array[i] = new Array(s2.length + 1); for (var i = 0; i < this.length + 1; i++) array[i][0] = i; for (var j = 0; j < s2.length + 1; j++) array[0][j] = j; for (var i = 1; i < this.length + 1; i++) { for (var j = 1; j < s2.length + 1; j++) { if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1]; else { array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1); array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1); } } } return array[this.length][s2.length]; }; 

那里确实有很多string相似度测量:

  • Levenshtein编辑距离;
  • Damerau-Levenshtein距离;
  • Jaro-Winkler相似;
  • 最长公共子序列编辑距离;
  • Q-Gram(Ukkonen);
  • n-Gram距离(Kondrak);
  • Jaccard指数;
  • 索伦森 – 骰子系数;
  • 余弦相似;

你可以在这里find解释和Java实现: https : //github.com/tdebatty/java-string-similarity

你可以使用apache commons java库来实现这个function。 看看这两个函数:
– getLevenshteinDistance
– getFuzzyDistance

理论上,您可以比较编辑距离 。

这通常是使用编辑距离度量来完成的。 search“编辑距离java”会出现一些库,就像这个一样。

听起来像是一个剽窃发现者 ,如果你的string变成一个文件。 也许用这个词search会变得很好。

“编程集体智慧”有一个确定两个文件是否相似的章节。 这个代码是用Python编写的,但是它很干净而且容易移植。

感谢第一个回答者,我认为有computeCompitDistance(s1,s2)的2个计算。 由于花费时间太多,决定改善代码的性能。 所以:

 public class LevenshteinDistance { public static int computeEditDistance(String s1, String s2) { s1 = s1.toLowerCase(); s2 = s2.toLowerCase(); int[] costs = new int[s2.length() + 1]; for (int i = 0; i <= s1.length(); i++) { int lastValue = i; for (int j = 0; j <= s2.length(); j++) { if (i == 0) { costs[j] = j; } else { if (j > 0) { int newValue = costs[j - 1]; if (s1.charAt(i - 1) != s2.charAt(j - 1)) { newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1; } costs[j - 1] = lastValue; lastValue = newValue; } } } if (i > 0) { costs[s2.length()] = lastValue; } } return costs[s2.length()]; } public static void printDistance(String s1, String s2) { double similarityOfStrings = 0.0; int editDistance = 0; if (s1.length() < s2.length()) { // s1 should always be bigger String swap = s1; s1 = s2; s2 = swap; } int bigLen = s1.length(); editDistance = computeEditDistance(s1, s2); if (bigLen == 0) { similarityOfStrings = 1.0; /* both strings are zero length */ } else { similarityOfStrings = (bigLen - editDistance) / (double) bigLen; } ////////////////////////// //System.out.println(s1 + "-->" + s2 + ": " + // editDistance + " (" + similarityOfStrings + ")"); System.out.println(editDistance + " (" + similarityOfStrings + ")"); } public static void main(String[] args) { printDistance("", ""); printDistance("1234567890", "1"); printDistance("1234567890", "12"); printDistance("1234567890", "123"); printDistance("1234567890", "1234"); printDistance("1234567890", "12345"); printDistance("1234567890", "123456"); printDistance("1234567890", "1234567"); printDistance("1234567890", "12345678"); printDistance("1234567890", "123456789"); printDistance("1234567890", "1234567890"); printDistance("1234567890", "1234567980"); printDistance("47/2010", "472010"); printDistance("47/2010", "472011"); printDistance("47/2010", "AB.CDEF"); printDistance("47/2010", "4B.CDEFG"); printDistance("47/2010", "AB.CDEFG"); printDistance("The quick fox jumped", "The fox jumped"); printDistance("The quick fox jumped", "The fox"); printDistance("The quick fox jumped", "The quick fox jumped off the balcany"); printDistance("kitten", "sitting"); printDistance("rosettacode", "raisethysword"); printDistance(new StringBuilder("rosettacode").reverse().toString(), new StringBuilder("raisethysword").reverse().toString()); for (int i = 1; i < args.length; i += 2) { printDistance(args[i - 1], args[i]); } } }