一个更好的相似性sortingalgorithm的变长string

我正在寻找一个string相似度algorithm,可变长度string比通常build议(levenshtein距离,soundex等)产生更好的结果。

例如,

给定stringA:“Robert”,

然后串B:“艾米·罗伯逊”

会比比赛更好

stringC:“理查德”

另外,优选地,该algorithm应该是语言不可知的(也可以用非英语的语言)。

Catalysoft的Simon White写了一篇关于一个非常聪明的algorithm的文章,该algorithm比较了相邻字符对,这对我的目的来说非常合适:

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon有一个Java版本的algorithm,下面我写了一个PL / Ruby版本(从Mark Wong-VanHaren的相关论坛入口评论中的简单ruby版本中获得),以便我可以在我的PostgreSQL查询中使用它:

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar) RETURNS float8 AS ' str1.downcase! pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject { |pair| pair.include? " "} str2.downcase! pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject { |pair| pair.include? " "} union = pairs1.size + pairs2.size intersection = 0 pairs1.each do |p1| 0.upto(pairs2.size-1) do |i| if p1 == pairs2[i] intersection += 1 pairs2.slice!(i) break end end end (2.0 * intersection) / union ' LANGUAGE 'plruby'; 

奇迹般有效!

marzagao的答案是伟大的。 我把它转换为C#,所以我想我会在这里发布:

Pastebin链接

 /// <summary> /// This class implements string comparison algorithm /// based on character pair similarity /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html /// </summary> public class SimilarityTool { /// <summary> /// Compares the two strings based on letter pair matches /// </summary> /// <param name="str1"></param> /// <param name="str2"></param> /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns> public double CompareStrings(string str1, string str2) { List<string> pairs1 = WordLetterPairs(str1.ToUpper()); List<string> pairs2 = WordLetterPairs(str2.ToUpper()); int intersection = 0; int union = pairs1.Count + pairs2.Count; for (int i = 0; i < pairs1.Count; i++) { for (int j = 0; j < pairs2.Count; j++) { if (pairs1[i] == pairs2[j]) { intersection++; pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success break; } } } return (2.0 * intersection) / union; } /// <summary> /// Gets all letter pairs for each /// individual word in the string /// </summary> /// <param name="str"></param> /// <returns></returns> private List<string> WordLetterPairs(string str) { List<string> AllPairs = new List<string>(); // Tokenize the string and put the tokens/words into an array string[] Words = Regex.Split(str, @"\s"); // For each word for (int w = 0; w < Words.Length; w++) { if (!string.IsNullOrEmpty(Words[w])) { // Find the pairs of characters String[] PairsInWord = LetterPairs(Words[w]); for (int p = 0; p < PairsInWord.Length; p++) { AllPairs.Add(PairsInWord[p]); } } } return AllPairs; } /// <summary> /// Generates an array containing every /// two consecutive letters in the input string /// </summary> /// <param name="str"></param> /// <returns></returns> private string[] LetterPairs(string str) { int numPairs = str.Length - 1; string[] pairs = new string[numPairs]; for (int i = 0; i < numPairs; i++) { pairs[i] = str.Substring(i, 2); } return pairs; } } 

这是marzagao的答案的另一个版本,这是用Python编写的:

 def get_bigrams(string): """ Take a string and return a list of bigrams. """ s = string.lower() return [s[i:i+2] for i in list(range(len(s) - 1))] def string_similarity(str1, str2): """ Perform bigram comparison between two strings and return a percentage match in decimal form. """ pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) union = len(pairs1) + len(pairs2) hit_count = 0 for x in pairs1: for y in pairs2: if x == y: hit_count += 1 break return (2.0 * hit_count) / union if __name__ == "__main__": """ Run a test using the example taken from: http://www.catalysoft.com/articles/StrikeAMatch.html """ w1 = 'Healed' words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold'] for w2 in words: print('Healed --- ' + w2) print(string_similarity(w1, w2)) print() 

下面是Simon White提出的StrikeAatchalgorithm的PHP实现。 优点(就像它在链接中说的)是:

  • 词汇相似性的真实反映 – 小差异的string应被认为是相似的。 特别是,重要的子string重叠应指向string之间的高度相似性。

  • 对词序改变的稳健性 – 包含相同词语但是以不同顺序的两个串应被认为是相似的。 另一方面,如果一个string只是另一个字符中包含的字符的随机字母,那么它应该(通常)被认为是不相似的。

  • 语言独立性 – algorithm不仅应该用英语,而且要用许多不同的语言。

 <?php /** * LetterPairSimilarity algorithm implementation in PHP * @author Igal Alkon * @link http://www.catalysoft.com/articles/StrikeAMatch.html */ class LetterPairSimilarity { /** * @param $str * @return mixed */ private function wordLetterPairs($str) { $allPairs = array(); // Tokenize the string and put the tokens/words into an array $words = explode(' ', $str); // For each word for ($w = 0; $w < count($words); $w++) { // Find the pairs of characters $pairsInWord = $this->letterPairs($words[$w]); for ($p = 0; $p < count($pairsInWord); $p++) { $allPairs[] = $pairsInWord[$p]; } } return $allPairs; } /** * @param $str * @return array */ private function letterPairs($str) { $numPairs = mb_strlen($str)-1; $pairs = array(); for ($i = 0; $i < $numPairs; $i++) { $pairs[$i] = mb_substr($str,$i,2); } return $pairs; } /** * @param $str1 * @param $str2 * @return float */ public function compareStrings($str1, $str2) { $pairs1 = $this->wordLetterPairs(strtoupper($str1)); $pairs2 = $this->wordLetterPairs(strtoupper($str2)); $intersection = 0; $union = count($pairs1) + count($pairs2); for ($i=0; $i < count($pairs1); $i++) { $pair1 = $pairs1[$i]; $pairs2 = array_values($pairs2); for($j = 0; $j < count($pairs2); $j++) { $pair2 = $pairs2[$j]; if ($pair1 === $pair2) { $intersection++; unset($pairs2[$j]); break; } } } return (2.0*$intersection)/$union; } } 

@John Rutledge的答案的缩写版本:

 def get_bigrams(string): ''' Takes a string and returns a list of bigrams ''' s = string.lower() return {s[i:i+2] for i in xrange(len(s) - 1)} def string_similarity(str1, str2): ''' Perform bigram comparison between two strings and return a percentage match in decimal form ''' pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) intersection = pairs1 & pairs2 return (2.0 * len(intersection)) / (len(pairs1) + len(pairs2)) 

这个讨论真的很有帮助,谢谢。 我将algorithm转换为VBA以便与Excel一起使用,并编写几个版本的工作表函数,一个用于简单比较一对string,另一个用于将一个string与string的范围/数组进行比较。 strSimLookup版本返回最后一个最佳匹配作为string,数组索引或相似性度量标准。

这个实现产生了Simon White网站上亚马逊例子中列出的相同结果,只有less数例外情况是低分的。 不知道差异在哪里蔓延,可能是VBA的拆分function,但我没有调查,因为它对我的目的工作正常。

 'Implements functions to rate how similar two strings are on 'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar) 'Source:  http://www.catalysoft.com/articles/StrikeAMatch.html 'Author: Bob Chatham, bob.chatham at gmail.com '9/12/2010 Option Explicit Public Function stringSimilarity(str1 As String, str2 As String) As Variant 'Simple version of the algorithm that computes the similiarity metric 'between two strings. 'NOTE: This verision is not efficient to use if you're comparing one string 'with a range of other values as it will needlessly calculate the pairs for the 'first string over an over again; use the array-optimized version for this case. Dim sPairs1 As Collection Dim sPairs2 As Collection Set sPairs1 = New Collection Set sPairs2 = New Collection WordLetterPairs str1, sPairs1 WordLetterPairs str2, sPairs2 stringSimilarity = SimilarityMetric(sPairs1, sPairs2) Set sPairs1 = Nothing Set sPairs2 = Nothing End Function Public Function strSimA(str1 As Variant, rRng As Range) As Variant 'Return an array of string similarity indexes for str1 vs every string in input range rRng Dim sPairs1 As Collection Dim sPairs2 As Collection Dim arrOut As Variant Dim l As Long, j As Long Set sPairs1 = New Collection WordLetterPairs CStr(str1), sPairs1 l = rRng.Count ReDim arrOut(1 To l) For j = 1 To l Set sPairs2 = New Collection WordLetterPairs CStr(rRng(j)), sPairs2 arrOut(j) = SimilarityMetric(sPairs1, sPairs2) Set sPairs2 = Nothing Next j strSimA = Application.Transpose(arrOut) End Function Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant 'Return either the best match or the index of the best match 'depending on returnTYype parameter) between str1 and strings in rRng) ' returnType = 0 or omitted: returns the best matching string ' returnType = 1 : returns the index of the best matching string ' returnType = 2 : returns the similarity metric Dim sPairs1 As Collection Dim sPairs2 As Collection Dim metric, bestMetric As Double Dim i, iBest As Long Const RETURN_STRING As Integer = 0 Const RETURN_INDEX As Integer = 1 Const RETURN_METRIC As Integer = 2 If IsMissing(returnType) Then returnType = RETURN_STRING Set sPairs1 = New Collection WordLetterPairs CStr(str1), sPairs1 bestMetric = -1 iBest = -1 For i = 1 To rRng.Count Set sPairs2 = New Collection WordLetterPairs CStr(rRng(i)), sPairs2 metric = SimilarityMetric(sPairs1, sPairs2) If metric > bestMetric Then bestMetric = metric iBest = i End If Set sPairs2 = Nothing Next i If iBest = -1 Then strSimLookup = CVErr(xlErrValue) Exit Function End If Select Case returnType Case RETURN_STRING strSimLookup = CStr(rRng(iBest)) Case RETURN_INDEX strSimLookup = iBest Case Else strSimLookup = bestMetric End Select End Function Public Function strSim(str1 As String, str2 As String) As Variant Dim ilen, iLen1, ilen2 As Integer iLen1 = Len(str1) ilen2 = Len(str2) If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1 strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen)) End Function Sub WordLetterPairs(str As String, pairColl As Collection) 'Tokenize str into words, then add all letter pairs to pairColl Dim Words() As String Dim word, nPairs, pair As Integer Words = Split(str) If UBound(Words) < 0 Then Set pairColl = Nothing Exit Sub End If For word = 0 To UBound(Words) nPairs = Len(Words(word)) - 1 If nPairs > 0 Then For pair = 1 To nPairs pairColl.Add Mid(Words(word), pair, 2) Next pair End If Next word End Sub Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant 'Helper function to calculate similarity metric given two collections of letter pairs. 'This function is designed to allow the pair collections to be set up separately as needed. 'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection 'if this is not the desired behavior. 'Also assumes that collections will be deallocated somewhere else Dim Intersect As Double Dim Union As Double Dim i, j As Long If sPairs1.Count = 0 Or sPairs2.Count = 0 Then SimilarityMetric = CVErr(xlErrNA) Exit Function End If Union = sPairs1.Count + sPairs2.Count Intersect = 0 For i = 1 To sPairs1.Count For j = 1 To sPairs2.Count If StrComp(sPairs1(i), sPairs2(j)) = 0 Then Intersect = Intersect + 1 sPairs2.Remove j Exit For End If Next j Next i SimilarityMetric = (2 * Intersect) / Union End Function 

对不起,答案不是笔者发明的。 这是数字设备公司(Digital Equipment Corporation)首先提出的一种众所周知的algorithm,通常被称为“混搭”(shingling)。

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf

我把Simon White的algorithm翻译成PL / pgSQL。 这是我的贡献。

 <!-- language: lang-sql --> create or replace function spt1.letterpairs(in p_str varchar) returns varchar as $$ declare v_numpairs integer := length(p_str)-1; v_pairs varchar[]; begin for i in 1 .. v_numpairs loop v_pairs[i] := substr(p_str, i, 2); end loop; return v_pairs; end; $$ language 'plpgsql'; --=================================================================== create or replace function spt1.wordletterpairs(in p_str varchar) returns varchar as $$ declare v_allpairs varchar[]; v_words varchar[]; v_pairsinword varchar[]; begin v_words := regexp_split_to_array(p_str, '[[:space:]]'); for i in 1 .. array_length(v_words, 1) loop v_pairsinword := spt1.letterpairs(v_words[i]); if v_pairsinword is not null then for j in 1 .. array_length(v_pairsinword, 1) loop v_allpairs := v_allpairs || v_pairsinword[j]; end loop; end if; end loop; return v_allpairs; end; $$ language 'plpgsql'; --=================================================================== create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY) returns anyarray as $$ select array(select unnest($1) intersect select unnest($2)) $$ language 'sql'; --=================================================================== create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar) returns float as $$ declare v_pairs1 varchar[]; v_pairs2 varchar[]; v_intersection integer; v_union integer; begin v_pairs1 := wordletterpairs(upper(p_str1)); v_pairs2 := wordletterpairs(upper(p_str2)); v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1); v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1); return (2.0 * v_intersection / v_union); end; $$ language 'plpgsql'; 

该algorithm的PHP版本更快:

 /** * * @param $str * @return mixed */ private static function wordLetterPairs ($str) { $allPairs = array(); // Tokenize the string and put the tokens/words into an array $words = explode(' ', $str); // For each word for ($w = 0; $w < count($words); $w ++) { // Find the pairs of characters $pairsInWord = self::letterPairs($words[$w]); for ($p = 0; $p < count($pairsInWord); $p ++) { $allPairs[$pairsInWord[$p]] = $pairsInWord[$p]; } } return array_values($allPairs); } /** * * @param $str * @return array */ private static function letterPairs ($str) { $numPairs = mb_strlen($str) - 1; $pairs = array(); for ($i = 0; $i < $numPairs; $i ++) { $pairs[$i] = mb_substr($str, $i, 2); } return $pairs; } /** * * @param $str1 * @param $str2 * @return float */ public static function compareStrings ($str1, $str2) { $pairs1 = self::wordLetterPairs(mb_strtolower($str1)); $pairs2 = self::wordLetterPairs(mb_strtolower($str2)); $union = count($pairs1) + count($pairs2); $intersection = count(array_intersect($pairs1, $pairs2)); return (2.0 * $intersection) / $union; } 

对于我的数据(约2300比较), Igal Alkon解决scheme的运行时间为0.58秒,而我的数据为0.35秒。

在美丽的斯卡拉版本:

  def pairDistance(s1: String, s2: String): Double = { def strToPairs(s: String, acc: List[String]): List[String] = { if (s.size < 2) acc else strToPairs(s.drop(1), if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2))) } val lst1 = strToPairs(s1.toUpperCase, List()) val lst2 = strToPairs(s2.toUpperCase, List()) (2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size) } 

string相似性度量标准包含了string比较中使用的许多不同度量标准的综述( 维基百科也有一个概述)。 这些指标中的大部分都是在图书馆simmetrics中实施的。

另一个没有包含在给定的概述中的度量的例子是例如压缩距离 (试图近似Kolmogorov的复杂性 ),其可以用于比您所呈现的更长的文本。

您也可以考虑查看更广泛的自然语言处理主题。 这些 R包可以让你快速入门(或者至less给出一些想法)。

还有最后一个编辑 – 在SO上search关于这个主题的其他问题,有很多相关的问题。

这是R版本:

 get_bigrams <- function(str) { lstr = tolower(str) bigramlst = list() for(i in 1:(nchar(str)-1)) { bigramlst[[i]] = substr(str, i, i+1) } return(bigramlst) } str_similarity <- function(str1, str2) { pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) unionlen = length(pairs1) + length(pairs2) hit_count = 0 for(x in 1:length(pairs1)){ for(y in 1:length(pairs2)){ if (pairs1[[x]] == pairs2[[y]]) hit_count = hit_count + 1 } } return ((2.0 * hit_count) / unionlen) } 

在这个algorithm的启发下,在C99上发布了marzagao的答案

 double dice_match(const char *string1, const char *string2) { //check fast cases if (((string1 != NULL) && (string1[0] == '\0')) || ((string2 != NULL) && (string2[0] == '\0'))) { return 0; } if (string1 == string2) { return 1; } size_t strlen1 = strlen(string1); size_t strlen2 = strlen(string2); if (strlen1 < 2 || strlen2 < 2) { return 0; } size_t length1 = strlen1 - 1; size_t length2 = strlen2 - 1; double matches = 0; int i = 0, j = 0; //get bigrams and compare while (i < length1 && j < length2) { char a[3] = {string1[i], string1[i + 1], '\0'}; char b[3] = {string2[j], string2[j + 1], '\0'}; int cmp = strcmpi(a, b); if (cmp == 0) { matches += 2; } i++; j++; } return matches / (length1 + length2); } 

一些基于原文的testing:

 #include <stdio.h> void article_test1() { char *string1 = "FRANCE"; char *string2 = "FRENCH"; printf("====%s====\n", __func__); printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100); } void article_test2() { printf("====%s====\n", __func__); char *string = "Healed"; char *ss[] = {"Heard", "Healthy", "Help", "Herded", "Sealed", "Sold"}; int correct[] = {44, 55, 25, 40, 80, 0}; for (int i = 0; i < 6; ++i) { printf("%2.f%% == %d%%\n", dice_match(string, ss[i]) * 100, correct[i]); } } void multicase_test() { char *string1 = "FRaNcE"; char *string2 = "fREnCh"; printf("====%s====\n", __func__); printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100); } void gg_test() { char *string1 = "GG"; char *string2 = "GGGGG"; printf("====%s====\n", __func__); printf("%2.f%% != 100%%\n", dice_match(string1, string2) * 100); } int main() { article_test1(); article_test2(); multicase_test(); gg_test(); return 0; } 

基于Michael La Voie的C#版本,根据请求使其成为扩展方法,下面是我想到的。 这样做的主要好处是,您可以通过百分比匹配来sorting通用列表。 例如,考虑在对象中有一个名为“City”的string字段。 用户search“Chester”,并且想要按照匹配的降序返回结果。 例如,你想让切斯特的文字匹配出现在罗切斯特之前。 为此,请向对象添加两个新属性:

  public string SearchText { get; set; } public double PercentMatch { get { return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper()); } } 

然后在每个对象上,将SearchText设置为用户search的内容。 那么你可以用下面这样的东西来轻松地sorting:

  zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch); 

下面是做一个扩展方法的细微修改:

  /// <summary> /// This class implements string comparison algorithm /// based on character pair similarity /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html /// </summary> public static double PercentMatchTo(this string str1, string str2) { List<string> pairs1 = WordLetterPairs(str1.ToUpper()); List<string> pairs2 = WordLetterPairs(str2.ToUpper()); int intersection = 0; int union = pairs1.Count + pairs2.Count; for (int i = 0; i < pairs1.Count; i++) { for (int j = 0; j < pairs2.Count; j++) { if (pairs1[i] == pairs2[j]) { intersection++; pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success break; } } } return (2.0 * intersection) / union; } /// <summary> /// Gets all letter pairs for each /// individual word in the string /// </summary> /// <param name="str"></param> /// <returns></returns> private static List<string> WordLetterPairs(string str) { List<string> AllPairs = new List<string>(); // Tokenize the string and put the tokens/words into an array string[] Words = Regex.Split(str, @"\s"); // For each word for (int w = 0; w < Words.Length; w++) { if (!string.IsNullOrEmpty(Words[w])) { // Find the pairs of characters String[] PairsInWord = LetterPairs(Words[w]); for (int p = 0; p < PairsInWord.Length; p++) { AllPairs.Add(PairsInWord[p]); } } } return AllPairs; } /// <summary> /// Generates an array containing every /// two consecutive letters in the input string /// </summary> /// <param name="str"></param> /// <returns></returns> private static string[] LetterPairs(string str) { int numPairs = str.Length - 1; string[] pairs = new string[numPairs]; for (int i = 0; i < numPairs; i++) { pairs[i] = str.Substring(i, 2); } return pairs; } 

我的JavaScript实现需要一个string或string数​​组,以及一个可选楼层(默认楼层为0.5)。 如果你传递一个string,它将返回true或false,取决于string的相似度是否大于或等于floor。 如果将它传递给一个string数组,它将返回一个数组,这些string的相似度分数大于或等于楼层,按分数sorting。

例子:

 'Healed'.fuzzy('Sealed'); // returns true 'Healed'.fuzzy('Help'); // returns false 'Healed'.fuzzy('Help', 0.25); // returns true 'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']); // returns ["Sealed", "Healthy"] 'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0); // returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"] 

这里是:

 (function(){ var default_floor = 0.5; function pairs(str){ var pairs = [] , length = str.length - 1 , pair; str = str.toLowerCase(); for(var i = 0; i < length; i++){ pair = str.substr(i, 2); if(!/\s/.test(pair)){ pairs.push(pair); } } return pairs; } function similarity(pairs1, pairs2){ var union = pairs1.length + pairs2.length , hits = 0; for(var i = 0; i < pairs1.length; i++){ for(var j = 0; j < pairs1.length; j++){ if(pairs1[i] == pairs2[j]){ pairs2.splice(j--, 1); hits++; break; } } } return 2*hits/union || 0; } String.prototype.fuzzy = function(strings, floor){ var str1 = this , pairs1 = pairs(this); floor = typeof floor == 'number' ? floor : default_floor; if(typeof(strings) == 'string'){ return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase(); }else if(strings instanceof Array){ var scores = {}; strings.map(function(str2){ scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase()); }); return strings.filter(function(str){ return scores[str] >= floor; }).sort(function(a, b){ return scores[b] - scores[a]; }); } }; })(); 

为了您的方便,这里有一个缩小版本:

 (function(){function g(a){var b=[],e=a.length-1,d;a=a.toLowerCase();for(var c=0;c<e;c++)d=a.substr(c,2),/\s/.test(d)||b.push(d);return b}function h(a,b){for(var e=a.length+b.length,d=0,c=0;c<a.length;c++)for(var f=0;f<a.length;f++)if(a[c]==b[f]){b.splice(f--,1);d++;break}return 2*d/e||0}String.prototype.fuzzy=function(a,b){var e=this,d=g(this);b="number"==typeof b?b:0.5;if("string"==typeof a)return 1<e.length&&1<a.length&&h(d,g(a))>=b||e.toLowerCase()==a.toLowerCase();if(a instanceof Array){var c={};a.map(function(a){c[a]=1<e.length?h(d,g(a)):1*(e.toLowerCase()==a.toLowerCase())});return a.filter(function(a){return c[a]>=b}).sort(function(a,b){return c[b]-c[a]})}}})(); 

Dice系数algorithm(Simon White / marzagao的答案)是在Ruby中用a_match gem中的pair_distance_similar方法实现的

https://github.com/flori/amatch

这个gem还包含一些近似匹配和string比较algorithm的实现:Levenshtein编辑距离,Sellers编辑距离,汉明距离,最长公共子序列长度,最长公共子串长度,对距离度量,Jaro-Winkler度量。

一个Haskell版本 – 随意提出编辑,因为我没有做很多的Haskell。

 import Data.Char import Data.List -- Convert a string into words, then get the pairs of words from that phrase wordLetterPairs :: String -> [String] wordLetterPairs s1 = concat $ map pairs $ words s1 -- Converts a String into a list of letter pairs. pairs :: String -> [String] pairs [] = [] pairs (x:[]) = [] pairs (x:ys) = [x, head ys]:(pairs ys) -- Calculates the match rating for two strings matchRating :: String -> String -> Double matchRating s1 s2 = (numberOfMatches * 2) / totalLength where pairsS1 = wordLetterPairs $ map toLower s1 pairsS2 = wordLetterPairs $ map toLower s2 numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2 totalLength = fromIntegral $ length pairsS1 + length pairsS2 

Clojure的:

 (require '[clojure.set :refer [intersection]]) (defn bigrams [s] (->> (split s #"\s+") (mapcat #(partition 2 1 %)) (set))) (defn string-similarity [ab] (let [a-pairs (bigrams a) b-pairs (bigrams b) total-count (+ (count a-pairs) (count b-pairs)) match-count (count (intersection a-pairs b-pairs)) similarity (/ (* 2 match-count) total-count)] similarity)) 

什么Levenshtein距离,除以第一个string的长度(或可选地划分我的最小/最大/平均两个string的长度)? 到目前为止,这对我来说是有效的。

嘿家伙,我给了这个在JavaScript尝试,但我是新来的,任何人都知道更快的方法来做到这一点?

 function get_bigrams(string) { // Takes a string and returns a list of bigrams var s = string.toLowerCase(); var v = new Array(s.length-1); for (i = 0; i< v.length; i++){ v[i] =s.slice(i,i+2); } return v; } function string_similarity(str1, str2){ /* Perform bigram comparison between two strings and return a percentage match in decimal form */ var pairs1 = get_bigrams(str1); var pairs2 = get_bigrams(str2); var union = pairs1.length + pairs2.length; var hit_count = 0; for (x in pairs1){ for (y in pairs2){ if (pairs1[x] == pairs2[y]){ hit_count++; } } } return ((2.0 * hit_count) / union); } var w1 = 'Healed'; var word =['Heard','Healthy','Help','Herded','Sealed','Sold'] for (w2 in word){ console.log('Healed --- ' + word[w2]) console.log(string_similarity(w1,word[w2])); } 

这是基于Sørensen-Dice索引(marzagao的答案)的另一个版本,用C ++ 11编写:

 /* * Similarity based in Sørensen–Dice index. * * Returns the Similarity between _str1 and _str2. */ double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) { // Base case: if some string is empty. if (_str1.empty() || _str2.empty()) { return 1.0; } auto str1 = upper_string(_str1); auto str2 = upper_string(_str2); // Base case: if the strings are equals. if (str1 == str2) { return 0.0; } // Base case: if some string does not have bigrams. if (str1.size() < 2 || str2.size() < 2) { return 1.0; } // Extract bigrams from str1 auto num_pairs1 = str1.size() - 1; std::unordered_set<std::string> str1_bigrams; str1_bigrams.reserve(num_pairs1); for (unsigned i = 0; i < num_pairs1; ++i) { str1_bigrams.insert(str1.substr(i, 2)); } // Extract bigrams from str2 auto num_pairs2 = str2.size() - 1; std::unordered_set<std::string> str2_bigrams; str2_bigrams.reserve(num_pairs2); for (unsigned int i = 0; i < num_pairs2; ++i) { str2_bigrams.insert(str2.substr(i, 2)); } // Find the intersection between the two sets. int intersection = 0; if (str1_bigrams.size() < str2_bigrams.size()) { const auto it_e = str2_bigrams.end(); for (const auto& bigram : str1_bigrams) { intersection += str2_bigrams.find(bigram) != it_e; } } else { const auto it_e = str1_bigrams.end(); for (const auto& bigram : str2_bigrams) { intersection += str1_bigrams.find(bigram) != it_e; } } // Returns similarity coefficient. return (2.0 * intersection) / (num_pairs1 + num_pairs2); } 

I was looking for pure ruby implementation of the algorithm indicated by @marzagao's answer. Unfortunately, link indicated by @marzagao is broken. In @s01ipsist answer, he indicated ruby gem amatch where implementation is not in pure ruby. So I searchd a little and found gem fuzzy_match which has pure ruby implementation (though this gem use amatch ) at here . I hope this will help someone like me.