Fuzzy string comparison

What I am trying to complete is a program that reads in a file and compares each sentence against the original sentence. A sentence that is a perfect match to the original gets a score of 1, and a sentence that is the total opposite gets a 0. All other fuzzy sentences receive a score somewhere between 1 and 0.

I am unsure which operation to use to accomplish this in Python 3.

I have included sample text in which Text 1 is the original and the other strings following it are the comparisons.

Sample texts:

Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.

Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines. // Should score high, but not 1

Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. // Should score lower than Text 20

Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // Should score lower than Text 21 but not 0

Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // Should score 0!

There is a package called fuzzywuzzy. Install it via pip:

 pip install fuzzywuzzy 

Simple usage:

 >>> from fuzzywuzzy import fuzz
 >>> fuzz.ratio("this is a test", "this is a test!")
 96

The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler, it has a number of different matching methods (like token order insensitivity and partial string matching) which make it more powerful in practice. The process.extract function is especially useful: it finds the best matching strings and ratios from a set. From their readme:

Partial ratio

 >>> fuzz.partial_ratio("this is a test", "this is a test!")
 100

Token sort ratio

 >>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
 90
 >>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
 100

Token set ratio

 >>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
 84
 >>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
 100

Process

 >>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
 >>> process.extract("new york jets", choices, limit=2)
 [('New York Jets', 100), ('New York Giants', 78)]
 >>> process.extractOne("cowboys", choices)
 ("Dallas Cowboys", 90)
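
Applied to your scoring task, a minimal sketch might look like the following. It uses only fuzz.ratio, which returns an integer from 0 to 100, normalized to the 0-1 scale you asked for; the candidate sentences are taken from your samples.

 # Minimal sketch: normalise fuzz.ratio (0-100) to a 0-1 score.
 # The candidate sentences are the samples from the question.
 from fuzzywuzzy import fuzz

 original = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
 candidates = [
     "It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.",
     "It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats.",
 ]

 for sentence in candidates:
     # fuzz.ratio returns an integer 0-100, so divide by 100.0 for a 0-1 score
     print(fuzz.ratio(original, sentence) / 100.0, sentence)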

There is a module in the standard library (called difflib) that can compare strings and return a score based on their similarity. The SequenceMatcher class should do what you are after.

Edit: a small example from the Python prompt:

 >>> from difflib import SequenceMatcher as SM
 >>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
 >>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
 >>> SM(None, s1, s2).ratio()
 0.9112903225806451
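
For your file-reading setup, a rough sketch could be the following. The file name "comparisons.txt" and the one-sentence-per-line layout are assumptions; ratio() already gives a float between 0 and 1, so no rescaling is needed.

 # Rough sketch: score every line of a file against the original sentence.
 # "comparisons.txt" and one-sentence-per-line are assumed, not from the question.
 from difflib import SequenceMatcher

 original = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."

 with open("comparisons.txt") as f:
     for line in f:
         candidate = line.strip()
         if candidate:
             # ratio() returns a float between 0 and 1
             print(SequenceMatcher(None, original, candidate).ratio(), candidate)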

HTH!

fuzzyset is much faster than fuzzywuzzy (difflib) for both indexing and searching.

 from fuzzyset import FuzzySet

 corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
 It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
 I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
 It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
 corpus = [line.lstrip() for line in corpus.split("\n")]

 fs = FuzzySet(corpus)
 query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
 fs.get(query)
 # [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]

Warning: be careful not to mix unicode and bytes in your fuzzy set.

This task is called paraphrase identification, and it is an active area of research in natural language processing. I have linked several state-of-the-art papers, and for many of them you can find open source code on GitHub.

Note that all of the answers assume that there is some string/surface similarity between the two sentences, whereas in reality two sentences with little string similarity can still be semantically similar.

If you are interested in that kind of similarity, you can use Skip-Thoughts. Install the software following the GitHub guide and go to the paraphrase detection section in the readme:

 import skipthoughts
 model = skipthoughts.load_model()
 vectors = skipthoughts.encode(model, X_sentences)

This converts your sentences (X_sentences) into vectors. Later you can find the similarity of two vectors with:

 import scipy.spatial
 similarity = 1 - scipy.spatial.distance.cosine(vectors[0], vectors[1])

Here we assume vectors[0] and vectors[1] are the vectors corresponding to X_sentences[0] and X_sentences[1], the two sentences whose score you want to find.

There are other models for converting a sentence into a vector, which you can find here.

Once you have converted your sentences into vectors, similarity is just a matter of finding the cosine similarity between those vectors.
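
As a small illustration of that last step, here is a minimal cosine similarity computed directly with numpy; the two vectors are placeholders standing in for sentence embeddings.

 # Minimal illustration: cosine similarity of two vectors with numpy.
 # v1 and v2 are placeholder vectors; in practice they would be sentence embeddings.
 import numpy as np

 v1 = np.array([0.1, 0.3, 0.5])
 v2 = np.array([0.2, 0.1, 0.6])

 similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
 print(similarity)  # 1.0 means identical direction, 0.0 means orthogonal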