Similarity between two text documents

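I am working on an NLP project, in any language (though Python will be my preference).

I want to write a program that takes two documents and determines how similar they are.

I am fairly new to this, and a quick Google search does not turn up much. Do you know of any references (websites, textbooks, journal articles) that cover this subject and would help me?

Thanks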

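The common way of doing this is to transform the documents into tf-idf vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See especially Introduction to Information Retrieval, which is free and available online.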
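
For reference, the two steps can be written out as below. This is one common textbook formulation, and the exact idf variant is an assumption here; libraries such as scikit-learn apply a smoothed idf by default, so their weights will differ slightly.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
\qquad
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
```

Here tf(t, d) is the number of times term t occurs in document d, df(t) is the number of documents that contain t, N is the total number of documents, and a and b are the tf-idf vectors of the two documents being compared.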

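Tf-idf (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as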

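    from sklearn.feature_extraction.text import TfidfVectorizer

    # text_files is a list of paths to the documents being compared.
    # Read each file's contents: TfidfVectorizer expects strings by default.
    documents = [open(f).read() for f in text_files]
    tfidf = TfidfVectorizer().fit_transform(documents)
    # no need to normalize, since Vectorizer will return normalized tf-idf
    pairwise_similarity = tfidf * tfidf.T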

or, if the documents are plain strings,

    >>> vect = TfidfVectorizer(min_df=1)
    >>> tfidf = vect.fit_transform(["I'd like an apple",
    ...                             "An apple a day keeps the doctor away",
    ...                             "Never compare an apple to an orange",
    ...                             "I prefer scikit-learn to Orange"])
    >>> (tfidf * tfidf.T).toarray()
    array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
           [ 0.25082859,  1.        ,  0.22057609,  0.        ],
           [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
           [ 0.        ,  0.        ,  0.26264139,  1.        ]])

though Gensim may have more options for this kind of task.

See also this question.

[Disclaimer: I was involved in the scikit-learn tf-idf implementation.]

Identical to @larsman, but with some preprocessing

    import string

    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer

    nltk.download('punkt')  # if necessary...

    stemmer = nltk.stem.porter.PorterStemmer()
    # Translation table that deletes every punctuation character.
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

    def stem_tokens(tokens):
        return [stemmer.stem(item) for item in tokens]

    def normalize(text):
        '''remove punctuation, lowercase, stem'''
        return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

    vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

    def cosine_sim(text1, text2):
        tfidf = vectorizer.fit_transform([text1, text2])
        # Entry [0, 1] of the pairwise matrix is the similarity of text1 and text2.
        return (tfidf * tfidf.T).toarray()[0, 1]

    print(cosine_sim('a little bird', 'a little bird'))
    print(cosine_sim('a little bird', 'a little bird chirps'))
    print(cosine_sim('a little bird', 'a big dog barks'))

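Generally, the cosine similarity between two documents is used as the similarity measure. In Java, you can use Lucene (if your collection is very large) or LingPipe to do this. The basic concept is to count the terms in every document and calculate the dot product of the term vectors. The libraries provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something more complex, LingPipe also provides methods to calculate LSA similarity between documents, which gives better results than cosine similarity. For Python, you can use NLTK.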
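
To make the basic concept concrete, here is a minimal pure-Python sketch of the count-the-terms-and-take-the-dot-product idea, with no idf weighting. The function names `term_counts` and `cosine_similarity` and the regex tokenizer are illustrative assumptions, not part of Lucene, LingPipe, or NLTK.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens (a deliberately naive tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text1, text2):
    """Cosine of the angle between the raw term-count vectors of two texts."""
    a, b = term_counts(text1), term_counts(text2)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity("a little bird", "a little bird chirps"))
print(cosine_similarity("a little bird", "a big dog barks"))
```

Raw counts let very common words dominate the score, which is the weakness the idf weighting mentioned above is meant to address.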

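This is an old question, but I found this can be done easily with spaCy. Once the documents are read, the simple API similarity can be used to find the cosine similarity between the document vectors.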

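    import spacy

    # 'en' is the model shortcut from older spaCy releases; spaCy 3 and later
    # need a full package name here, e.g. 'en_core_web_md'.
    nlp = spacy.load('en')
    doc1 = nlp(u'Hello hi there!')
    doc2 = nlp(u'Hello hi there!')
    doc3 = nlp(u'Hey whatsup?')

    print(doc1.similarity(doc2))  # 0.999999954642
    print(doc2.similarity(doc3))  # 0.699032527716
    print(doc1.similarity(doc3))  # 0.699032527716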

Here's a little app to get you started...

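    import difflib as dl

    with open('file') as f:
        a = f.read()
    with open('file1') as f:
        b = f.read()

    sim = dl.get_close_matches

    s = 0
    wa = a.split()
    wb = b.split()

    # Count the words of the first file that have a close match in the second.
    for i in wa:
        if sim(i, wb):
            s += 1

    # Fraction of matched words; assumes the first file is not empty.
    n = float(s) / float(len(wa))
    print('%d%% similarity' % int(n * 100))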

You might want to try this online service for cosine document similarity: http://www.scurtu.it/documentSimilarity.html

    import json
    import urllib.parse
    import urllib.request

    API_URL = "http://www.scurtu.it/apis/documentSimilarity"
    inputDict = {}
    inputDict['doc1'] = 'Document with some text'
    inputDict['doc2'] = 'Other document with some text'
    # POST data must be bytes in Python 3.
    params = urllib.parse.urlencode(inputDict).encode('utf-8')
    with urllib.request.urlopen(API_URL, params) as f:
        response = f.read()
    responseObject = json.loads(response)
    print(responseObject)
