Python: tf-idf-cosine: finding document similarity

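I was following a tutorial that is available as Part 1 and Part 2. Unfortunately, the author did not have time for the final part, which involves using cosine similarity to actually find the similarity between two documents. I followed the examples in the article with the help of the following link from stackoverflow. I have included the code mentioned in that link, just to make life easier for anyone answering.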

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords
    import numpy as np
    import numpy.linalg as LA

    train_set = ["The sky is blue.", "The sun is bright."]  # Documents
    test_set = ["The sun in the sky is bright."]  # Query
    stopWords = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words=stopWords)
    # print(vectorizer)
    transformer = TfidfTransformer()
    # print(transformer)

    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    print('Fit Vectorizer to train set', trainVectorizerArray)
    print('Transform Vectorizer to test set', testVectorizerArray)

    transformer.fit(trainVectorizerArray)
    print()
    print(transformer.transform(trainVectorizerArray).toarray())

    transformer.fit(testVectorizerArray)
    print()
    tfidf = transformer.transform(testVectorizerArray)
    print(tfidf.todense())

As a result of the above code I have the following matrices:

    Fit Vectorizer to train set [[1 0 1 0]
     [0 1 0 1]]
    Transform Vectorizer to test set [[0 1 1 1]]

    [[ 0.70710678  0.          0.70710678  0.        ]
     [ 0.          0.70710678  0.          0.70710678]]

    [[ 0.          0.57735027  0.57735027  0.57735027]]

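I am not sure how to use this output to calculate cosine similarity. I know how to implement cosine similarity for two vectors of the same length, but here I am not sure how to identify the two vectors.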

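First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it in one operation with TfidfVectorizer: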

    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from sklearn.datasets import fetch_20newsgroups
    >>> twenty = fetch_20newsgroups()
    >>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
    >>> tfidf
    <11314x130088 sparse matrix of type '<type 'numpy.float64'>'
        with 1787553 stored elements in Compressed Sparse Row format>
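To make "TF-IDF normalization and row-wise Euclidean normalization" concrete, here is a minimal NumPy sketch applied to the count matrix from the question. The helper name `tfidf_l2` is made up for illustration, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is what I understand scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`) to use, so treat it as an assumption rather than a reference implementation.

```python
import numpy as np

def tfidf_l2(counts):
    """TF-IDF with a smoothed IDF, followed by row-wise L2 normalization.

    Hypothetical helper; assumes idf = ln((1 + n) / (1 + df)) + 1.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)               # number of documents containing each term
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
    weighted = counts * idf                           # term frequency times IDF
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                           # avoid dividing all-zero rows by zero
    return weighted / norms

# Count matrix printed by the question's code for the two training documents
train_counts = [[1, 0, 1, 0],
                [0, 1, 0, 1]]
train_tfidf = tfidf_l2(train_counts)
print(train_tfidf)
```

Every row comes out with unit length, which is what makes a plain dot product between rows equal to their cosine similarity.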

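Now to find the cosine similarities between one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, because the tfidf vectors are already row-normalized. The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector you need to slice the matrix row-wise to get a submatrix with a single row: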

    >>> tfidf[0:1]
    <1x130088 sparse matrix of type '<type 'numpy.float64'>'
        with 89 stored elements in Compressed Sparse Row format>

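scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product, which is also known as the linear kernel: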

    >>> from sklearn.metrics.pairwise import linear_kernel
    >>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
    >>> cosine_similarities
    array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
            0.04457106,  0.03293218])
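If you want to convince yourself that the linear kernel really is the cosine similarity here, a quick check on a tiny corpus can help. This is only a sketch: it reuses the three sentences from the question instead of the 20 newsgroups data, and the variable names are mine.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Tiny stand-in corpus (the sentences from the question)
docs = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]
tfidf_small = TfidfVectorizer().fit_transform(docs)

# With the default settings every row should have unit Euclidean length
row_norms = np.linalg.norm(tfidf_small.toarray(), axis=1)

# Plain dot products versus the explicit cosine similarity
dot_products = linear_kernel(tfidf_small[0:1], tfidf_small).flatten()
cosines = cosine_similarity(tfidf_small[0:1], tfidf_small).flatten()

print(row_norms)
print(np.allclose(dot_products, cosines))
```

If you build the vectors with `norm=None`, the rows are no longer unit length and the shortcut stops being valid; in that case use `cosine_similarity` directly.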

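Hence, to find the top related documents, we can use argsort and some negative array slicing (the most related documents have the highest cosine similarity values, so they sit at the end of the sorted indices array). Note that the slice [:-5:-1] returns four indices, the query document included; use [:-6:-1] to get five: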

    >>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
    >>> related_docs_indices
    array([    0,   958, 10576,  3277])
    >>> cosine_similarities[related_docs_indices]
    array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
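The negative slicing is easy to get off by one, so here is a small sketch on made-up scores. The helper `top_k_indices` and the score values are hypothetical and only serve to show the indexing.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first (hypothetical helper)."""
    order = np.argsort(scores)[::-1]   # ascending argsort, reversed to descending
    return order[:k]

# Made-up similarity scores; index 0 plays the role of the query document
scores = np.array([1.0, 0.04, 0.55, 0.11, 0.33, 0.28, 0.02])

four = scores.argsort()[:-5:-1]    # the slice used above: stops before position -5
five = top_k_indices(scores, 5)
print(four)   # [0 2 4 5]
print(five)   # [0 2 4 5 3]
```

For very large arrays `np.argpartition` avoids a full sort, but for a few thousand documents the plain `argsort` is usually fine.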

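The first result is a sanity check: we find the query document itself as the most similar document, with a cosine similarity score of 1. It has the following text: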

    >>> print(twenty.data[0])
    From: lerxst@wam.umd.edu (where's my thing)
    Subject: WHAT car is this!?
    Nntp-Posting-Host: rac3.wam.umd.edu
    Organization: University of Maryland, College Park
    Lines: 15

    I was wondering if anyone out there could enlighten me on this car I saw
    the other day. It was a 2-door sports car, looked to be from the late 60s/
    early 70s. It was called a Bricklin. The doors were really small. In addition,
    the front bumper was separate from the rest of the body. This is
    all I know. If anyone can tellme a model name, engine specs, years
    of production, where this car is made, history, or whatever info you
    have on this funky looking car, please e-mail.

    Thanks,
    - IL
       ---- brought to you by your neighborhood Lerxst ----

The second most similar document is a reply that quotes the original message, and hence shares many words with it:

    >>> print(twenty.data[958])
    From: rseymour@reed.edu (Robert Seymour)
    Subject: Re: WHAT car is this!?
    Article-ID: reed.1993Apr21.032905.29286
    Reply-To: rseymour@reed.edu
    Organization: Reed College, Portland, OR
    Lines: 26

    In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my thing) writes:
    >
    > I was wondering if anyone out there could enlighten me on this car I saw
    > the other day. It was a 2-door sports car, looked to be from the late 60s/
    > early 70s. It was called a Bricklin. The doors were really small. In addition,
    > the front bumper was separate from the rest of the body. This is
    > all I know. If anyone can tellme a model name, engine specs, years
    > of production, where this car is made, history, or whatever info you
    > have on this funky looking car, please e-mail.

    Bricklins were manufactured in the 70s with engines from Ford. They are rather
    odd looking with the encased front bumper. There aren't a lot of them around,
    but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
    performance Ford with new styling slapped on top.

    > ---- brought to you by your neighborhood Lerxst ----

    Rush fan?

    --
    Robert Seymour rseymour@reed.edu
    Physics and Philosophy, Reed College (NeXTmail accepted)
    Artificial Life Project Reed College
    Reed Solar Energy Project (SolTrain) Portland, OR

I know it's an old post, but I tried the http://scikit-learn.sourceforge.net/stable/ package. The question was how to calculate the cosine similarity with this package, and here is my code for that:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Read each document as UTF-8, ignoring bytes that cannot be decoded
    with open("/root/Myfolder/scoringDocuments/doc1", encoding="utf-8", errors="ignore") as f:
        doc1 = f.read()
    with open("/root/Myfolder/scoringDocuments/doc2", encoding="utf-8", errors="ignore") as f:
        doc2 = f.read()
    with open("/root/Myfolder/scoringDocuments/doc3", encoding="utf-8", errors="ignore") as f:
        doc3 = f.read()

    train_set = ["president of India", doc1, doc2, doc3]

    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)  # finds the tfidf score with normalization

    # here the first element of tfidf_matrix_train is matched with the other three elements
    print("cosine scores ==> ", cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train))

Here, suppose the query is the first element of train_set, and doc1, doc2 and doc3 are the documents I want to rank by cosine similarity. Then I can use this code.

The tutorials mentioned in the question were also very useful. Here are all the parts: part I, part II, part III.

The output will be as follows:

 [[ 1. 0.07102631 0.02731343 0.06348799]]

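Here 1 means the query is matched with itself, and the other three values are the scores for matching the query against the respective documents.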

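With the help of @excray's comment, I managed to figure out the answer. What we need to do is simply write a for loop that iterates over the two arrays representing the train data and the test data.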

First implement a simple lambda function to hold the formula for the cosine calculation:

 cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

Then just write a simple for loop over the two arrays. The logic is: for each vector in trainVectorizerArray, find its cosine similarity with each vector in testVectorizerArray.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords
    import numpy as np
    import numpy.linalg as LA

    train_set = ["The sky is blue.", "The sun is bright."]  # Documents
    test_set = ["The sun in the sky is bright."]  # Query
    stopWords = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words=stopWords)
    # print(vectorizer)
    transformer = TfidfTransformer()
    # print(transformer)

    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    print('Fit Vectorizer to train set', trainVectorizerArray)
    print('Transform Vectorizer to test set', testVectorizerArray)

    # cosine similarity of two vectors, rounded to 3 decimals
    cx = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)

    for vector in trainVectorizerArray:
        print(vector)
        for testV in testVectorizerArray:
            print(testV)
            cosine = cx(vector, testV)
            print(cosine)

    transformer.fit(trainVectorizerArray)
    print()
    print(transformer.transform(trainVectorizerArray).toarray())

    transformer.fit(testVectorizerArray)
    print()
    tfidf = transformer.transform(testVectorizerArray)
    print(tfidf.todense())

Here is the output:

    Fit Vectorizer to train set [[1 0 1 0]
     [0 1 0 1]]
    Transform Vectorizer to test set [[0 1 1 1]]
    [1 0 1 0]
    [0 1 1 1]
    0.408
    [0 1 0 1]
    [0 1 1 1]
    0.816

    [[ 0.70710678  0.          0.70710678  0.        ]
     [ 0.          0.70710678  0.          0.70710678]]

    [[ 0.          0.57735027  0.57735027  0.57735027]]
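The two scores in this output can be checked by hand from the definition cos(a, b) = a·b / (‖a‖ ‖b‖). Below is a minimal sketch on the count vectors printed above; the function name `cosine_sim` and the choice to return 0.0 for an all-zero vector are my own assumptions, not part of the original code.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity from the definition; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0   # convention chosen for this sketch
    return float(np.dot(a, b) / denom)

# Count vectors from the output above
train_counts = np.array([[1, 0, 1, 0],
                         [0, 1, 0, 1]])
query_counts = np.array([0, 1, 1, 1])

scores = [round(cosine_sim(row, query_counts), 3) for row in train_counts]
print(scores)   # [0.408, 0.816]
```

The first score is 1 / (√2 · √3) and the second is 2 / (√2 · √3), so the second training document ("The sun is bright.") is the closer match to the query.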

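Let me give you another tutorial, written by me. It answers your question, but also explains why we do each of the steps. I also tried to keep it concise.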

So you have a list_of_documents, which is just an array of strings, and another document, which is just a string. You need to find the document in list_of_documents that is most similar to document.

Let's combine them together: documents = list_of_documents + [document]

Let's start with the dependencies. It will become clear why we use each of them.

    from nltk.corpus import stopwords
    import string
    from nltk.tokenize import wordpunct_tokenize as tokenize
    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.spatial.distance import cosine

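One of the approaches that can be used is the bag-of-words approach, where we treat each word in the document as independent of the others and just throw all of them together into one big bag. From one point of view, it loses a lot of information (such as how the words are connected), but from another point of view it keeps the model simple.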

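In English, as in any other human language, there are a lot of "useless" words like "a", "the" and "in", which are so common that they carry little meaning. They are called stop words, and it is a good idea to remove them. Another thing one can notice is that words like "analyze", "analyzer" and "analysis" are really similar. They share a common root and can all be converted to a single word. This process is called stemming, and there are different stemmers that differ in speed, aggressiveness and so on. So we transform each document into a list of word stems with the stop words removed. We also discard all punctuation.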

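    porter = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    # translation table that deletes every punctuation character
    remove_punctuation = str.maketrans('', '', string.punctuation)

    # for each document: strip punctuation, tokenize, drop stop words, stem the rest
    modified_arr = [
        [porter.stem(i.lower())
         for i in tokenize(d.translate(remove_punctuation))
         if i.lower() not in stop_words]
        for d in documents
    ]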

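So how does this bag of words help us? Imagine we have 3 bags: [a, b, c], [a, c, a] and [b, c, d]. We can convert them to vectors in the basis [a, b, c, d]. So we end up with the vectors [1, 1, 1, 0], [2, 0, 1, 0] and [0, 1, 1, 1]. The same thing happens with our documents (only the vectors will be much longer). Now we see why we removed a lot of words and stemmed the others: it reduces the dimension of the vectors. One more interesting observation: longer documents will have far more positive elements than shorter ones, which is why it is a good idea to normalize the vectors. This is called term frequency (TF). People also use additional information about how often the word appears in other documents, the inverse document frequency (IDF). Together we get the TF-IDF metric, which comes in a couple of flavors. It can be computed with one line in sklearn :-)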

    # this is only to convert our list of lists to the list of strings that the vectorizer uses
    modified_doc = [' '.join(i) for i in modified_arr]
    tf_idf = TfidfVectorizer().fit_transform(modified_doc)

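Actually, the vectorizer can do a lot of these things itself, such as removing stop words and lowercasing. I did them in a separate step only because sklearn does not ship non-English stop words, but nltk does.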
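To see the bag-to-vector conversion in isolation, here is a small pure-Python sketch of the three-bag example from earlier in this answer. The helper `bag_to_vector` is hypothetical and only counts terms; it does no TF-IDF weighting or normalization.

```python
from collections import Counter

def bag_to_vector(bag, basis):
    """Count how often each basis term occurs in the bag (hypothetical helper)."""
    counts = Counter(bag)
    return [counts[term] for term in basis]   # a Counter returns 0 for missing terms

bags = [["a", "b", "c"], ["a", "c", "a"], ["b", "c", "d"]]

# The basis is the sorted set of all terms seen in any bag
basis = sorted({term for bag in bags for term in bag})
vectors = [bag_to_vector(bag, basis) for bag in bags]

print(basis)     # ['a', 'b', 'c', 'd']
print(vectors)   # [[1, 1, 1, 0], [2, 0, 1, 0], [0, 1, 1, 1]]
```

With real documents the basis is the whole vocabulary, which is why removing stop words and stemming noticeably shrinks the vectors.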

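So we have all the vectors calculated. The last step is to find which one is most similar to the last one. There are various ways to do that. One of them is Euclidean distance, which is not so great for the reasons discussed here. Another approach is cosine similarity. We iterate over all the documents and calculate the cosine distance between each document and the last one: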

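    l = len(documents) - 1                    # index of the last row, i.e. the query document
    query = tf_idf[l].toarray().ravel()       # scipy's cosine() expects 1-D dense vectors

    minimum = (1, None)                       # (smallest cosine distance so far, its index)
    for i in range(l):
        distance = cosine(tf_idf[i].toarray().ravel(), query)
        if distance < minimum[0]:
            minimum = (distance, i)
    print(minimum)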

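Now minimum holds the cosine distance and the index of the best matching document. Keep in mind that scipy's cosine returns a distance (1 minus the similarity), so smaller is better.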
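One commonly cited reason to prefer cosine over Euclidean distance for text is document length. The sketch below uses made-up count vectors to illustrate it; this may or may not be the exact argument made in the discussion linked above, and `cosine_distance` is a hypothetical helper that mirrors what `scipy.spatial.distance.cosine` computes.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine similarity (hypothetical helper)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short_doc = np.array([1.0, 1.0, 1.0, 0.0])
long_doc = 3 * short_doc                      # same word mix, three times as long
other_doc = np.array([0.0, 1.0, 1.0, 1.0])    # different word mix, similar length

euclid_long = np.linalg.norm(short_doc - long_doc)
euclid_other = np.linalg.norm(short_doc - other_doc)
cos_long = cosine_distance(short_doc, long_doc)
cos_other = cosine_distance(short_doc, other_doc)

print(euclid_long, euclid_other)   # Euclidean ranks other_doc as closer
print(cos_long, cos_other)         # cosine ranks long_doc as closer
```

Euclidean distance penalizes the longer document even though its word proportions are identical, while cosine distance only looks at the direction of the vectors.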

This should help you:

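    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # train_set (a list of strings) and length (presumably len(train_set))
    # are assumed to be defined earlier; they are not shown in this answer.
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(train_set)
    print(tfidf_matrix)

    # compare the last document against every document in the set
    cosine = cosine_similarity(tfidf_matrix[length-1], tfidf_matrix)
    print(cosine)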

and the output will be:

 [[ 0.34949812 0.81649658 1. ]]
