Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

I need to compare documents stored in a database and come up with a similarity score between 0 and 1.

The method I need to use has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), together with simple implementations of tf-idf and cosine similarity.

Is there any program that can do this, or should I write it from scratch?

Check out the NLTK package at http://www.nltk.org; it has everything you need.
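For example, here is a rough sketch (not from the original answer) of getting word n-grams and term counts out of NLTK's public API; nltk.ngrams and FreqDist are the helpers assumed here, and a plain split() stands in for real tokenization:

    import nltk

    # hypothetical toy documents; in practice these come from your database
    doc_a = "this is a foo bar sentence".split()
    doc_b = "foo bar bar black sheep".split()

    print(list(nltk.ngrams(doc_a, 2)))            # word bigrams
    print(nltk.FreqDist(doc_b).most_common())     # raw term counts for tf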

For cosine similarity:

    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u.
        This is equal to u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
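As a quick sanity check on two small made-up vectors (identical vectors give 1.0, orthogonal ones give 0.0):

    u = numpy.array([1.0, 1.0, 0.0, 1.0])
    v = numpy.array([0.0, 2.0, 1.0, 1.0])
    print(cosine_distance(u, v))   # roughly 0.707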

For ngrams:

    from itertools import chain

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """
        A utility that produces a sequence of ngrams from a sequence of items.
        For example:

        >>> ngrams([1,2,3,4,5], 3)
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

        Use ingram for an iterator version of this function.  Set pad_left
        or pad_right to true in order to get additional ngrams:

        >>> ngrams([1,2,3,4,5], 2, pad_right=True)
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

        @param sequence: the source data to be converted into ngrams
        @type sequence: C{sequence} or C{iterator}
        @param n: the degree of the ngrams
        @type n: C{int}
        @param pad_left: whether the ngrams should be left-padded
        @type pad_left: C{boolean}
        @param pad_right: whether the ngrams should be right-padded
        @type pad_right: C{boolean}
        @param pad_symbol: the symbol to use for padding (default is None)
        @type pad_symbol: C{any}
        @return: The ngrams
        @rtype: C{list} of C{tuple}s
        """
        if pad_left:
            sequence = chain((pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n-1))
        sequence = list(sequence)
        count = max(0, len(sequence) - n + 1)
        return [tuple(sequence[i:i+n]) for i in range(count)]

For tf-idf you will have to compute the distribution first. I used Lucene to do that, but you may very well do something similar with NLTK, using FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
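As a rough, hedged sketch of that route (FreqDist only gives per-document counts, so the idf part still has to be computed across all documents yourself):

    from math import log
    from nltk import FreqDist

    # hypothetical toy corpus; in practice the documents come from your database
    docs = ["this is a foo bar".split(),
            "foo bar bar black sheep".split()]

    freqs = [FreqDist(doc) for doc in docs]          # per-document term counts
    n_docs = float(len(docs))

    def tf_idf(term, doc_index):
        tf = freqs[doc_index].freq(term)             # relative frequency within the document
        df = sum(1 for fd in freqs if term in fd)    # number of documents containing the term
        return tf * log(n_docs / df) if df else 0.0

    print(tf_idf('sheep', 1))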

If you are into pylucene, this tells you how to compute tf.idf:

    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    # assumes searcher, sim (a Similarity), tmap, fieldname, maxtokensperdoc,
    # max_doc and cosine_normalization are already set up elsewhere
    docs = reader.numDocs()
    for i in xrange(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
                df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                tmap.setdefault(t, len(tmap))
                rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x**2 for x in rec.values()])**0.5
                for k, v in rec.items():
                    rec[k] = v / denom
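Once each document has such a rec mapping of term to weight, two documents can be compared directly over their shared terms; a minimal sketch (the helper name is mine, not part of the snippet above):

    def cosine_from_dicts(rec_a, rec_b):
        # dot product over the terms the two documents share
        dot = sum(rec_a[t] * rec_b[t] for t in rec_a if t in rec_b)
        norm_a = sum(v ** 2 for v in rec_a.values()) ** 0.5
        norm_b = sum(v ** 2 for v in rec_b.values()) ** 0.5
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)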

If you are interested, I have done a tutorial series (Part-I and Part-II) that discusses tf-idf and uses the Scikits.learn (sklearn) Python module.

Part-III has cosine similarity.

In short, here is an answer with python + numpy:

Cosine:

    def cosine_sim(u, v):
        return np.dot(u, v) / (sqrt(np.dot(u, u)) * sqrt(np.dot(v, v)))
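For instance, on the raw count vectors used in the tests further down (imports as in the full listing below):

    print(cosine_sim([1, 1, 0, 1, 1, 0, 0, 1], [0, 2, 1, 1, 0, 0, 1, 0]))  # ~0.507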

Ngrams

    def ngrams(sentence, n):
        return zip(*[sentence.split()[i:] for i in range(n)])
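For example (note that on Python 3 zip returns an iterator, so wrap the result in list() to materialize it):

    print(list(ngrams("this is a foo bar", 2)))
    # [('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')]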

TF-IDF (it is a little weird, but it works):

    def tfidf(corpus, vocab):
        """
        INPUT:
        corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
                  ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
                  ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
        vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']

        OUTPUT:
        [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
         [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
         [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
        """
        def termfreq(matrix, doc, term):
            try:
                return matrix[doc][term] / float(sum(matrix[doc].values()))
            except ZeroDivisionError:
                return 0

        def inversedocfreq(matrix, term):
            try:
                return float(len(matrix)) / sum([1 for i, _ in enumerate(matrix)
                                                 if matrix[i][term] > 0])
            except ZeroDivisionError:
                return 0

        matrix = [{k: v for k, v in zip(vocab, i[1])} for i in corpus]
        tfidf = defaultdict(dict)
        for doc, _ in enumerate(matrix):
            for term in matrix[doc]:
                tf = termfreq(matrix, doc, term)
                idf = inversedocfreq(matrix, term)
                tfidf[doc][term] = tf * idf
        return [[tfidf[doc][term] for term in vocab] for doc, _ in enumerate(tfidf)]
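The "weird" part is that inversedocfreq returns the raw ratio N/df with no logarithm applied; a more conventional tf-idf weighting would use log(N/df) instead.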

Here is the answer put together with the tests:

    import numpy as np
    from math import sqrt, log
    from itertools import chain, product
    from collections import defaultdict

    def cosine_sim(u, v):
        return np.dot(u, v) / (sqrt(np.dot(u, u)) * sqrt(np.dot(v, v)))

    def ngrams(sentence, n):
        return zip(*[sentence.split()[i:] for i in range(n)])

    def tfidf(corpus, vocab):
        """
        INPUT:
        corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
                  ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
                  ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
        vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']

        OUTPUT:
        [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
         [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
         [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
        """
        def termfreq(matrix, doc, term):
            try:
                return matrix[doc][term] / float(sum(matrix[doc].values()))
            except ZeroDivisionError:
                return 0

        def inversedocfreq(matrix, term):
            try:
                return float(len(matrix)) / sum([1 for i, _ in enumerate(matrix)
                                                 if matrix[i][term] > 0])
            except ZeroDivisionError:
                return 0

        matrix = [{k: v for k, v in zip(vocab, i[1])} for i in corpus]
        tfidf = defaultdict(dict)
        for doc, _ in enumerate(matrix):
            for term in matrix[doc]:
                tf = termfreq(matrix, doc, term)
                idf = inversedocfreq(matrix, term)
                tfidf[doc][term] = tf * idf
        return [[tfidf[doc][term] for term in vocab] for doc, _ in enumerate(tfidf)]

    def corpus2vectors(corpus):
        def vectorize(sentence, vocab):
            return [sentence.split().count(i) for i in vocab]
        vectorized_corpus = []
        vocab = sorted(set(chain(*[i.lower().split() for i in corpus])))
        for i in corpus:
            vectorized_corpus.append((i, vectorize(i, vocab)))
        return vectorized_corpus, vocab

    def create_test_corpus():
        sent1 = "this is a foo bar"
        sent2 = "foo bar bar black sheep"
        sent3 = "this is a sentence"
        all_sents = [sent1, sent2, sent3]
        corpus, vocab = corpus2vectors(all_sents)
        return corpus, vocab

    def test_cosine():
        corpus, vocab = create_test_corpus()
        for sentx, senty in product(corpus, corpus):
            print sentx[0]
            print senty[0]
            print "cosine =", cosine_sim(sentx[1], senty[1])
            print

    def test_ngrams():
        corpus, vocab = create_test_corpus()
        for sentx in corpus:
            print sentx[0]
            print ngrams(sentx[0], 2)
            print ngrams(sentx[0], 3)
            print

    def test_tfidf():
        corpus, vocab = create_test_corpus()
        print corpus
        print vocab
        print tfidf(corpus, vocab)

    print "Testing cosine..."
    test_cosine()
    print
    print "Testing ngrams..."
    test_ngrams()
    print
    print "Testing tfidf..."
    test_tfidf()
    print

[OUT]:

    Testing cosine...
    this is a foo bar
    this is a foo bar
    cosine = 1.0

    this is a foo bar
    foo bar bar black sheep
    cosine = 0.507092552837

    this is a foo bar
    this is a sentence
    cosine = 0.67082039325

    foo bar bar black sheep
    this is a foo bar
    cosine = 0.507092552837

    foo bar bar black sheep
    foo bar bar black sheep
    cosine = 1.0

    foo bar bar black sheep
    this is a sentence
    cosine = 0.0

    this is a sentence
    this is a foo bar
    cosine = 0.67082039325

    this is a sentence
    foo bar bar black sheep
    cosine = 0.0

    this is a sentence
    this is a sentence
    cosine = 1.0

    Testing ngrams...
    this is a foo bar
    [('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')]
    [('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar')]

    foo bar bar black sheep
    [('foo', 'bar'), ('bar', 'bar'), ('bar', 'black'), ('black', 'sheep')]
    [('foo', 'bar', 'bar'), ('bar', 'bar', 'black'), ('bar', 'black', 'sheep')]

    this is a sentence
    [('this', 'is'), ('is', 'a'), ('a', 'sentence')]
    [('this', 'is', 'a'), ('is', 'a', 'sentence')]

    Testing tfidf...
    [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
    ['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']
    [[0.30000000000000004, 0.30000000000000004, 0.0, 0.30000000000000004, 0.30000000000000004, 0.0, 0.0, 0.30000000000000004], [0.0, 0.6000000000000001, 0.6000000000000001, 0.30000000000000004, 0.0, 0.0, 0.6000000000000001, 0.0], [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]

For our Information Retrieval course, we used some code written by our professor in Java. Sorry, there is no Python port. "It is only being made available for educational and research purposes under the GNU General Public License."

You can check out the documentation: http://userweb.cs.utexas.edu/~mooney/ir-course/doc/

But more specifically, check out: http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html

You can download it from http://userweb.cs.utexas.edu/users/mooney/ir-course/

If you are still interested in this problem, I did something very similar using Lucene Java and Jython. Here are some snippets from my code.

Lucene preprocesses documents and queries with so-called analyzers. This one uses Lucene's built-in n-gram filter:

    class NGramAnalyzer(Analyzer):
        '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
        def __init__(self, minlength, maxlength):
            self.minlength = minlength
            self.maxlength = maxlength
        def tokenStream(self, field, reader):
            lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
            return NGramTokenFilter(lower, self.minlength, self.maxlength)

To turn a list of ngrams into a Document:

    doc = Document()
    doc.add(Field('n-grams', ' '.join(ngrams),
                  Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))

To store a document in an index:

    # the analyzer needs the same min/max n-gram lengths used for the queries
    wr = IndexWriter(index_dir, NGramAnalyzer(minlength, maxlength), True,
                     IndexWriter.MaxFieldLength.LIMITED)
    wr.addDocument(doc)
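Once all documents have been added, remember to close the writer (wr.close()) so the index is committed to disk.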

Building queries is a little more difficult, because Lucene's QueryParser expects a query language with special operators, quotation marks, etc., but it can be circumvented (as partially explained here).
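One way around it, sketched against the same Lucene-era API used above (the 'n-grams' field name matches the document code; the variable names are illustrative): build the query programmatically from the analyzed n-gram tokens instead of parsing a query string.

    # assumes BooleanQuery, TermQuery, Term and BooleanClause come from the same
    # Lucene version imported above; query_ngrams is the list of n-gram tokens
    query = BooleanQuery()
    for token in query_ngrams:
        query.add(TermQuery(Term('n-grams', token)), BooleanClause.Occur.SHOULD)
    # the resulting query can then be handed to an IndexSearcher over the same index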