Getting the cosine similarity between two documents in Lucene

I have already built an index in Lucene. Without specifying a query, I would just like to get a score (cosine similarity or some other distance?) between two documents in the index.

For example, from a previously opened IndexReader ir I read the documents with IDs 2 and 4: Document d1 = ir.document(2); Document d2 = ir.document(4);

How can I get the cosine similarity between these two documents?

Thanks

When indexing, you can choose to store term frequency vectors.

At runtime, look up the term frequency vectors for both documents with IndexReader.getTermFreqVector(), and look up the document frequency of each term with IndexReader.docFreq(). That gives you every component you need to compute the cosine similarity between the two documents.
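For illustration, here is a minimal sketch of that recipe against the Lucene 3.x API named above. The class and method names are made up, and it assumes the field was indexed with term vectors enabled:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;

class TfIdfSketch {
    // Builds a tf-idf weight map for one document from the stored term vector.
    static Map<String, Double> tfIdfWeights(IndexReader reader, int docId, String field)
            throws IOException {
        // Returns null unless the field was indexed with TermVector.YES
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        String[] termTexts = tfv.getTerms();
        int[] termFreqs = tfv.getTermFrequencies();
        int numDocs = reader.numDocs();

        Map<String, Double> weights = new HashMap<String, Double>();
        for (int i = 0; i < termTexts.length; i++) {
            int df = reader.docFreq(new Term(field, termTexts[i])); // document frequency
            double idf = Math.log((double) numDocs / df);           // idf = log(N/df)
            weights.put(termTexts[i], termFreqs[i] * idf);          // tf * idf
        }
        return weights;
    }
}

Cosine similarity is then the normalized dot product of the two documents' weight maps over the union of their terms.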

An easier alternative might be to submit document A as a query (adding all of its words to the query as OR terms, boosting each by its term frequency) and then look for document B in the result set.
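A rough sketch of that alternative, again in Lucene 3.x style. The helper class, the terms/freqs arrays, and topN are placeholders; the arrays would come from document A's term vector:

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

class DocAsQuerySketch {
    // Turns document A's terms into one big OR query, boosting each
    // term by its frequency, then returns the top hits to scan for document B.
    static TopDocs searchWithDocAsQuery(IndexSearcher searcher, String field,
            String[] terms, int[] freqs, int topN) throws IOException {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < terms.length; i++) {
            TermQuery tq = new TermQuery(new Term(field, terms[i]));
            tq.setBoost(freqs[i]);       // boost by term frequency
            query.add(tq, Occur.SHOULD); // OR semantics
        }
        return searcher.search(query, topN);
    }
}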

As Julia points out, Sujit Pal's example is very useful, but the Lucene 4 API changed substantially. Here is a version rewritten for Lucene 4.

import java.io.IOException;
import java.util.*;

import org.apache.commons.math3.linear.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class CosineDocumentSimilarity {

    public static final String CONTENT = "Content";

    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineDocumentSimilarity(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Integer> f1 = getTermFrequencies(reader, 0);
        Map<String, Integer> f2 = getTermFrequencies(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        v2 = toRealVector(f2);
    }

    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored. */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
    }

    public static double getCosineSimilarity(String s1, String s2) throws IOException {
        return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
    }

    Map<String, Integer> getTermFrequencies(IndexReader reader, int docId) throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        TermsEnum termsEnum = vector.iterator(null);
        Map<String, Integer> frequencies = new HashMap<>();
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            int freq = (int) termsEnum.totalTermFreq();
            frequencies.put(term, freq);
            terms.add(term);
        }
        return frequencies;
    }

    RealVector toRealVector(Map<String, Integer> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        for (String term : terms) {
            int value = map.containsKey(term) ? map.get(term) : 0;
            vector.setEntry(i++, value);
        }
        return (RealVector) vector.mapDivide(vector.getL1Norm());
    }
}
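Usage is then a single call (the two strings are arbitrary example inputs):

// Example usage (throws IOException):
double sim = CosineDocumentSimilarity.getCosineSimilarity("apple banana apple", "banana fruit");
System.out.println(sim);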

I know the question has already been answered, but for people who come here in the future, a nice example of the solution can be found here:

http://sujitpal.blogspot.ch/2011/10/computing-document-similarity-using.html

This is a good solution by Mark Butler, but the calculation of the tf/idf weights is wrong!

Term frequency (tf): how often this term appears in this document (not in all documents, as in the code with termsEnum.totalTermFreq()).

Document frequency (df): the total number of documents in which this term appears.

Inverse document frequency: idf = log(N/df), where N is the total number of documents.

Tf/idf weight = tf * idf, for a given term and a given document.
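To make the definitions concrete (a made-up example): with N = 2 documents, a term occurring in both of them has df = 2, so idf = log(2/2) = 0 and its tf/idf weight is 0 no matter how large tf is; a term occurring in only one document has idf = log(2/1) ≈ 0.69, so its weight is 0.69 * tf.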

I was hoping for an efficient calculation with Lucene! I was unable to find an efficient calculation of the correct tf/idf weights.

Edit: I wrote this code to calculate the weights as tf/idf weights rather than as pure term frequencies. It works fairly well, but I wonder whether there is a more efficient way.

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.RealVector;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class CosineSimeTest {

    public static void main(String[] args) {
        try {
            CosineSimeTest cosSim = new CosineSimeTest("This is good", "This is good");
            System.out.println(cosSim.getCosineSimilarity());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static final String CONTENT = "Content";
    public static final int N = 2; // Total number of documents

    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineSimeTest(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Double> f1 = getWieghts(reader, 0);
        Map<String, Double> f2 = getWieghts(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        System.out.println("V1: " + v1);
        v2 = toRealVector(f2);
        System.out.println("V2: " + v2);
    }

    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored. */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        double dotProduct = v1.dotProduct(v2);
        System.out.println("Dot: " + dotProduct);
        System.out.println("V1_norm: " + v1.getNorm() + ", V2_norm: " + v2.getNorm());
        double normalization = (v1.getNorm() * v2.getNorm());
        System.out.println("Norm: " + normalization);
        return dotProduct / normalization;
    }

    Map<String, Double> getWieghts(IndexReader reader, int docId) throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        Map<String, Integer> docFrequencies = new HashMap<>();
        Map<String, Integer> termFrequencies = new HashMap<>();
        Map<String, Double> tf_Idf_Weights = new HashMap<>();

        TermsEnum termsEnum = vector.iterator(null);
        DocsEnum docsEnum = null;
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            // df: how many documents in the whole index contain this term
            docFrequencies.put(term, reader.docFreq(new Term(CONTENT, term)));
            // tf: how often this term occurs within this document
            docsEnum = termsEnum.docs(null, null);
            while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                termFrequencies.put(term, docsEnum.freq());
            }
            terms.add(term);
        }

        for (String term : docFrequencies.keySet()) {
            int tf = termFrequencies.get(term);
            int df = docFrequencies.get(term);
            double idf = (1 + Math.log(N) - Math.log(df));
            double w = tf * idf;
            tf_Idf_Weights.put(term, w);
            // System.out.printf("Term: %s - tf: %d, df: %d, idf: %f, w: %f\n", term, tf, df, idf, w);
        }

        System.out.println("Printing docFrequencies:");
        printMap(docFrequencies);
        System.out.println("Printing termFrequencies:");
        printMap(termFrequencies);
        System.out.println("Printing tf/idf weights:");
        printMapDouble(tf_Idf_Weights);
        return tf_Idf_Weights;
    }

    RealVector toRealVector(Map<String, Double> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        for (String term : terms) {
            double value = map.containsKey(term) ? map.get(term) : 0;
            vector.setEntry(i++, value);
        }
        return vector;
    }

    public static void printMap(Map<String, Integer> map) {
        for (String key : map.keySet()) {
            System.out.println("Term: " + key + ", value: " + map.get(key));
        }
    }

    public static void printMapDouble(Map<String, Double> map) {
        for (String key : map.keySet()) {
            System.out.println("Term: " + key + ", value: " + map.get(key));
        }
    }
}

Calculating cosine similarity in Lucene 4.x is different from 3.x. The following post explains in detail all the code needed to calculate cosine similarity in Lucene 4.10.2: ComputerGodzilla: Calculating Cosine Similarity in Lucene!

You can find a better solution at http://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53 . The steps are:

  • Java code that builds a term vector from content with the help of Lucene (see: http://lucene.apache.org/core/ ).
  • The cosine calculation between the two documents is done with the commons-math.jar library (see the sketch after this list).
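A minimal sketch of that second step (the vectors here are made-up toy values; in practice they would be the term vectors built in the first step):

import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.RealVector;

class CommonsMathCosineSketch {
    public static void main(String[] args) {
        RealVector a = new ArrayRealVector(new double[] {1, 2, 0});
        RealVector b = new ArrayRealVector(new double[] {0, 1, 1});
        // cosine = (a . b) / (|a| * |b|)
        double cosine = a.dotProduct(b) / (a.getNorm() * b.getNorm());
        System.out.println(cosine); // ~0.63 for these toy vectors
    }
}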

If you don't need to store documents in Lucene and just want to calculate the similarity between two documents, here is some faster code (Scala, from my blog post at http://chepurnoy.org/blog/2014/03/faster-cosine-similarity-between-two-dicuments-with-scala-and-lucene/ ).

def extractTerms(content: String): Map[String, Int] = {
    val analyzer = new StopAnalyzer(Version.LUCENE_46)
    val ts = new EnglishMinimalStemFilter(analyzer.tokenStream("c", content))
    val charTermAttribute = ts.addAttribute(classOf[CharTermAttribute])

    val m = scala.collection.mutable.Map[String, Int]()

    ts.reset()
    while (ts.incrementToken()) {
        val term = charTermAttribute.toString
        val newCount = m.get(term).map(_ + 1).getOrElse(1)
        m += term -> newCount
    }
    m.toMap
}

def similarity(t1: Map[String, Int], t2: Map[String, Int]): Double = {
    // word -> (t1 freq, t2 freq)
    val m = scala.collection.mutable.HashMap[String, (Int, Int)]()

    val sum1 = t1.foldLeft(0d) { case (sum, (word, freq)) =>
        m += word -> (freq, 0)
        sum + freq
    }

    val sum2 = t2.foldLeft(0d) { case (sum, (word, freq)) =>
        m.get(word) match {
            case Some((freq1, _)) => m += word -> (freq1, freq)
            case None => m += word -> (0, freq)
        }
        sum + freq
    }

    val (p1, p2, p3) = m.foldLeft((0d, 0d, 0d)) { case ((s1, s2, s3), e) =>
        val fs = e._2
        val f1 = fs._1 / sum1
        val f2 = fs._2 / sum2
        (s1 + f1 * f2, s2 + f1 * f1, s3 + f2 * f2)
    }

    val cos = p1 / (Math.sqrt(p2) * Math.sqrt(p3))
    cos
}

So, to calculate the similarity between text1 and text2, just call similarity(extractTerms(text1), extractTerms(text2)).