Generate N-grams from a sentence

How do I generate n-grams of a string, e.g.:

String Input="This is my car." 

I want to generate n-grams with this input:

 Input Ngram size = 3 

The output should be:

    This
    is
    my
    car
    This is
    is my
    my car
    This is my
    is my car

Please give some idea how to implement this in Java, or let me know if there is a library available for it.

I am trying to use this NGramTokenizer, but it produces n-grams of character sequences, and I want n-grams of word sequences.

You are looking for ShingleFilter.

Update: The link points to version 3.0.2. This class may live in a different package in newer versions of Lucene.
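For reference, here is a minimal sketch of how ShingleFilter can be wired up to emit word n-grams. It targets the newer Lucene analysis API (roughly 5.x and later, analyzers-common module); the tokenizer constructor and attribute classes differ in 3.0.2, so treat this as an outline rather than drop-in code:

    import java.io.StringReader;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ShingleDemo {
        public static void main(String[] args) throws Exception {
            // Tokenize on whitespace, then let ShingleFilter combine adjacent tokens
            WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
            tokenizer.setReader(new StringReader("This is my car."));

            // Emit shingles (word n-grams) of size 2 up to 3; unigrams are included by default
            ShingleFilter shingles = new ShingleFilter(tokenizer, 2, 3);
            CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);

            shingles.reset();
            while (shingles.incrementToken()) {
                System.out.println(term.toString());
            }
            shingles.end();
            shingles.close();
        }
    }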

I believe this would do what you want:

    import java.util.*;

    public class Test {

        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i + n));
            return ngrams;
        }

        public static String concat(String[] words, int start, int end) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < end; i++)
                sb.append((i > start ? " " : "") + words[i]);
            return sb.toString();
        }

        public static void main(String[] args) {
            for (int n = 1; n <= 3; n++) {
                for (String ngram : ngrams(n, "This is my car."))
                    System.out.println(ngram);
                System.out.println();
            }
        }
    }

Output:

    This
    is
    my
    car.

    This is
    is my
    my car.

    This is my
    is my car.

An "on demand" solution, implemented as an Iterator:

    class NgramIterator implements Iterator<String> {

        String[] words;
        int pos = 0, n;

        public NgramIterator(int n, String str) {
            this.n = n;
            words = str.split(" ");
        }

        public boolean hasNext() {
            return pos < words.length - n + 1;
        }

        public String next() {
            StringBuilder sb = new StringBuilder();
            for (int i = pos; i < pos + n; i++)
                sb.append((i > pos ? " " : "") + words[i]);
            pos++;
            return sb.toString();
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }
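A minimal usage sketch for this iterator (assumes java.util.Iterator is imported and the class above is available):

    // produces the trigrams of the sentence one at a time
    Iterator<String> it = new NgramIterator(3, "This is my car.");
    while (it.hasNext()) {
        System.out.println(it.next());
    }
    // prints:
    // This is my
    // is my car.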

This code returns an array of all Strings of the given length:

    public static String[] ngrams(String s, int len) {
        String[] parts = s.split(" ");
        String[] result = new String[parts.length - len + 1];
        for (int i = 0; i < parts.length - len + 1; i++) {
            StringBuilder sb = new StringBuilder();
            for (int k = 0; k < len; k++) {
                if (k > 0) sb.append(' ');
                sb.append(parts[i + k]);
            }
            result[i] = sb.toString();
        }
        return result;
    }

For example:

    System.out.println(Arrays.toString(ngrams("This is my car", 2)));
    //--> [This is, is my, my car]
    System.out.println(Arrays.toString(ngrams("This is my car", 3)));
    //--> [This is my, is my car]
    /**
     * @param str         the sentence; should contain at least one word
     * @param maxGramSize should be at least 1
     * @return list of contiguous word n-grams up to maxGramSize from the sentence
     */
    public static List<String> generateNgramsUpto(String str, int maxGramSize) {

        List<String> sentence = Arrays.asList(str.split("[\\W+]"));

        List<String> ngrams = new ArrayList<String>();
        int ngramSize = 0;
        StringBuilder sb = null;

        // sentence becomes ngrams
        for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
            String word = (String) it.next();

            // 1 - add the word itself
            sb = new StringBuilder(word);
            ngrams.add(word);
            ngramSize = 1;
            it.previous();

            // 2 - prepend the preceding words and add those n-grams too
            while (it.hasPrevious() && ngramSize < maxGramSize) {
                sb.insert(0, ' ');
                sb.insert(0, it.previous());
                ngrams.add(sb.toString());
                ngramSize++;
            }

            // go back to the initial position
            while (ngramSize > 0) {
                ngramSize--;
                it.next();
            }
        }
        return ngrams;
    }

Call:

    long startTime = System.currentTimeMillis();
    ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
    long stopTime = System.currentTimeMillis();
    System.out.println("My time = " + (stopTime - startTime) + " ms with ngramsize = " + ngrams.size());
    System.out.println(ngrams.toString());

Output:

    My time = 1 ms with ngramsize = 9
    [This, is, This is, my, is my, This is my, car, my car, is my car]

    public static void CreateNgram(ArrayList<String> list, int cutoff) {
        try {
            NGramModel ngramModel = new NGramModel();

            POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);
            perfMon.start();

            for (int i = 0; i < list.size(); i++) {
                String inputString = list.get(i);
                ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));

                String line;
                while ((line = lineStream.read()) != null) {
                    String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
                    String[] tags = tagger.tag(whitespaceTokenizerLine);

                    POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
                    perfMon.incrementCounter();

                    String words[] = sample.getSentence();
                    if (words.length > 0) {
                        for (int k = 2; k < 4; k++) {
                            ngramModel.add(new StringList(words), k, k);
                        }
                    }
                }
            }

            ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
            Iterator<StringList> it = ngramModel.iterator();
            while (it.hasNext()) {
                StringList strList = it.next();
                System.out.println(strList.toString());
            }
            perfMon.stopAndPrintFinalResult();
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }

This is my code for creating n-grams. In this case, n = 2 and 3. N-gram word sequences whose count is below the cutoff are dropped from the result set. The input is a list of sentences, which are then parsed with the OpenNLP tools.
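If you only need the n-gram counting and cutoff part without the POS tagging, a stripped-down sketch using just OpenNLP's NGramModel could look like this (class and method names as found in OpenNLP 1.5.x/1.6.x; verify against the version you use):

    import java.util.Iterator;
    import opennlp.tools.ngram.NGramModel;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.StringList;

    public class NGramModelDemo {
        public static void main(String[] args) {
            NGramModel ngramModel = new NGramModel();

            // tokenize one sentence and register its bigrams and trigrams
            String[] words = WhitespaceTokenizer.INSTANCE.tokenize("This is my car.");
            ngramModel.add(new StringList(words), 2, 3);

            // drop n-grams seen fewer times than the cutoff (1 keeps everything seen at least once)
            ngramModel.cutoff(1, Integer.MAX_VALUE);

            for (Iterator<StringList> it = ngramModel.iterator(); it.hasNext(); ) {
                StringList ngram = it.next();
                System.out.println(ngram + " -> " + ngramModel.getCount(ngram));
            }
        }
    }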

    public static void main(String[] args) {
        String[] words = "This is my car.".split(" ");
        for (int n = 0; n < 3; n++) {
            List<String> list = ngrams(n, words);
            for (String ngram : list) {
                System.out.println(ngram);
            }
            System.out.println();
        }
    }

    // stepSize is the number of words appended after the starting word,
    // so each generated n-gram has stepSize + 1 words (and keeps a leading space)
    public static List<String> ngrams(int stepSize, String[] words) {
        List<String> ngrams = new ArrayList<String>();
        for (int i = 0; i < words.length - stepSize; i++) {
            String initialWord = "";
            int internalCount = i;
            int internalStepSize = i + stepSize;
            while (internalCount <= internalStepSize && internalCount < words.length) {
                initialWord = initialWord + " " + words[internalCount];
                ++internalCount;
            }
            ngrams.add(initialWord);
        }
        return ngrams;
    }