Python: four-grams, five-grams, six-grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like this:

    import nltk
    from nltk import bigrams

    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print(list(string_bigrams))

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?

Thanks!

Great native Python answers have been given by other users. But here's the NLTK approach (just in case, since the OP might get penalized for reinventing what already exists in the NLTK library).

There is an ngram module that people seldom use in NLTK ( http://www.nltk.org/_modules/nltk/model/ngram.html ). It's not because reading ngrams is hard, but training a model based on ngrams where n > 3 will result in a lot of data sparsity.

    from nltk import ngrams

    sentence = 'this is a foo bar sentences and i want to ngramize it'
    n = 6
    sixgrams = ngrams(sentence.split(), n)
    for grams in sixgrams:
        print(grams)
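If you also want the n-grams that run over the sentence boundaries, nltk.util.ngrams accepts padding keywords; a small sketch (the pad symbols below are my own choice):

    from nltk import ngrams

    sentence = 'this is a foo bar sentences and i want to ngramize it'
    # pad_left/pad_right add boundary symbols so the leading and trailing
    # words also appear in every position of an n-gram
    padded_trigrams = ngrams(sentence.split(), 3,
                             pad_left=True, pad_right=True,
                             left_pad_symbol='<s>', right_pad_symbol='</s>')
    for grams in padded_trigrams:
        print(grams)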

I'm surprised that this hasn't shown up yet:

    In [34]: sentence = "I really like python, it's pretty awesome.".split()

    In [35]: N = 4

    In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

    In [37]: for gram in grams: print(gram)
    ['I', 'really', 'like', 'python,']
    ['really', 'like', 'python,', "it's"]
    ['like', 'python,', "it's", 'pretty']
    ['python,', "it's", 'pretty', 'awesome.']

Here is another simple way of doing n-grams:

    >>> import nltk
    >>> from nltk.util import ngrams
    >>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
    >>> tokenize = nltk.word_tokenize(text)
    >>> tokenize
    ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    >>> bigrams = list(ngrams(tokenize, 2))
    >>> bigrams
    [('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
    >>> trigrams = list(ngrams(tokenize, 3))
    >>> trigrams
    [('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
    >>> fourgrams = list(ngrams(tokenize, 4))
    >>> fourgrams
    [('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

Using only nltk tools:

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def get_ngrams(text, n):
        n_grams = ngrams(word_tokenize(text), n)
        return [' '.join(grams) for grams in n_grams]

Example output:

    get_ngrams('This is the simplest text i could think of', 3)
    ['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

In order to keep the ngrams in array format, just remove the ' '.join, as in the sketch below.
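A minimal sketch of that variant (the helper name get_ngram_tuples is mine):

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def get_ngram_tuples(text, n):
        # same as get_ngrams above, minus the ' '.join
        return list(ngrams(word_tokenize(text), n))

    get_ngram_tuples('This is the simplest text i could think of', 3)
    # [('This', 'is', 'the'), ('is', 'the', 'simplest'), ('the', 'simplest', 'text'),
    #  ('simplest', 'text', 'i'), ('text', 'i', 'could'), ('i', 'could', 'think'),
    #  ('could', 'think', 'of')]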

You can easily whip up your own function to do this using itertools:

    from itertools import islice, tee

    s = 'spam and eggs'
    N = 3
    # zip over N copies of the iterable, each shifted one position further
    trigrams = zip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
    list(trigrams)
    # [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
    #  ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
    #  ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
    #  ('e', 'g', 'g'), ('g', 'g', 's')]
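The same pattern works on words instead of characters if you tee a token list; a quick sketch (splitting on whitespace is my assumption here):

    from itertools import islice, tee

    words = 'spam and eggs are tasty'.split()
    N = 3
    word_trigrams = zip(*(islice(seq, i, None) for i, seq in enumerate(tee(words, N))))
    list(word_trigrams)
    # [('spam', 'and', 'eggs'), ('and', 'eggs', 'are'), ('eggs', 'are', 'tasty')]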

I have never dealt with nltk, but I did N-grams as part of some small class project. If you want to find the frequency of all N-grams occurring in a string, here is a way to do that. D will give you the histogram of your N-word grams.

    D = dict()
    string = 'whatever string...'
    strparts = string.split()
    N = 2  # the gram size; pick whatever n you need
    for i in range(len(strparts) - N + 1):  # N-grams; +1 so the last gram is included
        try:
            D[tuple(strparts[i:i+N])] += 1
        except KeyError:
            D[tuple(strparts[i:i+N])] = 1
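The same histogram can be built more compactly with collections.Counter; a minimal sketch under the same assumptions (whitespace tokens, N chosen beforehand):

    from collections import Counter

    string = 'whatever string...'
    N = 2  # assumed gram size
    strparts = string.split()
    D = Counter(tuple(strparts[i:i+N]) for i in range(len(strparts) - N + 1))
    # D maps each N-gram tuple to its count; D.most_common(5) gives the top five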

For four_grams it is already in NLTK, and here is a piece of code that can help you toward this:

    import nltk
    from nltk.collocations import *

    # You should tokenize your text
    text = "I do not like green eggs and ham, I do not like them Sam I am!"
    tokens = nltk.wordpunct_tokenize(text)
    fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
    for fourgram, freq in fourgrams.ngram_fd.items():
        print(fourgram, freq)
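Since ngram_fd is a FreqDist (a Counter subclass), you can also ask it directly for the most frequent four-grams; a short sketch reusing the fourgrams finder from above:

    # top 5 four-grams by raw frequency
    for fourgram, freq in fourgrams.ngram_fd.most_common(5):
        print(fourgram, freq)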

I hope it helps.

A more elegant approach to build bigrams with python's builtin zip(). Simply convert the original string into a list by split(), then pass the list once normally and once offset by one element.

 string = "I really like python, it's pretty awesome." def find_bigrams(s): input_list = s.split(" ") return zip(input_list, input_list[1:]) def find_ngrams(s, n): input_list = s.split(" ") return zip(*[input_list[i:] for i in range(n)]) find_bigrams(string) [('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')] 

You can use sklearn.feature_extraction.text.CountVectorizer:

    import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html

    ngram_size = 4
    string = ["I really like python, it's pretty awesome."]
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
    vect.fit(string)
    print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

Output:

    4-grams: ['like python it pretty', 'python it pretty awesome', 'really like python it']

You can set ngram_size to any positive integer. That is, you can split a text into four-grams, five-grams or even hundred-grams.
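Since ngram_range takes a lower and an upper bound, one vectorizer can also collect several sizes in a single pass; a small sketch (note that newer scikit-learn versions renamed get_feature_names to get_feature_names_out, which is what I use here):

    import sklearn.feature_extraction.text

    string = ["I really like python, it's pretty awesome."]
    # everything from unigrams up to five-grams in one vocabulary
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1, 5))
    vect.fit(string)
    print(vect.get_feature_names_out())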

Nltk is great, but sometimes it is an overhead for some projects:

    import re

    def tokenize(text, ngrams=1):
        # strip punctuation, collapse whitespace, then split into tokens
        text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
        text = re.sub(r'\s+', ' ', text)
        tokens = text.split()
        return [tuple(tokens[i:i+ngrams]) for i in range(len(tokens)-ngrams+1)]

Example usage:

 >> text = "This is an example text" >> tokenize(text, 2) [('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')] >> tokenize(text, 3) [('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]