Python: four-grams, five-grams, six-grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like this:

    import nltk
    from nltk import bigrams

    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print(list(string_bigrams))

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?

Thanks!

Great native Python answers have been given by other users. But here's the NLTK approach (just in case, since the OP might get penalized for reinventing what already exists in the NLTK library).

There is an ngram module that people seldom use in NLTK ( http://www.nltk.org/_modules/nltk/model/ngram.html ). It's not because reading ngrams is hard, but training a model based on ngrams where n > 3 will result in a lot of data sparsity.

    from nltk import ngrams

    sentence = 'this is a foo bar sentences and i want to ngramize it'
    n = 6
    sixgrams = ngrams(sentence.split(), n)
    for grams in sixgrams:
        print(grams)
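If you also want the n-grams that run over the sentence boundaries, nltk.util.ngrams accepts padding keywords; a small sketch (the pad symbols below are my own choice):

    from nltk import ngrams

    sentence = 'this is a foo bar sentences and i want to ngramize it'
    # pad_left/pad_right add boundary symbols so the leading and trailing
    # words also appear in every position of an n-gram
    padded_trigrams = ngrams(sentence.split(), 3,
                             pad_left=True, pad_right=True,
                             left_pad_symbol='<s>', right_pad_symbol='</s>')
    for grams in padded_trigrams:
        print(grams)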

I'm surprised that this hasn't shown up yet:

    In [34]: sentence = "I really like python, it's pretty awesome.".split()

    In [35]: N = 4

    In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

    In [37]: for gram in grams: print(gram)
    ['I', 'really', 'like', 'python,']
    ['really', 'like', 'python,', "it's"]
    ['like', 'python,', "it's", 'pretty']
    ['python,', "it's", 'pretty', 'awesome.']

Here is another simple way of doing n-grams:

    >>> import nltk
    >>> from nltk.util import ngrams
    >>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
    >>> tokenize = nltk.word_tokenize(text)
    >>> tokenize
    ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    >>> bigrams = list(ngrams(tokenize, 2))
    >>> bigrams
    [('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
    >>> trigrams = list(ngrams(tokenize, 3))
    >>> trigrams
    [('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
    >>> fourgrams = list(ngrams(tokenize, 4))
    >>> fourgrams
    [('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

Using only nltk tools:

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def get_ngrams(text, n):
        n_grams = ngrams(word_tokenize(text), n)
        return [' '.join(grams) for grams in n_grams]

Example output:

    get_ngrams('This is the simplest text i could think of', 3)
    ['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

In order to keep the ngrams in array format, just remove the ' '.join, as in the sketch below.
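A minimal sketch of that variant (the helper name get_ngram_tuples is mine):

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def get_ngram_tuples(text, n):
        # same as get_ngrams above, minus the ' '.join
        return list(ngrams(word_tokenize(text), n))

    get_ngram_tuples('This is the simplest text i could think of', 3)
    # [('This', 'is', 'the'), ('is', 'the', 'simplest'), ('the', 'simplest', 'text'),
    #  ('simplest', 'text', 'i'), ('text', 'i', 'could'), ('i', 'could', 'think'),
    #  ('could', 'think', 'of')]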

You can easily whip up your own function to do this using itertools:

    from itertools import islice, tee

    s = 'spam and eggs'
    N = 3
    # zip over N copies of the iterable, each shifted one position further
    trigrams = zip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
    list(trigrams)
    # [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
    #  ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
    #  ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
    #  ('e', 'g', 'g'), ('g', 'g', 's')]
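The same pattern works on words instead of characters if you tee a token list; a quick sketch (splitting on whitespace is my assumption here):

    from itertools import islice, tee

    words = 'spam and eggs are tasty'.split()
    N = 3
    word_trigrams = zip(*(islice(seq, i, None) for i, seq in enumerate(tee(words, N))))
    list(word_trigrams)
    # [('spam', 'and', 'eggs'), ('and', 'eggs', 'are'), ('eggs', 'are', 'tasty')]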

I have never dealt with nltk, but I did N-grams as part of some small class project. If you want to find the frequency of all N-grams occurring in a string, here is a way to do that. D will give you the histogram of your N-word grams.

    D = dict()
    string = 'whatever string...'
    strparts = string.split()
    N = 2  # the gram size; pick whatever n you need
    for i in range(len(strparts) - N + 1):  # N-grams; +1 so the last gram is included
        try:
            D[tuple(strparts[i:i+N])] += 1
        except KeyError:
            D[tuple(strparts[i:i+N])] = 1
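The same histogram can be built more compactly with collections.Counter; a minimal sketch under the same assumptions (whitespace tokens, N chosen beforehand):

    from collections import Counter

    string = 'whatever string...'
    N = 2  # assumed gram size
    strparts = string.split()
    D = Counter(tuple(strparts[i:i+N]) for i in range(len(strparts) - N + 1))
    # D maps each N-gram tuple to its count; D.most_common(5) gives the top five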

For four_grams it is already in NLTK, and here is a piece of code that can help you toward this:

    import nltk
    from nltk.collocations import *

    # You should tokenize your text
    text = "I do not like green eggs and ham, I do not like them Sam I am!"
    tokens = nltk.wordpunct_tokenize(text)
    fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
    for fourgram, freq in fourgrams.ngram_fd.items():
        print(fourgram, freq)
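Since ngram_fd is a FreqDist (a Counter subclass), you can also ask it directly for the most frequent four-grams; a short sketch reusing the fourgrams finder from above:

    # top 5 four-grams by raw frequency
    for fourgram, freq in fourgrams.ngram_fd.most_common(5):
        print(fourgram, freq)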

I hope it helps.

A more elegant approach to build bigrams with python's builtin zip(). Simply convert the original string into a list by split(), then pass the list once normally and once offset by one element.

 string = "I really like python, it's pretty awesome." def find_bigrams(s): input_list = s.split(" ") return zip(input_list, input_list[1:]) def find_ngrams(s, n): input_list = s.split(" ") return zip(*[input_list[i:] for i in range(n)]) find_bigrams(string) [('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')] 

You can use sklearn.feature_extraction.text.CountVectorizer:

    import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html

    ngram_size = 4
    string = ["I really like python, it's pretty awesome."]
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
    vect.fit(string)
    print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

Output:

    4-grams: ['like python it pretty', 'python it pretty awesome', 'really like python it']

You can set ngram_size to any positive integer. That is, you can split a text into four-grams, five-grams or even hundred-grams.
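Since ngram_range takes a lower and an upper bound, one vectorizer can also collect several sizes in a single pass; a small sketch (note that newer scikit-learn versions renamed get_feature_names to get_feature_names_out, which is what I use here):

    import sklearn.feature_extraction.text

    string = ["I really like python, it's pretty awesome."]
    # everything from unigrams up to five-grams in one vocabulary
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1, 5))
    vect.fit(string)
    print(vect.get_feature_names_out())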

Nltk is great, but sometimes it is an overhead for some projects:

    import re

    def tokenize(text, ngrams=1):
        # strip punctuation, collapse whitespace, then split into tokens
        text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
        text = re.sub(r'\s+', ' ', text)
        tokens = text.split()
        return [tuple(tokens[i:i+ngrams]) for i in range(len(tokens)-ngrams+1)]

Example usage:

 >> text = "This is an example text" >> tokenize(text, 2) [('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')] >> tokenize(text, 3) [('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]