Training a NaiveBayesClassifier for sentiment analysis

I am using sentences to train a NaiveBayesClassifier in Python, and it gives me the error below. I don't understand what the error means, so any help would be appreciated.

I have already tried many other input formats, but the error persists. The code is as follows:

    from text.classifiers import NaiveBayesClassifier
    from text.blob import TextBlob

    train = [('I love this sandwich.', 'pos'),
             ('This is an amazing place!', 'pos'),
             ('I feel very good about these beers.', 'pos'),
             ('This is my best work.', 'pos'),
             ("What an awesome view", 'pos'),
             ('I do not like this restaurant', 'neg'),
             ('I am tired of this stuff.', 'neg'),
             ("I can't deal with this", 'neg'),
             ('He is my sworn enemy!', 'neg'),
             ('My boss is horrible.', 'neg')]
    test = [('The beer was good.', 'pos'),
            ('I do not enjoy my job', 'neg'),
            ("I ain't feeling dandy today.", 'neg'),
            ("I feel amazing!", 'pos'),
            ('Gary is a friend of mine.', 'pos'),
            ("I can't believe I'm doing this.", 'neg')]

    classifier = nltk.NaiveBayesClassifier.train(train)

I have included the traceback below.

    Traceback (most recent call last):
      File "C:\Users\5460\Desktop\train01.py", line 15, in <module>
        all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
      File "C:\Users\5460\Desktop\train01.py", line 15, in <genexpr>
        all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
      File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 87, in word_tokenize
        return _word_tokenize(text)
      File "C:\Python27\lib\site-packages\nltk\tokenize\treebank.py", line 67, in tokenize
        text = re.sub(r'^\"', r'``', text)
      File "C:\Python27\lib\re.py", line 151, in sub
        return _compile(pattern, flags).sub(repl, string, count)
    TypeError: expected string or buffer

You need to change your data structure. Here is your train list as it currently stands:

    >>> train = [('I love this sandwich.', 'pos'), ('This is an amazing place!', 'pos'),
    ...          ('I feel very good about these beers.', 'pos'), ('This is my best work.', 'pos'),
    ...          ("What an awesome view", 'pos'), ('I do not like this restaurant', 'neg'),
    ...          ('I am tired of this stuff.', 'neg'), ("I can't deal with this", 'neg'),
    ...          ('He is my sworn enemy!', 'neg'), ('My boss is horrible.', 'neg')]

The problem is that the first element of each tuple needs to be a dictionary of features, not a plain string. So let's turn your list into a data structure the classifier can work with:

    >>> from nltk.tokenize import word_tokenize  # or use some other tokenizer
    >>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
    >>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

Your data should now be structured like this:

    >>> t
    [({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False,
       'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False,
       'best': False, '!': False, 'what': False, '.': True, 'amazing': False,
       'horrible': False, 'sworn': False, 'awesome': False, 'do': False,
       'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False,
       'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False,
       'restaurant': False, 'these': False, 'of': False, 'work': False,
       "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False,
       'view': False}, 'pos'),
     . . .]

Note that the first element of each tuple is now a dictionary of features. With your data in place, you can train the classifier like so:

    >>> import nltk
    >>> classifier = nltk.NaiveBayesClassifier.train(t)
    >>> classifier.show_most_informative_features()
    Most Informative Features
                        this = True              neg : pos    =      2.3 : 1.0
                        this = False             pos : neg    =      1.8 : 1.0
                          an = False             neg : pos    =      1.6 : 1.0
                           . = True              pos : neg    =      1.4 : 1.0
                           . = False             neg : pos    =      1.4 : 1.0
                     awesome = False             neg : pos    =      1.2 : 1.0
                          of = False             pos : neg    =      1.2 : 1.0
                        feel = False             neg : pos    =      1.2 : 1.0
                       place = False             neg : pos    =      1.2 : 1.0
                    horrible = False             pos : neg    =      1.2 : 1.0

If you want to use the classifier, you can do it like this. First, start with a test sentence:

    >>> test_sentence = "This is the best band I've ever heard!"

Then you tokenize the sentence and figure out which words it shares with all_words. Those words constitute the sentence's features.

    >>> test_sent_features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words}

Your features now look like this:

    >>> test_sent_features
    {'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True,
     'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True,
     '!': True, 'what': False, 'i': True, '.': False, 'amazing': False,
     'horrible': False, 'sworn': False, 'awesome': False, 'do': False,
     'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False,
     'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False,
     'restaurant': False, 'this': True, 'of': False, 'work': False,
     "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False,
     'view': False}

Then you simply classify those features:

    >>> classifier.classify(test_sent_features)  # note that 'best' is True in the features above
    'pos'

This test sentence appears to be positive.
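The question also defines a labeled test list that goes unused above. If you want a rough accuracy number as well, you can featurize those test pairs with the same all_words vocabulary and score them with nltk.classify.accuracy. This is a minimal sketch rather than part of the original answer; it assumes test from the question is defined in your session, and test_feats is just an illustrative name:

    >>> test_feats = [({word: (word in word_tokenize(sent.lower())) for word in all_words}, label)
    ...               for sent, label in test]
    >>> print nltk.classify.accuracy(classifier, test_feats)  # fraction of test pairs labeled correctly

With a training set this tiny the number will be noisy, but it is a quick sanity check.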

@275365's walkthrough of the data structure for NLTK's Bayes classifier is great. From a higher level, we can look at it as follows.

We have input sentences annotated with sentiment tags:

    training_data = [('I love this sandwich.', 'pos'),
                     ('This is an amazing place!', 'pos'),
                     ('I feel very good about these beers.', 'pos'),
                     ('This is my best work.', 'pos'),
                     ("What an awesome view", 'pos'),
                     ('I do not like this restaurant', 'neg'),
                     ('I am tired of this stuff.', 'neg'),
                     ("I can't deal with this", 'neg'),
                     ('He is my sworn enemy!', 'neg'),
                     ('My boss is horrible.', 'neg')]

Let's consider our feature set to be individual words, so we extract a list of all possible words from the training data (let's call it the vocabulary):

    from nltk.tokenize import word_tokenize
    from itertools import chain
    vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

Essentially, vocabulary here is the same as @275365's all_words:

    >>> all_words = set(word.lower() for passage in training_data for word in word_tokenize(passage[0]))
    >>> vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
    >>> print vocabulary == all_words
    True

For each data point (i.e. each sentence and its pos/neg tag), we want to say whether a feature (i.e. a word from the vocabulary) is present or not.

    >>> sentence = word_tokenize('I love this sandwich.'.lower())
    >>> print {i: True for i in vocabulary if i in sentence}
    {'this': True, 'i': True, 'sandwich': True, 'love': True, '.': True}

But we also want to tell the classifier which words are absent from the sentence yet present in the vocabulary, so for each data point we list out all possible words in the vocabulary and say whether each one is present or not:

    >>> sentence = word_tokenize('I love this sandwich.'.lower())
    >>> x = {i: True for i in vocabulary if i in sentence}
    >>> y = {i: False for i in vocabulary if i not in sentence}
    >>> x.update(y)
    >>> print x
    {'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False,
     'am': False, 'an': False, 'good': False, 'best': False, '!': False,
     'these': False, 'what': False, '.': True, 'amazing': False,
     'horrible': False, 'sworn': False, 'ca': False, 'do': False,
     'sandwich': True, 'very': False, 'boss': False, 'beers': False,
     'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False,
     'like': False, 'restaurant': False, 'this': True, 'of': False,
     'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False,
     'my': False, 'awesome': False, 'view': False}

However, since that loops through the vocabulary twice, it is more efficient to do this:

    >>> sentence = word_tokenize('I love this sandwich.'.lower())
    >>> x = {i: (i in sentence) for i in vocabulary}
    >>> print x
    {'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False,
     'am': False, 'an': False, 'good': False, 'best': False, '!': False,
     'these': False, 'what': False, '.': True, 'amazing': False,
     'horrible': False, 'sworn': False, 'ca': False, 'do': False,
     'sandwich': True, 'very': False, 'boss': False, 'beers': False,
     'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False,
     'like': False, 'restaurant': False, 'this': True, 'of': False,
     'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False,
     'my': False, 'awesome': False, 'view': False}

So, for each sentence, we want to tell the classifier which words are present, which are absent, and also give it the pos/neg tag. We can call that a feature_set: a list of tuples, each made up of an x (as shown above) and its tag.

    >>> feature_set = [({i: (i in word_tokenize(sentence.lower())) for i in vocabulary}, tag)
    ...                for sentence, tag in training_data]
    >>> feature_set
    [({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False,
       'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False,
       'best': False, '!': False, 'what': False, '.': True, 'amazing': False,
       'horrible': False, 'sworn': False, 'awesome': False, 'do': False,
       'good': False, 'very': False, 'boss': False, 'beers': False,
       'not': False, 'with': False, 'he': False, 'enemy': False,
       'about': False, 'like': False, 'restaurant': False, 'these': False,
       'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False,
       'place': False, 'my': False, 'view': False}, 'pos'),
     ...]

Then we feed the features and tags in the feature_set to the classifier to train it:

    from nltk import NaiveBayesClassifier as nbc
    classifier = nbc.train(feature_set)

Now you have a trained classifier, and when you want to tag a new sentence, you have to "featurize" it to see which of its words are in the vocabulary the classifier was trained on:

    >>> test_sentence = "This is the best band I've ever heard! foobar"
    >>> featurized_test_sentence = {i: (i in word_tokenize(test_sentence.lower())) for i in vocabulary}

Note: as you can see from the step above, the naive Bayes classifier cannot handle out-of-vocabulary words; the foobar token disappears once you featurize the sentence, as the check below shows.
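To see this concretely, here is a small check against the featurized_test_sentence built above: the feature dictionary only has keys drawn from the vocabulary, so foobar never reaches the classifier.

    >>> 'foobar' in featurized_test_sentence  # only vocabulary words become feature keys
    False
    >>> 'foobar' in vocabulary
    False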

Then feed the featurized test sentence to the classifier and ask it to classify:

    >>> classifier.classify(featurized_test_sentence)
    'pos'
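If you want the label probabilities rather than just the winning label, NLTK's NaiveBayesClassifier also exposes prob_classify. This call is not part of the original walkthrough, but it takes the same feature dictionary:

    >>> dist = classifier.prob_classify(featurized_test_sentence)
    >>> print dist.prob('pos'), dist.prob('neg')  # the two probabilities sum to 1.0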

Hopefully this makes it clearer how to feed data into NLTK's naive Bayes classifier for sentiment analysis. Here is the full code, without the comments and the walkthrough:

    from nltk import NaiveBayesClassifier as nbc
    from nltk.tokenize import word_tokenize
    from itertools import chain

    training_data = [('I love this sandwich.', 'pos'),
                     ('This is an amazing place!', 'pos'),
                     ('I feel very good about these beers.', 'pos'),
                     ('This is my best work.', 'pos'),
                     ("What an awesome view", 'pos'),
                     ('I do not like this restaurant', 'neg'),
                     ('I am tired of this stuff.', 'neg'),
                     ("I can't deal with this", 'neg'),
                     ('He is my sworn enemy!', 'neg'),
                     ('My boss is horrible.', 'neg')]

    vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

    feature_set = [({i: (i in word_tokenize(sentence.lower())) for i in vocabulary}, tag)
                   for sentence, tag in training_data]

    classifier = nbc.train(feature_set)

    test_sentence = "This is the best band I've ever heard!"
    featurized_test_sentence = {i: (i in word_tokenize(test_sentence.lower())) for i in vocabulary}

    print "test_sent:", test_sentence
    print "tag:", classifier.classify(featurized_test_sentence)

It looks like you are trying to use TextBlob but are training the NLTK NaiveBayesClassifier, which, as pointed out in the other answers, must be passed a dictionary of features.

TextBlob has a default feature extractor that indicates which words in the training set are contained in the document (as demonstrated in the other answers). Therefore, TextBlob lets you pass in your data as-is.

    from textblob.classifiers import NaiveBayesClassifier

    train = [('This is an amazing place!', 'pos'),
             ('I feel very good about these beers.', 'pos'),
             ('This is my best work.', 'pos'),
             ("What an awesome view", 'pos'),
             ('I do not like this restaurant', 'neg'),
             ('I am tired of this stuff.', 'neg'),
             ("I can't deal with this", 'neg'),
             ('He is my sworn enemy!', 'neg'),
             ('My boss is horrible.', 'neg')]
    test = [('The beer was good.', 'pos'),
            ('I do not enjoy my job', 'neg'),
            ("I ain't feeling dandy today.", 'neg'),
            ("I feel amazing!", 'pos'),
            ('Gary is a friend of mine.', 'pos'),
            ("I can't believe I'm doing this.", 'neg')]

    classifier = NaiveBayesClassifier(train)  # Pass in data as is
    # When classifying text, features are extracted automatically
    classifier.classify("This is an amazing library!")  # => 'pos'
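As a brief follow-up, a TextBlob classifier can also score a labeled test set directly through its accuracy method, so the test list above need not go to waste; a one-line sketch:

    print classifier.accuracy(test)  # fraction of the labeled test pairs classified correctly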

Of course, the simple default extractor is not appropriate for all problems. If you would like to control how features are extracted, you just write a function that takes a string of text as input and outputs the dictionary of features, then pass it to the classifier, as shown below.

    classifier = NaiveBayesClassifier(train, feature_extractor=my_extractor_func)
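For illustration, here is a minimal sketch of what such an extractor might look like. The name my_extractor_func comes from the line above, and the choice of lowercased word-presence features is just an assumption; any function mapping a string to a dictionary of features will do:

    def my_extractor_func(text):
        # Hypothetical extractor: mark which lowercased whitespace-separated
        # tokens appear in the text. Swap in any features you like.
        tokens = text.lower().split()
        return {'contains({0})'.format(word): True for word in set(tokens)}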

I encourage you to check out this short TextBlob classifier tutorial: http://textblob.readthedocs.org/en/latest/classifiers.html