WordNet lemmatization and POS tagging in Python

I want to use the WordNet lemmatizer in Python, and I have learned that the default POS tag is NOUN, and that it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.
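For example, here is a minimal illustration of that default behavior (assuming NLTK and the WordNet data are installed):

    from nltk.stem import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()
    print(lmtzr.lemmatize('loving'))       # 'loving' -- treated as a noun by default
    print(lmtzr.lemmatize('loving', 'v'))  # 'love'   -- correct lemma once tagged as a verb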

My question is: what is the best way to perform the above lemmatization accurately?

I did the POS tagging using nltk.pos_tag, and I am lost in mapping the Treebank POS tags to WordNet-compatible POS tags. Please help:

    from nltk.stem.wordnet import WordNetLemmatizer
    lmtzr = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)

I get the output tags in NN, JJ, VB, RB. How do I change these to WordNet-compatible tags?

Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data?

First of all, you can use nltk.pos_tag() directly without training it. The function will load a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER:

    nltk.tag._POS_TAGGER
    >>> 'taggers/maxent_treebank_pos_tagger/english.pickle'

As it was trained with the Treebank corpus, it also uses the Treebank tag set.
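For illustration, tagging a short sentence of my own shows the Treebank-style tags (the exact tags may vary with the tagger version):

    import nltk

    tokens = nltk.word_tokenize('The cat is sitting with the bats on the striped mat')
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ...]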

The following function maps a Treebank tag to a WordNet part-of-speech name:

    from nltk.corpus import wordnet

    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''

You can then use the return value with the lemmatizer:

    from nltk.stem.wordnet import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatizer.lemmatize('going', wordnet.VERB)
    >>> 'go'

These constants are defined in the source code of nltk.corpus.reader.wordnet ( http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html ):

    #{ Part-of-speech constants
    ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    #}
    POS_LIST = [NOUN, VERB, ADJ, ADV]
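Since these constants are plain one-letter strings, the if/elif chain above can also be written as a small lookup table. This is my own variant, not part of the quoted answer; it falls back to NOUN (the lemmatizer's default) instead of an empty string, which avoids passing an invalid POS to the lemmatizer:

    from nltk.corpus import wordnet

    # first letter of the Treebank tag -> WordNet POS constant
    TREEBANK_TO_WORDNET = {'J': wordnet.ADJ, 'V': wordnet.VERB,
                           'N': wordnet.NOUN, 'R': wordnet.ADV}

    def get_wordnet_pos(treebank_tag):
        # unknown tags default to NOUN, the lemmatizer's own default
        return TREEBANK_TO_WORDNET.get(treebank_tag[:1], wordnet.NOUN)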

@Suzana_K's solution works, but there are some cases that result in a KeyError, as @Clock Slave mentioned.

Convert Treebank tags to WordNet tags:

    from nltk.corpus import wordnet

    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return None  # for easy if-statement

Now, we only pass the POS into the lemmatize function if we have a WordNet tag:

    from nltk.stem.wordnet import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)
    for word, tag in tagged:
        wntag = get_wordnet_pos(tag)
        if wntag is None:  # do not supply a tag in case of None
            lemma = lemmatizer.lemmatize(word)
        else:
            lemma = lemmatizer.lemmatize(word, pos=wntag)
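A quick end-to-end run on a made-up sentence (the exact output depends on the tagger):

    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize('The striped bats are hanging on their feet')
    lemmas = []
    for word, tag in nltk.pos_tag(tokens):
        wntag = get_wordnet_pos(tag)
        # fall back to the plain lemmatizer when there is no WordNet tag
        lemmas.append(lemmatizer.lemmatize(word) if wntag is None
                      else lemmatizer.lemmatize(word, pos=wntag))
    print(lemmas)
    # expect something like ['The', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot']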

Conversion steps: Document -> Sentences -> Tokens -> POS -> Lemmas

    import nltk
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet

    # example text
    text = 'What can I say about this place. The staff of these restaurants is nice and the eggplant is not bad'

    class Splitter(object):
        """
        Split the document into sentences and tokenize each sentence.
        """
        def __init__(self):
            self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
            self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

        def split(self, text):
            """
            out : ['What', 'can', 'I', 'say', 'about', 'this', 'place', '.']
            """
            # split into single sentences
            sentences = self.splitter.tokenize(text)
            # tokenize each sentence
            tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
            return tokens

    class LemmatizationWithPOSTagger(object):
        def __init__(self):
            pass

        def get_wordnet_pos(self, treebank_tag):
            """
            Map a Treebank tag to the WordNet POS constants (a, n, r, v)
            expected by the lemmatizer.
            """
            if treebank_tag.startswith('J'):
                return wordnet.ADJ
            elif treebank_tag.startswith('V'):
                return wordnet.VERB
            elif treebank_tag.startswith('N'):
                return wordnet.NOUN
            elif treebank_tag.startswith('R'):
                return wordnet.ADV
            else:
                # the default POS in lemmatization is NOUN
                return wordnet.NOUN

        def pos_tag(self, tokens):
            # find the POS tag for each token: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP'), ...
            pos_tokens = [nltk.pos_tag(token) for token in tokens]

            # lemmatization using the POS tag
            # convert into a feature set of [('What', 'What', ['WP']), ('can', 'can', ['MD']), ...
            # i.e. [original word, lemmatized word, POS tag]
            pos_tokens = [[(word, lemmatizer.lemmatize(word, self.get_wordnet_pos(pos_tag)), [pos_tag])
                           for (word, pos_tag) in pos] for pos in pos_tokens]
            return pos_tokens

    lemmatizer = WordNetLemmatizer()
    splitter = Splitter()
    lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

    # step 1: split the document into sentences, then tokenize
    tokens = splitter.split(text)

    # step 2: lemmatization using the POS tagger
    lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
    print(lemma_pos_token)
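For reference, the printed result is a list of sentences, each a list of (original word, lemma, [POS tag]) triples, roughly like this (exact tags depend on the tagger):

    [[('What', 'What', ['WP']), ('can', 'can', ['MD']), ('I', 'I', ['PRP']),
      ('say', 'say', ['VB']), ('about', 'about', ['IN']), ('this', 'this', ['DT']),
      ('place', 'place', ['NN']), ('.', '.', ['.'])],
     [('The', 'The', ['DT']), ('staff', 'staff', ['NN']), ...]]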