Creating a new corpus with NLTK

I figured the answer to my question would be to go read the documentation, but I went through the NLTK book and it doesn't give the answer. I'm fairly new to Python.

I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for its built-in corpora in nltk_data.

I've tried PlaintextCorpusReader, but I couldn't get further than:

    >>> import nltk
    >>> from nltk.corpus import PlaintextCorpusReader
    >>> corpus_root = './'
    >>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
    >>> newcorpus.words()

How do I segment the sentences of the new corpus with punkt? I tried the punkt functions, but they couldn't read the PlaintextCorpusReader class?

Can you also lead me to how I can write the segmented data into text files?


I think the PlaintextCorpusReader already segments its input with a punkt tokenizer, at least if your input language is English.
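In other words, sentence-level access should already work out of the box. A minimal sketch, assuming newcorpus is the reader built in the question and the fileid is hypothetical:

    >>> newcorpus.sents()              # sentences as lists of tokens, split by the default punkt model
    >>> newcorpus.sents('file1.txt')   # the same, restricted to one (hypothetical) fileid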

PlaintextCorpusReader's constructor:

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):

You can pass the reader a word tokenizer and a sentence tokenizer, but for the latter the default is already nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
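For illustration, here is a sketch of passing both tokenizers explicitly; it simply restates the defaults shown in the constructor above, and the corpus directory and file pattern are assumptions:

    import nltk.data
    from nltk.tokenize import WordPunctTokenizer
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    # Explicitly passing the tokenizers the reader would use anyway.
    newcorpus = PlaintextCorpusReader(
        './', r'.*\.txt',
        word_tokenizer=WordPunctTokenizer(),
        sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'))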

For a single string, a tokenizer would be used as follows (explained here; see section 5 on the punkt tokenizer).

    >>> import nltk.data
    >>> text = """
    ... Punkt knows that the periods in Mr. Smith and Johann S. Bach
    ... do not mark sentence boundaries. And sometimes sentences
    ... can start with non-capitalized words. i is a good variable
    ... name.
    ... """
    >>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    >>> tokenizer.tokenize(text.strip())
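For reference, the tokenize() call above should split the text into three sentences (each returned string keeps the line breaks from the original text), roughly:

    Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.
    And sometimes sentences can start with non-capitalized words.
    i is a good variable name.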

After several years of figuring out how it works, here's the updated tutorial of:

How to create an NLTK corpus from a directory of text files?

The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it's best to use the PlaintextCorpusReader.

If you have a directory that looks like this:

    newcorpus/
        file1.txt
        file2.txt
        ...

Simply use these lines of code and you can get a corpus:

    import os
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    corpusdir = 'newcorpus/'  # Directory of corpus.

    newcorpus = PlaintextCorpusReader(corpusdir, '.*')

Note: the PlaintextCorpusReader will use the default nltk.tokenize.sent_tokenize() and nltk.tokenize.word_tokenize() to split your texts into sentences and words. These functions are built for English, so they may NOT work for all languages.
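If your texts are in another language for which NLTK ships a punkt model, one option is to swap in that model when constructing the reader. A sketch, assuming the punkt data package is installed and using a hypothetical directory of German texts:

    import nltk.data
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    # The punkt data package ships models for several languages, e.g. German.
    german_corpus = PlaintextCorpusReader(
        'newcorpus_de/', r'.*\.txt',   # hypothetical directory of German .txt files
        sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/german.pickle'))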

Below is the full code: how to create the test text files, how to create a corpus with NLTK, and how to access the corpus at different levels:

    # NOTE: this snippet uses Python 2 print syntax.
    import os
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    # Let's create a corpus with 2 texts in different textfiles.
    txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
    txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
    corpus = [txt1, txt2]

    # Make new dir for the corpus.
    corpusdir = 'newcorpus/'
    if not os.path.isdir(corpusdir):
        os.mkdir(corpusdir)

    # Output the files into the directory.
    filename = 0
    for text in corpus:
        filename += 1
        with open(corpusdir + str(filename) + '.txt', 'w') as fout:
            print>>fout, text

    # Check that our corpus does exist and the files are correct.
    assert os.path.isdir(corpusdir)
    for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
        assert open(corpusdir + infile, 'r').read().strip() == text.strip()

    # Create a new corpus by specifying the parameters
    # (1) directory of the new corpus
    # (2) the fileids of the corpus
    # NOTE: in this case the fileids are simply the filenames.
    newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

    # Access each file in the corpus.
    for infile in sorted(newcorpus.fileids()):
        print infile                         # The fileid of each file.
        with newcorpus.open(infile) as fin:  # Opens the file.
            print fin.read().strip()         # Prints the content of the file.
    print

    # Access the plaintext; outputs pure string/basestring.
    print newcorpus.raw().strip()
    print

    # Access paragraphs in the corpus. (list of list of list of strings)
    # NOTE: NLTK automatically calls nltk.tokenize.sent_tokenize and
    #       nltk.tokenize.word_tokenize.
    #
    # Each element in the outermost list is a paragraph, and
    # each paragraph contains sentence(s), and
    # each sentence contains token(s).
    print newcorpus.paras()
    print

    # To access paragraphs of a specific fileid.
    print newcorpus.paras(newcorpus.fileids()[0])

    # Access sentences in the corpus. (list of list of strings)
    # NOTE: the texts are flattened into sentences that contain tokens.
    print newcorpus.sents()
    print

    # To access sentences of a specific fileid.
    print newcorpus.sents(newcorpus.fileids()[0])

    # Access just tokens/words in the corpus. (list of strings)
    print newcorpus.words()

    # To access tokens of a specific fileid.
    print newcorpus.words(newcorpus.fileids()[0])
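The original question also asked how to write the segmented data back out to text files. A minimal sketch building on the newcorpus reader and corpusdir above (the output filenames are just an assumption), writing one whitespace-joined sentence per line:

    # Write the sentence-segmented, tokenized text back to disk,
    # one sentence per line, using the reader defined above.
    for fileid in newcorpus.fileids():
        with open(corpusdir + fileid + '.segmented', 'w') as fout:
            for sent in newcorpus.sents(fileid):
                fout.write(' '.join(sent) + '\n')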

Finally, to read a directory of texts and create an NLTK corpus in another language, you must first ensure that you have Python-callable word tokenization and sentence tokenization modules that take string/basestring input and produce output like this:

    >>> from nltk.tokenize import sent_tokenize, word_tokenize
    >>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
    >>> sent_tokenize(txt1)
    ['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
    >>> word_tokenize(sent_tokenize(txt1)[0])
    ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
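Once you have such callables for your language, they can be wired into the reader. PlaintextCorpusReader expects tokenizer objects with a .tokenize() method rather than bare functions, so a thin wrapper is one way to adapt them. A sketch, where my_sent_tokenize and my_word_tokenize are hypothetical stand-ins for your own tokenizers:

    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    def my_sent_tokenize(text):   # placeholder: replace with your sentence tokenizer
        return [s for s in text.split('\n') if s]

    def my_word_tokenize(text):   # placeholder: replace with your word tokenizer
        return text.split()

    class CallableTokenizer(object):
        """Adapts a plain callable to the .tokenize() interface the reader calls."""
        def __init__(self, func):
            self.func = func
        def tokenize(self, text):
            return self.func(text)

    newcorpus = PlaintextCorpusReader(
        'newcorpus/', r'.*\.txt',
        word_tokenizer=CallableTokenizer(my_word_tokenize),
        sent_tokenizer=CallableTokenizer(my_sent_tokenize))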
    >>> import nltk
    >>> from nltk.corpus import PlaintextCorpusReader
    >>> corpus_root = './'
    >>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
    """
    If the ./ dir contains the file my_corpus.txt, then you can view, say, all the words in it by doing this:
    """
    >>> newcorpus.words('my_corpus.txt')