Tag: corpus

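Programmatically install NLTK corpora/models, i.e. without the GUI downloader?: My project uses NLTK. How can I list the project's corpus and model requirements so that they can be installed automatically? I don't want to click through the nltk.download() GUI, installing packages one by one. Also, is there any way to freeze that same list of requirements (like pip freeze)?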
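
One possible approach, sketched under assumptions: keep the NLTK package ids in a plain text file and pass each one to `nltk.download()`, which accepts a package id and so bypasses the GUI. The file name `nltk_requirements.txt` and the helpers `read_nltk_requirements` and `install_nltk_packages` are hypothetical names chosen for illustration; they are not part of NLTK.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def read_nltk_requirements(path: str) -> List[str]:
    """Read package ids from a text file, one per line; '#' starts a comment."""
    ids = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            ids.append(name)
    return ids


def install_nltk_packages(
    package_ids: Iterable[str],
    download: Optional[Callable[..., bool]] = None,
) -> List[str]:
    """Download each package id and return the ids that failed.

    `download` defaults to nltk.download; it is a parameter only so the
    logic can be exercised without network access.
    """
    if download is None:
        import nltk  # imported lazily so the file parser works without NLTK

        download = nltk.download
    failed = []
    for package_id in package_ids:
        # nltk.download returns True on success and False otherwise.
        if not download(package_id, quiet=True):
            failed.append(package_id)
    return failed


if __name__ == "__main__":
    # Hypothetical file listing ids such as punkt, stopwords, movie_reviews.
    missing = install_nltk_packages(read_nltk_requirements("nltk_requirements.txt"))
    if missing:
        raise SystemExit("Could not download: " + ", ".join(missing))
```

The same ids can also be passed on the command line, for example `python -m nltk.downloader punkt stopwords`. This file only pins package names, not versions, so it is a looser analogue of `pip freeze` rather than an equivalent.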

DocumentTermMatrix error on Corpus argument: I have the following code:
    # returns string w/o leading or trailing whitespace
    trim <- function (x) gsub("^\\s+|\\s+$", "", x)
    news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.
    corpus_clean <- tm_map(news_corpus, tolower)
    corpus_clean <- tm_map(corpus_clean, removeNumbers)
    corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
    corpus_clean <- tm_map(corpus_clean, removePunctuation)
    corpus_clean <- tm_map(corpus_clean, stripWhitespace)
    corpus_clean <- tm_map(corpus_clean, trim)
    news_dtm <- DocumentTermMatrix(corpus_clean) # errors here
[…]

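How to change the default MySQL connection timeout when connecting through Python?: I connect to a MySQL database from Python with con = _mysql.connect('localhost', 'dell-pc', '', 'test'). The program I wrote takes a long time to run to completion, about 10 hours. What it does is read the distinct words out of a corpus. After the reading finishes, there is a timeout error. I checked the MySQL default timeout values:
    +----------------------------+----------+
    | Variable_name              | Value    |
    +----------------------------+----------+
    | connect_timeout            | 10       |
    | delayed_insert_timeout     | 300      |
    | innodb_lock_wait_timeout   | 50       |
    | innodb_rollback_on_timeout | OFF      |
    | interactive_timeout        | 28800    |
    | lock_wait_timeout          | 31536000 |
    | net_read_timeout           | 30       |
    | net_write_timeout          | […]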
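
A hedged sketch of one way to raise timeouts for a single connection: issue `SET SESSION` statements right after connecting. The excerpt does not say which variable caused the error, so treating `wait_timeout` as the culprit is an assumption. The helpers `timeout_statements` and `apply_timeouts` are hypothetical names, and the sketch assumes a DB-API style connection with `cursor()` (for example from `MySQLdb.connect`), not the low-level `_mysql` object used in the question.

```python
from typing import Dict, List

# Timeout variables this sketch agrees to set at session scope.
# Variable names cannot be sent as query parameters, so they are
# checked against an allow-list before being placed in the SQL text.
ALLOWED_TIMEOUTS = {
    "wait_timeout",
    "interactive_timeout",
    "net_read_timeout",
    "net_write_timeout",
}


def timeout_statements(timeouts: Dict[str, int]) -> List[str]:
    """Build one SET SESSION statement per requested timeout (in seconds)."""
    statements = []
    for name, seconds in timeouts.items():
        if name not in ALLOWED_TIMEOUTS:
            raise ValueError("unsupported timeout variable: " + name)
        statements.append("SET SESSION {} = {}".format(name, int(seconds)))
    return statements


def apply_timeouts(connection, timeouts: Dict[str, int]) -> None:
    """Run the SET SESSION statements on an open DB-API connection."""
    cursor = connection.cursor()
    try:
        for statement in timeout_statements(timeouts):
            cursor.execute(statement)
    finally:
        cursor.close()
```

For a job of roughly 10 hours, a call such as `apply_timeouts(con, {"wait_timeout": 12 * 3600})` would be one option. Another is to open the connection only after the long corpus-reading phase has finished, so that it never sits idle.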

Classification using the movie review corpus in NLTK/Python: I'm looking to do some classification along the lines of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here, with the response following. My issues stem primarily from the first part, category creation based on directory names. Some other questions here have used filenames (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories I can dump files into.
    from nltk.corpus import movie_reviews
    reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
    reviews.categories()
    ['pos', 'neg']
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    all_words = nltk.FreqDist(
        w.lower() for w in movie_reviews.words()
        if w.lower() not in nltk.corpus.stopwords.words('english')
        and w.lower() not in string.punctuation)
    word_features = all_words.keys()[:100]
    def document_features(document):
        document_words = […]
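
To illustrate directory-based categories, here is a minimal sketch. It assumes a layout such as `pos/cv000.txt` and `neg/cv001.txt` under the corpus root. The patterns below are not the ones in the question: as far as I know, NLTK matches `cat_pattern` against the start of each file id and takes the first group as the category, so a pattern with a leading slash would not match. `category_of` and `load_reviews` are hypothetical helper names.

```python
import re
from typing import Optional

# File ids are paths relative to the corpus root, e.g. "pos/cv000.txt".
FILEID_PATTERN = r"\w+/.*\.txt"
# The first group captures the directory name, which becomes the category.
CAT_PATTERN = r"(\w+)/.*"


def category_of(fileid: str) -> Optional[str]:
    """Return the category the pattern assigns to a file id, or None."""
    match = re.match(CAT_PATTERN, fileid)
    return match.group(1) if match else None


def load_reviews(root: str):
    """Build a categorized reader over `root` using the patterns above."""
    # Imported lazily so the pattern helper can be tried without NLTK.
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    return CategorizedPlaintextCorpusReader(
        root, FILEID_PATTERN, cat_pattern=CAT_PATTERN
    )
```

With this layout, dropping a new file into the `pos` or `neg` directory should be enough for it to be picked up under that category the next time the reader is created.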

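Creating a new corpus with NLTK: I thought the answer to my title would usually be to go and read the documentation, but I went through the NLTK book and it doesn't give the answer. I'm fairly new to Python. I have a bunch of .txt files, and I want to be able to use the corpus functions that NLTK provides for the nltk_data corpora. I tried PlaintextCorpusReader, but I couldn't get further than:
    >>> import nltk
    >>> from nltk.corpus import PlaintextCorpusReader
    >>> corpus_root = './'
    >>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
    >>> newcorpus.words()
How do I segment the new corpus into sentences using punkt? I tried using the punkt functions, but they couldn't read the PlaintextCorpusReader class. Can you also show me how to write the segmented data to text files? Edit: This question had a bounty once, and it now has a second bounty. See the text in the bounty box.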
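
A hedged sketch of the segmentation step: `PlaintextCorpusReader.sents()` yields sentences as lists of tokens and, as far as I know, uses NLTK's Punkt English model by default, which therefore has to be downloaded first. The helpers `write_sentences` and `segment_corpus`, and the output file name, are hypothetical names used only for illustration.

```python
from typing import Iterable, List


def write_sentences(sentences: Iterable[List[str]], out_path: str) -> int:
    """Write one space-joined sentence per line; return the sentence count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            handle.write(" ".join(tokens) + "\n")
            count += 1
    return count


def segment_corpus(corpus_root: str, out_path: str) -> int:
    """Read every .txt file under corpus_root and write its sentences."""
    # Imported lazily so write_sentences can be used without NLTK installed.
    from nltk.corpus import PlaintextCorpusReader

    corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")
    # sents() is an assumption about the default tokenizer: it needs the
    # Punkt model to be available in nltk_data.
    return write_sentences(corpus.sents(), out_path)
```

A call such as `segment_corpus("./", "sentences.out")` would then write all sentences to a single file. The output name deliberately avoids the `.txt` extension so that a second run does not read the output back in as corpus input.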