DocumentTermMatrix error on Corpus argument

I have the following code:

    # returns string w/o leading or trailing whitespace
    trim <- function (x) gsub("^\\s+|\\s+$", "", x)

    news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.

    corpus_clean <- tm_map(news_corpus, tolower)
    corpus_clean <- tm_map(corpus_clean, removeNumbers)
    corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
    corpus_clean <- tm_map(corpus_clean, removePunctuation)
    corpus_clean <- tm_map(corpus_clean, stripWhitespace)
    corpus_clean <- tm_map(corpus_clean, trim)

    news_dtm <- DocumentTermMatrix(corpus_clean) # errors here

When I run the DocumentTermMatrix() call, it gives me this error:

    Error: inherits(doc, "TextDocument") is not TRUE

Why am I getting this error? Are my rows not text documents?

Here is the output from inspecting corpus_clean:

    [[153]]
    [1] obama holds technical school model us

    [[154]]
    [1] oil boom produces jobs bonanza archaeologists

    [[155]]
    [1] islamic terrorist group expands territory captures tikrit

    [[156]]
    [1] republicans democrats feel eric cantors loss

    [[157]]
    [1] tea party candidates try build cantor loss

    [[158]]
    [1] vehicles materials stored delaware bridges

    [[159]]
    [1] hill testimony hagel defends bergdahl trade

    [[160]]
    [1] tweet selfpropagates tweetdeck

    [[161]]
    [1] blackwater guards face trial iraq shootings

    [[162]]
    [1] calif man among soldiers killed afghanistan

    [[163]]
    [1] stocks fall back world bank cuts growth outlook

    [[164]]
    [1] jabhat alnusra longer useful turkey

    [[165]]
    [1] catholic bishops keep focus abortion marriage

    [[166]]
    [1] barbra streisand visits hill heart disease

    [[167]]
    [1] rand paul cantors loss reason stop talking immigration

    [[168]]
    [1] israeli airstrike kills northern gaza

Edit: here is my data:

    type,text
    neutral,The week in 32 photos
    neutral,Look at me! 22 selfies of the week
    neutral,Inside rebel tunnels in Homs
    neutral,Voices from Ukraine
    neutral,Water dries up ahead of World Cup
    positive,Who's your hero? Nominate them
    neutral,Anderson Cooper: Here's how
    positive,"At fire scene, she rescues the pet"
    neutral,Hunger in the land of plenty
    positive,Helping women escape 'the life'
    neutral,A tour of the sex underworld
    neutral,Miss Universe Thailand steps down
    neutral,China's 'naked officials' crackdown
    negative,More held over Pakistan stoning
    neutral,Watch landmark Cold War series
    neutral,In photos: History of the Cold War
    neutral,Turtle predicts World Cup winner
    neutral,What devoured great white?
    positive,Nun wins Italy's 'The Voice'
    neutral,Bride Price app sparks debate
    neutral,China to deport 'pork' artist
    negative,Lightning hits moving car
    neutral,Singer won't be silenced
    neutral,Poland's mini desert
    neutral,When monarchs retire
    negative,Murder on Street View?
    positive,Meet armless table tennis champ
    neutral,Incredible 400 year-old globes
    positive,Man saves falling baby
    neutral,World's most controversial foods

I load it like this:

 news_raw <- read.csv('news_csv.csv', stringsAsFactors = F) 

Edit: here is the traceback():

    > news_dtm <- DocumentTermMatrix(corpus_clean)
    Error: inherits(doc, "TextDocument") is not TRUE
    > traceback()
    9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
           ch), call. = FALSE, domain = NA)
    8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
    7: FUN(X[[1L]], ...)
    6: lapply(X, FUN, ...)
    5: mclapply(unname(content(x)), termFreq, control)
    4: TermDocumentMatrix.VCorpus(x, control)
    3: TermDocumentMatrix(x, control)
    2: t(TermDocumentMatrix(x, control))
    1: DocumentTermMatrix(corpus_clean)

When I evaluate inherits(corpus_clean, "TextDocument"), it is FALSE.

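It looks like this worked fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim do not necessarily return TextDocuments (the older version may have done the conversion automatically). They return plain character vectors instead, and DocumentTermMatrix does not know how to handle a corpus of characters.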
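The class difference described above can be checked directly. The following is a minimal sketch, assuming tm 0.6 or later is installed; the two sample headlines are made up for illustration and are not from the question's data.

```r
library(tm)

# Hypothetical sample input, standing in for news_raw$text
docs <- c("Obama Holds Technical School", "Oil Boom Produces Jobs")
corpus <- VCorpus(VectorSource(docs))

# Each element of a freshly built VCorpus is a document object
inherits(corpus[[1]], "TextDocument")

# content_transformer() wraps a plain character function so that it
# modifies the document's content and returns the document itself
lowered <- tm_map(corpus, content_transformer(tolower))
inherits(lowered[[1]], "TextDocument")

# The text has been lower-cased, and the document class is intact
content(lowered[[1]])
```

Note that the check is made on an element (`corpus[[1]]`), not on the corpus as a whole: the corpus is a container, so `inherits(corpus, "TextDocument")` is not the relevant test.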

So you can change your call to

 corpus_clean <- tm_map(news_corpus, content_transformer(tolower)) 

or you can run

 corpus_clean <- tm_map(corpus_clean, PlainTextDocument) 

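after all of your non-standard transformations (those not listed in getTransformations()) are done and just before you create the DocumentTermMatrix. That should ensure all of your data is in PlainTextDocument form and should make DocumentTermMatrix happy.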
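Putting the first option together, here is a sketch of the whole cleaning pipeline with the two non-standard functions (tolower and trim) wrapped in content_transformer(). It assumes tm 0.6 or later; the `headlines` vector is a hypothetical stand-in for `news_raw$text`, and VCorpus() is used explicitly rather than Corpus().

```r
library(tm)

# Hypothetical stand-in for news_raw$text
headlines <- c("The week in 32 photos",
               "  Man saves falling baby ",
               "Lightning hits moving car")

# returns string w/o leading or trailing whitespace
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

corpus <- VCorpus(VectorSource(headlines))

# tolower and trim are not tm transformations, so wrap them
corpus <- tm_map(corpus, content_transformer(tolower))
# The next four are listed in getTransformations() and need no wrapper
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(trim))

dtm <- DocumentTermMatrix(corpus)
dim(dtm)  # documents x terms
```

Only the functions that do not come from tm need the wrapper, which keeps the change to the original code small.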

I found a way to solve this problem in an article about tm.

An example that produces the problem:

    getwd()
    require(tm)
    files <- DirSource(directory="texts/", encoding="latin1") # import files
    corpus <- VCorpus(x=files) # load files, create corpus
    summary(corpus) # get a summary
    corpus <- tm_map(corpus,removePunctuation)
    corpus <- tm_map(corpus,stripWhitespace)
    corpus <- tm_map(corpus,removePunctuation);
    matrix_terms <- DocumentTermMatrix(corpus)

Warning message:

    In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

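This happens because building the term-document matrix requires an object of class VectorSource, but the preceding transformations turn the corpus texts into plain character, which changes the class to one the function does not accept.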

However, if you wrap the function in content_transformer inside the tm_map call, you may not need the extra command before calling TermDocumentMatrix.

The code below changes the class (see the line marked `# change class`) and avoids the error:

    getwd()
    require(tm)
    files <- DirSource(directory="texts/", encoding="latin1")
    corpus <- VCorpus(x=files) # load files, create corpus
    summary(corpus) # get a summary
    corpus <- tm_map(corpus,content_transformer(removePunctuation))
    corpus <- tm_map(corpus,content_transformer(stripWhitespace))
    corpus <- tm_map(corpus,content_transformer(removePunctuation))
    corpus <- Corpus(VectorSource(corpus)) # change class
    matrix_term <- DocumentTermMatrix(corpus)

Change this:

 corpus_clean <- tm_map(news_corpus, tolower) 

to this:

 corpus_clean <- tm_map(news_corpus, content_transformer(tolower)) 

This should work.

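    # Alternative: downgrade to tm 0.5-10 (Windows binary built for R 3.0)
    remove.packages("tm") # package name must be a quoted string
    install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip",repos=NULL)
    library(tm)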