Thursday, December 01, 2016

RTextTools demo error


There is a nifty little "one-stop-shopping" text analytics package in R called "RTextTools" that was created by Timothy P. Jurka from UC-Davis.  The package lets you run a whole variety of text classification algorithms automatically and with ease.

However, it appears (though I cannot say this for sure) that if you use it in R version 3.3.0 or later, there is a bug, at least in the demo.  It apparently does not occur in earlier versions of R.  The demo asks you to do the following:

library(RTextTools)
data(USCongress)

doc_matrix <- create_matrix(USCongress$text,
                            language = "english", 
                            removeNumbers = TRUE, 
                            stemWords = TRUE, 
                            removeSparseTerms = .998)

The problem is that the following error occurs.

Error in if (any(lens > lim)) stop("There is a limit of ", lim, "characters on the number of characters in a word being stemmed") :  missing value where TRUE/FALSE needed

One astute user (lukeA) on Stack Overflow discovered that the characters " NA " in the strings in your text fields are converted to an actual R NA (a missing value), which leaves the length check in the error message above with no TRUE/FALSE answer.

http://stackoverflow.com/questions/38199396/stemming-words-in-r-missing-value
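To see why a single NA is enough to trigger that message, here is a minimal base-R sketch.  The names lens and lim are taken from the error text; the sample words, the limit of 255, and the use of nchar() to get the lengths are my own assumptions for illustration, not necessarily what the stemmer does internally.

```r
# Hypothetical tokens; the NA stands in for a word that became a missing value.
words <- c("bill", NA, "senate")

# In R 3.3.0 and later, nchar() returns NA for a missing string.
lens <- nchar(words)
lim <- 255L  # illustrative limit only

# No length exceeds the limit, so any() cannot decide and returns NA.
cond <- any(lens > lim)
print(cond)

# if (NA) is what raises "missing value where TRUE/FALSE needed".
msg <- tryCatch(
  if (cond) "too long" else "ok",
  error = function(e) conditionMessage(e)
)
print(msg)
```

If the missing value is removed (or never created), the condition evaluates to FALSE and the check passes quietly.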

Therefore, to fix the demo, you have to eliminate the two records in USCongress that have an NA lurking in the text.  These are records 3674 and 3675.  Prior to the create_matrix statement, you can do this with:

USCongress <- USCongress[-c(3674, 3675), ]

Once that line is there, the create_matrix call works, and you can continue with the demo.
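If you would rather not hard-code row numbers (say, for your own data), one possible approach is to search the text for a standalone "NA" token and drop whatever matches.  This is only a sketch: the docs data frame below is a made-up stand-in for USCongress, and the regular expression is my guess at a reasonable pattern, so check what it flags before deleting anything.

```r
# Toy stand-in for USCongress; the real data set has many more rows and columns.
docs <- data.frame(
  text = c("A bill to amend the tax code",
           "To provide funds for NA and other purposes",
           "A resolution honoring veterans"),
  stringsAsFactors = FALSE
)

# Flag rows whose text contains "NA" as a whole word.
has_na_token <- grepl("\\bNA\\b", docs$text, perl = TRUE)
print(which(has_na_token))

# Keep only the rows without the token.
docs_clean <- docs[!has_na_token, , drop = FALSE]
print(nrow(docs_clean))
```

The same two steps should carry over to the real data by swapping docs for USCongress, though I have only tried the idea on the toy rows above.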

Follow me on Twitter: @bioniclime