
What are some powerful tools for text manipulation and pre-processing in R?

Tags: r, sed, awk

I frequently use Hadley's stringr package to clean up messy ecological data (normalizing species names, poorly formatted labels, etc.). Recently I began learning sed and awk and am blown away by how powerful those tools are, especially when dealing with numerous data files.

My questions:

  1. Are there other powerful text handling packages (outside of base functions, and those in stringr) that would be useful for data cleaning?

  2. Would it be possible to run sed commands/scripts from within R? If so, how? Can you give me an example?

  3. Has anyone attempted to write a wrapper for sed as an R package? If not, would that be something worth pursuing (a side project for myself or more competent programmers)?

Asked Nov 13 '11 by Maiasaura


1 Answer

First, regarding sed and awk, I've not generally had a need for them, as they're particularly old school. I often write regular expressions in Perl, and achieve the same thing, with somewhat easier readability. I don't mean to debate the merits of implementation, but when I'm not writing such functions in Perl, I find that gsub, grep, and related regular expression tools work quite well in R. Note that these can take perl = TRUE as an argument; I prefer Perl regex handling.
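
For instance, here is a minimal sketch of both approaches; the example labels and file names are made up, and the sed call assumes a Unix-like system with sed on the PATH:

    # Base R regular expressions; perl = TRUE switches to PCRE semantics.
    labels <- c("Panthera  leo", "panthera_leo", "Panthera leo ssp.")

    gsub("[_\\s]+", " ", labels, perl = TRUE)                       # collapse underscores/whitespace
    grepl("^panthera\\b", labels, ignore.case = TRUE, perl = TRUE)  # match the genus

    # Question 2: sed can also be driven from R by shelling out, e.g. with system2().
    # "input.txt" and "cleaned.txt" are placeholder file names.
    system2("sed", args = c(shQuote("s/_/ /g"), "input.txt"), stdout = "cleaned.txt")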

Regarding much more serious packages, the tm package is particularly notable. For more coverage of natural language processing and text mining resources, check out the CRAN Task View for NLP.
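
As a quick, hedged taste of tm (the documents below are invented), building and cleaning a small corpus looks roughly like this:

    library(tm)

    docs <- c("Messy ECOLOGICAL data, with   stray punctuation!",
              "Another document; more text to clean.")

    corpus <- VCorpus(VectorSource(docs))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    dtm <- DocumentTermMatrix(corpus)  # term counts for downstream text mining
    inspect(dtm)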

Also, I think your question title conflates two concepts. Tools like sed & awk, regular expressions, tokenization, etc. are important pieces of text manipulation and pre-processing. Text mining is more statistical and depends upon effective pre-processing and quantification of the text data. Although not mentioned, two subsequent stages of analysis, information retrieval and natural language processing, are research & engineering areas with more specific aims.

If you're primarily interested in text manipulation, the various tools for applying regular expressions and pre-processing / normalization should suffice. If you want to do text mining, you'll need to look into the more statistical functions. For NLP, tools that do somewhat deeper analyses will be necessary. All are accessible from within R, but the question is how far you want to go down this rabbit hole. Wanna swallow the red pill?
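
To make the "regular expressions and normalization should suffice" point concrete, here is a small sketch with stringr (a recent version; the species labels are made up):

    library(stringr)

    species <- c("  panthera_leo ", "PANTHERA  LEO", "Panthera.leo")

    species <- str_replace_all(species, "[._]+", " ")  # unify separators
    species <- str_squish(species)                     # trim and collapse whitespace
    species <- str_to_sentence(species)                # "Panthera leo"-style capitalisation
    species
    # [1] "Panthera leo" "Panthera leo" "Panthera leo"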

Answered Oct 04 '22 by Iterator