
What are some powerful tools for text manipulation and pre-processing in R?

Tags: r, sed, awk

I frequently use Hadley's stringr package to clean up messy ecological data (normalizing species names, poorly formatted labels, etc.). Recently I began learning sed and awk and am blown away by how powerful those tools are, especially when dealing with numerous data files.

My questions:

  1. Are there other powerful text handling packages (outside of base functions, and those in stringr) that would be useful for data cleaning?

  2. Would it be possible to run sed commands/scripts from within R? If so, how? Can you give me an example?

  3. Has anyone attempted to write a wrapper for sed as an R package? If not, would that be something worth pursuing (a side project for myself or more competent programmers)?

Asked Nov 13 '11 by Maiasaura


1 Answer

First, regarding sed and awk, I've not generally had a need for them, as they're particularly old school. I often write regular expressions in Perl, and achieve the same thing, with somewhat easier readability. I don't mean to debate the merits of implementation, but when I'm not writing such functions in Perl, I find that gsub, grep, and related regular expression tools work quite well in R. Note that these can take perl = TRUE as an argument; I prefer Perl regex handling.
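
For instance, here is a minimal sketch of both approaches; the example labels and file names are made up, and the sed call assumes a Unix-like system with sed on the PATH:

    # Base R regular expressions; perl = TRUE switches to PCRE semantics.
    labels <- c("Panthera  leo", "panthera_leo", "Panthera leo ssp.")

    gsub("[_\\s]+", " ", labels, perl = TRUE)                       # collapse underscores/whitespace
    grepl("^panthera\\b", labels, ignore.case = TRUE, perl = TRUE)  # match the genus

    # Question 2: sed can also be driven from R by shelling out, e.g. with system2().
    # "input.txt" and "cleaned.txt" are placeholder file names.
    system2("sed", args = c(shQuote("s/_/ /g"), "input.txt"), stdout = "cleaned.txt")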

Regarding much more serious packages, the tm package is particularly notable. For more coverage of natural language processing and text mining resources, check out the CRAN Task View for NLP.
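
As a quick, hedged taste of tm (the documents below are invented), building and cleaning a small corpus looks roughly like this:

    library(tm)

    docs <- c("Messy ECOLOGICAL data, with   stray punctuation!",
              "Another document; more text to clean.")

    corpus <- VCorpus(VectorSource(docs))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    dtm <- DocumentTermMatrix(corpus)  # term counts for downstream text mining
    inspect(dtm)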

Also, I think your question title conflates two concepts. Tools like sed & awk, regular expressions, tokenization, etc. are important pieces of text manipulation and pre-processing. Text mining is more statistical and depends upon effective pre-processing and quantification of the text data. Although not mentioned, two subsequent stages of analysis, information retrieval and natural language processing, are research & engineering areas with more specific aims.

If you're primarily interested in text manipulation, the various tools for applying regular expressions and pre-processing / normalization should suffice. If you want to do text mining, you'll need to look into the more statistical functions. For NLP, tools that do somewhat deeper analyses will be necessary. All are accessible from within R, but the question is how far you want to go down this rabbit hole. Wanna swallow the red pill?
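
To make the "regular expressions and normalization should suffice" point concrete, here is a small sketch with stringr (a recent version; the species labels are made up):

    library(stringr)

    species <- c("  panthera_leo ", "PANTHERA  LEO", "Panthera.leo")

    species <- str_replace_all(species, "[._]+", " ")  # unify separators
    species <- str_squish(species)                     # trim and collapse whitespace
    species <- str_to_sentence(species)                # "Panthera leo"-style capitalisation
    species
    # [1] "Panthera leo" "Panthera leo" "Panthera leo"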

Answered Oct 04 '22 by Iterator