Python or awk/sed for cleaning data [closed]

Tags: python, r, sed, awk, data-cleaning

I use R for data analysis and am very happy with it. Cleaning data could be a bit easier, however. I am thinking about learning another language suited to this task. Specifically, I am looking for a tool to use to take raw data, remove unnecessary variables or observations, and format it for easy loading in R. Contents would be mostly numeric and string data, as opposed to multi-line text.

I am considering the awk/sed combination versus Python. (I recognize that Perl would be another option, but if I were going to learn another full language, Python seems to be the better, more extensible choice.)

The advantage of sed/awk is that it would be quicker to learn. The disadvantage is that this combination isn't as extensible as Python. Indeed, I might imagine some "mission creep" if I learned Python, which would be fine, but not my goal.

The other consideration that I had is applications to large data sets. As I understand it, awk/sed operate line-by-line, while Python would typically pull all the data into memory. This could be another advantage for sed/awk.
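On the memory point, Python does not have to read a whole file at once: iterating over a file object yields one line at a time, much as awk does. Below is a minimal sketch of that style of cleaning, assuming comma-separated input with a header row. The column names, the sample data, and the helper names `clean_rows` and `clean_stream` are hypothetical, made up for illustration.

```python
import csv
import io


def clean_rows(lines, keep_columns, row_filter):
    """Yield cleaned rows one at a time; only the current row is held in memory."""
    reader = csv.DictReader(lines)
    for row in reader:
        if row_filter(row):
            # Keep only the wanted variables and trim stray whitespace.
            yield {name: row[name].strip() for name in keep_columns}


def clean_stream(src, dst, keep_columns, row_filter):
    """Read rows from src, drop unwanted columns and rows, write CSV to dst."""
    writer = csv.DictWriter(dst, fieldnames=keep_columns, lineterminator="\n")
    writer.writeheader()
    for row in clean_rows(src, keep_columns, row_filter):
        writer.writerow(row)


# Hypothetical raw data; with a real file, pass open("raw.csv", newline="") instead.
raw = io.StringIO(
    "id,name,age,notes\n"
    "1, Ann ,34,ok\n"
    "2,Bob,,missing age\n"
    "3,Cy,51,ok\n"
)
out = io.StringIO()

# Drop the "notes" column and any observation with an empty age.
clean_stream(raw, out, ["id", "name", "age"], lambda r: r["age"].strip() != "")
result = out.getvalue()
print(result)
```

The resulting file is plain CSV, so it can be loaded in R with `read.csv`.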

Are there other issues that I'm missing? Any advice that you can offer would be appreciated. (I included the R tag for R users to offer their cleaning recommendations.)


asked Sep 20 '11 03:09

Charlie

2 Answers

Not to spoil your adventure, but I'd say no and here is why:

R is vectorised, whereas sed and awk are not
R already supports both Perl-style and extended regular expressions
R can more easily fall back on statistical routines (say, imputation) if you need them
R can visualize, summarize, ...

and most importantly: you already know R.

That said, sed and awk are of course great for small programs or even one-liners, and Python is a fine language. But I would also consider sticking with R.


answered Sep 30 '22 08:09

Dirk Eddelbuettel

I use Python and Perl regularly. I know sed fairly well and once used awk a lot. I've used R in fits and starts. Of the bunch, Perl is the best for data transformation, in both functionality and speed.

Perl can do essentially everything sed and awk can do, and a lot more as well. (In fact, a2p and s2p, which historically shipped with Perl and were moved out of the core distribution to CPAN in Perl 5.22, convert awk and sed scripts to Perl.)
Perl is included with most Linux/Unix systems. When that wasn't the case, there was good reason to learn sed and awk. That reason is long dead.
Perl has a rich set of modules that provide much more power than one can get from awk or sed. For example, these modules enable one-liners that reverse-complement DNA sequences, compute statistics, parse CSV files, or calculate MD5 checksums. (See http://cpan.org/ for packages.)
Perl is essentially as terse as sed and awk. For people like me (and, I suspect, you), quickly transforming data on the command line is a great boon. Python's too wordy for efficient command line use.
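To make the terseness comparison concrete, here is a rough sketch of what a typical awk filter such as `awk '$3 > 100 {print $1, $3}'` looks like when written out in Python. The function name `filter_fields` and the sample input are hypothetical, and the sketch assumes whitespace-separated fields with a numeric third column.

```python
import io


def filter_fields(lines, threshold):
    """Roughly mimic awk '$3 > 100 {print $1, $3}' on whitespace-separated input."""
    for line in lines:
        fields = line.split()
        # Assumes the third field is numeric; float() raises ValueError otherwise.
        if len(fields) >= 3 and float(fields[2]) > threshold:
            yield f"{fields[0]} {fields[2]}"


# Hypothetical sample input standing in for a real file or sys.stdin.
sample = io.StringIO("a x 50\nb y 150\nc z 101\n")
selected = list(filter_fields(sample, 100))
print("\n".join(selected))
```

The Python version is longer than the one-liner, but it also reads line by line and is straightforward to extend.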

I'm honestly at a loss to think why one would learn sed and awk over Perl.

For the record, I'm not "a Perl guy". I like it as a Swiss Army knife, not as a religion.

answered Sep 30 '22 08:09

Reece
