
Python or awk/sed for cleaning data [closed]

I use R for data analysis and am very happy with it. Cleaning data could be a bit easier, however. I am thinking about learning another language suited to this task. Specifically, I am looking for a tool to use to take raw data, remove unnecessary variables or observations, and format it for easy loading in R. Contents would be mostly numeric and string data, as opposed to multi-line text.

I am considering the awk/sed combination versus Python. (I recognize that Perl would be another option, but if I were going to learn another full language, Python seems to be a better, more extensible choice.)

The advantage of sed/awk is that it would be quicker to learn. The disadvantage is that this combination isn't as extensible as Python. Indeed, I might imagine some "mission creep" if I learned Python, which would be fine, but not my goal.

The other consideration that I had is applications to large data sets. As I understand it, awk/sed operate line-by-line, while Python would typically pull all the data into memory. This could be another advantage for sed/awk.
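On the memory point: Python does not have to load a whole file. Iterating over a file object, or over a `csv` reader built on one, yields one row at a time, much as awk does. The following is a minimal sketch under stated assumptions: the function name `clean_stream` and the column names in the usage note are hypothetical, and the input is assumed to be comma-separated with a header row.

```python
import csv


def clean_stream(src, dst, keep_columns, required_column):
    """Copy only `keep_columns` from src to dst, one row at a time.

    Rows whose `required_column` is missing or blank are dropped.
    Returns the number of rows written (header excluded).
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(
        dst,
        fieldnames=keep_columns,
        extrasaction="ignore",   # silently skip columns we do not keep
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:           # only the current row is held in memory
        value = row.get(required_column) or ""
        if not value.strip():
            continue             # drop observations with no value
        writer.writerow(row)
        kept += 1
    return kept
```

A call might look like `clean_stream(open("raw.csv", newline=""), open("clean.csv", "w", newline=""), ["id", "age"], "age")`, with hypothetical file and column names. The resulting file can then be read in R with `read.csv`.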

Are there other issues that I'm missing? Any advice that you can offer would be appreciated. (I included the R tag for R users to offer their cleaning recommendations.)

Charlie asked Sep 20 '11 03:09




2 Answers

Not to spoil your adventure, but I'd say no and here is why:

  • R is vectorised where sed/awk are not
  • R already supports both Perl-style regular expressions and extended regular expressions
  • R can more easily make recourse to statistical routines (say, imputation) if you need it
  • R can visualize, summarize, ...

and most importantly: you already know R.

That said, sed/awk are of course great for small programs or even one-liners, and Python is a fine language. But I would also consider sticking with R.

Dirk Eddelbuettel answered Sep 30 '22 08:09


I use Python and Perl regularly. I know sed fairly well and once used awk a lot. I've used R in fits and starts. Perl is the best of the bunch for data transformation, in both functionality and speed.

  • Perl can do essentially everything sed and awk can do, but lots more as well. (In fact, a2p and s2p, which shipped with Perl until version 5.22 moved them to CPAN, convert awk and sed scripts to Perl.)
  • Perl is included with most Linux/Unix systems. When that wasn't the case, there was good reason to learn sed and awk. That reason is long dead.
  • Perl has a rich set of modules that provide much more power than one can get from awk or sed. For example, these modules enable one-liners that reverse complement DNA sequences, compute statistics, parse CSV files, or calculate MD5s. (see http://cpan.org/ for packages)
  • Perl is essentially as terse as sed and awk. For people like me (and, I suspect, you), quickly transforming data on the command line is a great boon. Python's too wordy for efficient command line use.

I'm honestly at a loss to think why one would learn sed and awk over Perl.

For the record, I'm not "a Perl guy". I like it as a swiss army knife, not as a religion.

Reece answered Sep 30 '22 08:09