I use R for data analysis and am very happy with it. Cleaning data could be a bit easier, however. I am thinking about learning another language suited to this task. Specifically, I am looking for a tool to use to take raw data, remove unnecessary variables or observations, and format it for easy loading in R. Contents would be mostly numeric and string data, as opposed to multi-line text.
I am considering the awk/sed combination versus Python. (I recognize that Perl would be another option, but if I were going to learn another full language, Python seems to be a better, more extensible choice.)
The advantage of sed/awk is that it would be quicker to learn. The disadvantage is that this combination isn't as extensible as Python. Indeed, I might imagine some "mission creep" if I learned Python, which would be fine, but not my goal.
The other consideration that I had is applications to large data sets. As I understand it, awk/sed operate line-by-line, while Python would typically pull all the data into memory. This could be another advantage for sed/awk.
Are there other issues that I'm missing? Any advice that you can offer would be appreciated. (I included the R tag for R users to offer their cleaning recommendations.)
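On the memory concern raised in the question: Python does not have to pull a whole file into memory. Iterating over a file object, or over a `csv` reader, yields one row at a time, much as awk does. Below is a minimal sketch of that approach, assuming comma-separated input with a header row. The column names, the sample data, and the `clean_rows` helper are hypothetical and exist only for illustration.

```python
import csv
import io


def clean_rows(src, dst, keep_columns, keep_row):
    """Stream CSV rows from src to dst, keeping only selected columns and rows."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(
        dst,
        fieldnames=keep_columns,
        extrasaction="ignore",  # silently drop columns not listed in keep_columns
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:  # one row at a time; the input is never fully loaded
        if keep_row(row):
            writer.writerow(row)


# Hypothetical sample data standing in for a large raw file.
raw = io.StringIO("id,name,score,notes\n1,ann,3.5,x\n2,bob,,y\n3,cy,4.0,z\n")
out = io.StringIO()

# Keep only the id and score columns, and drop rows with a missing score.
clean_rows(raw, out, ["id", "score"], lambda r: r["score"] != "")
print(out.getvalue(), end="")
```

With real data, `raw` and `out` would be files opened with `open(path, newline="")`, and the result should load directly with `read.csv` in R.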
The main difference between sed and awk is that sed is a command-line utility that works on streams of characters for searching, filtering, and text processing, while awk is more powerful and robust, with sophisticated programming constructs such as if/else, while, and do/while.
For me it's about efficiency and not interrupting workflow. I only use sed/awk for simple tasks when I know it will be faster to type out and execute than a perl/python script.
awk is most useful when handling text files that are formatted in a predictable way. For instance, it is excellent at parsing and manipulating tabular data. It operates on a line-by-line basis and iterates through the entire file. By default, it uses whitespace (spaces, tabs, etc.) to separate fields.
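For comparison, here is a rough Python sketch of the same line-by-line, whitespace-splitting model, roughly what `awk -v OFS='\t' '{ print $1, $3 }'` does. The `select_fields` helper and the sample data are made up for illustration. Note that the indexes are zero-based here, and that, unlike awk, this sketch skips lines with too few fields.

```python
import io


def select_fields(lines, indexes):
    """Yield the chosen whitespace-separated fields of each line, tab-joined."""
    for line in lines:
        fields = line.split()  # no argument: split on runs of spaces and tabs
        if len(fields) <= max(indexes):
            continue  # skip blank lines and lines with too few fields
        yield "\t".join(fields[i] for i in indexes)


# Hypothetical sample input with mixed spaces, tabs, and a blank line.
sample = io.StringIO("alice  30   NY\nbob\t25\tLA\n\ncarol 41 SF\n")

# Keep the first and third fields of each line.
result = list(select_fields(sample, [0, 2]))
print("\n".join(result))
```

The generator reads one line at a time, so passing an open file instead of `sample` should work on files larger than memory.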
AWK is worth learning completely. It hits a real sweet spot in terms of minimizing the number of lines of code needed to write useful programs in the world of quasi-structured (not quite CSV but not completely free form) data. You can learn the whole language and become proficient in an afternoon.
Not to spoil your adventure, but I'd say no and here is why:
and most importantly: you already know R.
That said, of course sed/awk are great for small programs or even one-liners, and Python is a fine language. But I would also consider sticking with R.
I use Python and Perl regularly. I know sed fairly well and once used awk a lot. I've used R in fits and starts. Perl is the best of the bunch for data transformation, in both functionality and speed.
I'm honestly at a loss to think why one would learn sed and awk over Perl.
For the record, I'm not "a Perl guy". I like it as a Swiss Army knife, not as a religion.