
R workflow: How to handle hand-cleaning data

Tags:

r

Let me first say that I assiduously avoid hand-cleaning data in favor of regular expressions and the like. However, occasionally it is inevitable.

I normally use something like the Load-Clean-Func-Do workflow, so this obviously fits into the cleaning phase. However, any hand-editing means that the steps before the hand-cleaning can no longer simply be re-run if they need updating.

I can think of at least three ways to handle this:

  1. Put the by-hand changes as early in the workflow as possible, so that everything after that remains runnable.
  2. Write out regexes or assignment operations for every single change.
  3. Use a tool that generates (2) for you after you close the spreadsheet where you've made the changes.

The problem with 2 is that it can be extremely unwieldy. The problem with 3 is that I'm unaware of any such tool existing for R. Stata has an extremely good implementation of this.
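To make option 2 concrete, here is a minimal sketch of what scripted corrections might look like. The data frame `survey`, its columns, and the specific errors are all hypothetical, made up purely for illustration. Each hand-identified error gets its own line, which is why this approach grows unwieldy when there are many corrections.

```r
# Hypothetical survey data containing two data-entry errors.
survey <- data.frame(
  id   = c(1, 2, 3),
  name = c("Smith, J.", "SMITH  J", "Jones, A."),
  wage = c(15.5, 155, 22),
  stringsAsFactors = FALSE
)

# One explicit assignment per hand-identified error:
# the wage for id 2 was entered without its decimal point.
survey$wage[survey$id == 2] <- 15.5

# A regular expression for an error that follows a pattern:
# normalise the one badly formatted spelling of the name.
survey$name <- sub("^SMITH[[:space:]]+J$", "Smith, J.", survey$name)
```

Because every correction lives in the script, the whole pipeline stays re-runnable from the raw file.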

So the questions are:

  • Which results in the most replicable code with the least-frustrating code writing?
  • Does a tool as in (3) exist?
Ari B. Friedman asked Sep 21 '12 13:09



1 Answer

I agree that hand-cleaning is generally a rather bad idea. However, sometimes it is unavoidable. I'd suggest one of the following two approaches, or both:

  1. Keep a separate "data fixing" file with three variables: "case_id", "variable_name", and "value". Use it to record which values in the original data need to be replaced. You can add further variables to hold extra information about the cleaning (e.g. why the value of "variable_name" needs to be replaced with "value" for case "case_id"). Then have a short piece of R code that loads your original data and applies the corrections stored in the "fixing" file.

  2. Consider using a version control system such as git or Subversion (there are others as well). Every hand-made change to the data can be recorded as a separate commit, so you can easily check the log to see what changes you made to the data and when. You will also be able to generate patch files that transform the original data files into the cleaned ones. It is also beneficial to keep your R code files under version control.
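The first suggestion can be sketched in R roughly as follows. This is only an illustration under assumptions: the data frames `raw` and `fixes`, the extra `reason` column, and the helper `apply_fixes()` are hypothetical names, and in practice both tables would be read from files (for example with `read.csv()`) rather than built inline.

```r
# Hypothetical original data; normally loaded from the raw data file.
raw <- data.frame(
  case_id = c(101, 102, 103),
  age     = c(34, 340, 29),
  city    = c("Boston", "Bostn", "Chicago"),
  stringsAsFactors = FALSE
)

# Hypothetical "fixing" table; normally a small hand-maintained CSV.
# Values are stored as text because they target columns of different types.
fixes <- data.frame(
  case_id       = c("102", "102"),
  variable_name = c("age", "city"),
  value         = c("34", "Boston"),
  reason        = c("extra zero typed", "misspelling"),
  stringsAsFactors = FALSE
)

# Apply each recorded fix to the matching case and variable.
apply_fixes <- function(data, fixes, id_col = "case_id") {
  for (i in seq_len(nrow(fixes))) {
    row <- which(as.character(data[[id_col]]) == fixes$case_id[i])
    col <- fixes$variable_name[i]
    # Fail loudly if a fix points at a missing case or variable.
    stopifnot(length(row) == 1, col %in% names(data))
    new_value <- fixes$value[i]
    # Convert the stored text back to a number for numeric columns.
    if (is.numeric(data[[col]])) new_value <- as.numeric(new_value)
    data[row, col] <- new_value
  }
  data
}

clean <- apply_fixes(raw, fixes)
```

The raw file is never edited, so everything upstream can be re-run and the fixes re-applied afterwards. The `reason` column documents why each change was made.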

Michał answered Nov 20 '22 10:11
