 

Technique for finding bad data in read.csv in R

Tags:

r

I am reading in a file of data that looks like this:

userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith,"[email protected]","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"

but the actual file has around 20,000 records. I use the following R code to read it in:

user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")

The reason I have quote="" is that without it the import stops prematurely, and I end up with a total of 9569 observations. While I don't understand exactly why quote="" overcomes this problem, it seems to do so.

Except that it introduces other problems that I have to 'fix'. The first one I saw is that the dates end up as strings which include the quotes, and they won't convert to actual dates when I use as.Date() on them.
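For what it's worth, a minimal sketch of the date fix (assuming, as with quote="", that the imported strings keep their literal quote characters):

```r
# With quote="" the quotes become part of the value, so strip them first.
x <- '"2004-07-23 14:19:32"'       # a date as imported with quote=""
d <- as.Date(gsub('"', '', x))     # remove the literal quotes, then parse
format(d)                          # "2004-07-23"
```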

Now I could fix the strings and hack my way through, but it would be better to understand what I am doing. Can someone explain:

  1. Why does quote="" fix the 'bad data'?
  2. What is a best-practice technique to figure out what is causing the read.csv to stop early? (If I just look at the input data at +/- the indicated row, I don't see anything amiss).

Here are the lines 'near' the 'problem'. I don't see the damage; do you?

"16888","user1","[email protected]","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","[email protected]","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","[email protected]","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","[email protected]","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","[email protected]","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"

* Update *

It's trickier than that. Even though the total number of rows imported is 9569, the last few rows do correspond to the last few rows of the data. So I surmise that something during the import caused a lot of rows to be skipped; in fact 15914 - 9569 = 6345 records are missing. (When I have quote="" in there, I get all 15914.)

So my question can be modified: Is there a way to get read.csv to report about rows it decides not to import?
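One diagnostic I tried on a toy file, comparing each line's comma-field count against the header's (a sketch with made-up data; count.fields and readLines are base R):

```r
# Toy file: the second data line is one field short.
lines <- c("a,b,c", "1,2,3", "4,5", "6,7,8")
writeLines(lines, tf <- tempfile())
n <- count.fields(tf, sep = ",", quote = "", comment.char = "")
bad <- which(n != n[1])   # lines whose field count differs from the header's
readLines(tf)[bad]        # "4,5"
unlink(tf)
```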

* UPDATE 2 *

@Dwin, I had to remove na.strings="\N" because the count.fields function doesn't accept it. With that, I get this output, which looks interesting but which I don't understand:

3     4    22    23    24 
1    83 15466   178     4 

Your second command produces a lot of output (and stops when max.print is reached). The first row is this:

[1]  2  4  2  3  5  3  3  3  5  3  3  3  2  3  4  2  3  2  2  3  2  2  4  2  4  3  5  4  3  4  3  3  3  3  3  2  4

I don't understand this if the output is supposed to show how many fields there are in each input record. Clearly the first lines all have more than 2, 4, 2, etc. fields... I feel like I am getting closer, but I'm still confused!
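My best guess (an assumption, not something I've verified against the real file): count.fields defaults to sep = "", i.e. any whitespace, so a call without sep="," counts whitespace-separated chunks rather than comma-separated fields, which would explain the small numbers:

```r
line <- '"2","John Smith","[email protected]"'
writeLines(line, tf <- tempfile())
count.fields(tf, quote = "")             # 2: splits on the space in "John Smith"
count.fields(tf, quote = "", sep = ",")  # 3: splits on commas
unlink(tf)
```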

pitosalas asked Apr 02 '13 18:04


2 Answers

The count.fields function can be very useful in identifying where to look for malformed data.

This gives a tabulation of fields per line while ignoring quoting, which is possibly a problem if there are embedded commas:

table( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") ) 

This gives a tabulation that ignores quotes and also disables "#" (octothorpe) as a comment character:

table( count.fields("~/Desktop/dbdump/users.txt",  quote="", comment.char="") )

After seeing what you report for the first tabulation, most of which was as desired, you can get a list of the line positions with non-22 field counts (using the comma and no-quote settings):

which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22)

Sometimes the problem can be solved with fill=TRUE if the only difficulty is missing commas at the ends of lines.
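A toy illustration of the fill = TRUE case (note that read.csv already defaults to fill = TRUE, so read.table is used here to show the difference; the data are made up):

```r
writeLines(c("a,b,c", "1,2,3", "4,5"), tf <- tempfile())
# Without fill, read.table(tf, sep = ",", header = TRUE) errors on the short row.
df <- read.table(tf, sep = ",", header = TRUE, fill = TRUE)
df$c   # 3 NA -- the missing trailing field becomes NA
unlink(tf)
```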

IRTFM answered Sep 28 '22 08:09


One problem I have spotted (thanks to data.table) is the missing quote (") after John Smith. Could this also be a problem for other lines in your file?

If I add the "missing" quote after John Smith, it reads fine.

I saved this data to data.txt:

userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith","[email protected]","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
"16888","user1","[email protected]","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","[email protected]","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","[email protected]","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","[email protected]","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","[email protected]","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"

And this is the code. Both fread and read.csv work fine:

require(data.table)

dat1 <- fread("data.txt", header = T, na.strings = "\\N")
dat1

dat2 <- read.csv("data.txt", header = T, na.strings = "\\N")
dat2
djhurio answered Sep 28 '22 08:09