I am reading in a file of data that looks like this:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith,"[email protected]","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
but the actual file as around 20000 records. I use the following R code to read it in:
user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")
And the reason I have quote=""
is because without it the import stops prematurely. I end up with a total of 9569 observations. Why I don't understand why exactly the quote=""
overcomes this problem, it seems to do so.
Except that it introduces other problems that I have to 'fix'. The first one I saw is that the dates end up being strings which include the quotes, which don't want to convert to actual dates when I use to.Date()
on them.
Now I could fix the strings and hack my way through. But better to know more about what I am doing. Can someone explain:
quote=""
fix the 'bad data'read.csv
to stop early? (If I just look at the input data at +/- the indicated row, I don't see anything amiss).Here are the lines 'near' the 'problem'. I don't see the damage do you?
"16888","user1","[email protected]","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","[email protected]","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","[email protected]","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","[email protected]","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","[email protected]","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
* Update *
It's more tricky. Even though the total number of rows imported is 9569, if I look at the last few rows they correspond to the last few rows of data. Therefore I surmise that something happened during the import to cause a lot of rows to be skipped. In fact 15914 - 9569 = 6345 records. When I have the quote="" in there I get 15914.
So my question can be modified: Is there a way to get read.csv to report about rows it decides not to import?
* UPDATE 2 *
@Dwin, I had to remove na.strings="\N" because the count.fields function doesn't permit it. With that, I get this output which looks interesting but I don't understand it.
3 4 22 23 24
1 83 15466 178 4
Your second command produces a lots of data (and stops when max.print is reached.) But the first row is this:
[1] 2 4 2 3 5 3 3 3 5 3 3 3 2 3 4 2 3 2 2 3 2 2 4 2 4 3 5 4 3 4 3 3 3 3 3 2 4
Which I don't understand if the output is supposed to show how many fields there are in each record of input. Clearly the first lines all have more than 2,4,2 etc fields... Feel like I am getting closer, but still confused!
The count.fields
function can be very useful in identifying where to look for malformed data.
This gives a tabulation of fields per line ignores quoting, possibly a problem if there are embedded commas:
table( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") )
This give a tabulation ignoring both quotes and "#"(octothorpe) as a comment character:
table( count.fields("~/Desktop/dbdump/users.txt", quote="", comment.char="") )
Atfer seeing what you report for the first tabulation..... most of which were as desired ... You can get a list of the line positions with non-22 values (using the comma and non-quote settings):
which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22)
Sometimes the problem can be solved with fill=TRUE
if the only difficulty is missing commas at the ends of lines.
One problem I have spotted (thanks to data.table
) is the missing quote (") after John Smith
. Could this be a problem also for other lines you have?
If I add the "missing" quote after John Smith
, it reads fine.
I saved this data to data.txt
:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith","[email protected]","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
"16888","user1","[email protected]","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","[email protected]","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","[email protected]","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","[email protected]","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","[email protected]","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
And this is a code. Both fread
and read.csv
works fine.
require(data.table)
dat1 <- fread("data.txt", header = T, na.strings = "\\N")
dat1
dat2 <- read.csv("data.txt", header = T, na.strings = "\\N")
dat2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With