Invalid Input causes read.csv to cut off data

Tags:

r

I have been trying to read a csv file into R, but it keeps cutting off. I think it might be due to the file encoding, but I'm not sure.

Here is the code I ran:

read.csv('crunchbase_companies_2.csv', fileEncoding="UTF-8", quote="")

I then get a warning message: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,: invalid input found on input connection.

R reads the data, but only up to the first special character, and then stops, so I end up with only partial data in R. I pasted the data I get here: http://pastebin.com/EQLnXz2W. Note that it cuts off when it hits characters like 'Ì', so those characters are not in the sample data.

I have also checked the encoding in the terminal using file. It returns Non-ISO extended-ASCII English text, with CR line terminators.

What do I need to do to read the entire dataset?

Brian Fabian Crain asked Oct 26 '13 19:10



3 Answers

I ran into a similar problem today and spent hours on it. I tried changing encoding/fileEncoding, the locale (Sys.setlocale), and a couple of other things found here, but none of them worked for me.

Eventually I found a non-English post (those people probably have more experience with this) with this trick: change the open mode from "r" to "rb".

In my case, I use readLines, so it's

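# "rb" opens the connection in binary mode instead of text mode ("r")
fileIn <- file("userinfo.csv", open = "rb", encoding = "UTF-8")
# rowPerRead is the number of lines to read per call (defined elsewhere)
lines <- readLines(fileIn, n = rowPerRead, warn = FALSE)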

I don't fully understand why this works. My guess is that the Unicode character is stored as multiple bytes, so if the file isn't read byte by byte, that character simply blocks the scan.
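For illustration, here is a minimal sketch of the binary-mode approach on a small temporary file. The file contents are made up, and the source encoding is assumed to be Latin-1; unlike the snippet above, the sketch leaves out encoding= on the connection and converts explicitly with iconv afterwards.

```r
# Hypothetical sample file: the second line contains the Latin-1 byte 0xE9
# ("é"), which is not valid UTF-8 on its own.
path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("id,name\n1,Caf"), as.raw(0xE9), charToRaw("\n2,Oslo\n")),
  path
)

# Open the connection in binary mode ("rb") and read every line as-is.
con <- file(path, open = "rb")
lines <- readLines(con, warn = FALSE)
close(con)

# Convert from the assumed source encoding (Latin-1) to UTF-8.
lines_utf8 <- iconv(lines, from = "latin1", to = "UTF-8")

length(lines_utf8)           # number of lines read, header included
all(validUTF8(lines_utf8))   # TRUE if every line is now valid UTF-8
```

If the conversion looks right, the converted lines can be handed to read.csv through its text argument to get a data frame.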

Yuan Ren answered Nov 10 '22 01:11



After hours of struggling with a csv like this, experimenting with arguments to read.csv like fileEncoding and quote, I finally used read_csv from the readr package, simply with the default arguments, and it loaded everything perfectly straight away!

An unimaginative answer but worth trying before you attempt to reverse engineer the whole file yourself...
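A rough sketch of what that could look like, assuming readr is installed. The sample file is made up, and passing a Latin-1 locale is an extra step that goes beyond the default call described above; it is only there so that the non-ASCII bytes come out as readable text.

```r
# Hypothetical sample file containing the Latin-1 byte 0xE9 ("é").
path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("id,name\n1,Caf"), as.raw(0xE9), charToRaw("\n2,Oslo\n")),
  path
)

df <- NULL
if (requireNamespace("readr", quietly = TRUE)) {
  # Assumption: the file is Latin-1. locale(encoding = ...) tells readr
  # which encoding to convert from; suppressMessages hides the column summary.
  df <- suppressMessages(
    readr::read_csv(path, locale = readr::locale(encoding = "latin1"))
  )
  print(df)
}
```

With the default arguments readr assumes UTF-8, so if the text looks garbled after loading, the locale argument is the first thing to try.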

Tom Wagstaff answered Nov 10 '22 03:11



I don't quite know why, but what ended up working was changing fileEncoding to "latin1" when calling read.csv.

This was mentioned in a different answer here. Somehow that's one thing I hadn't tried...
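As a small self-contained sketch of that fix: the file below is made-up sample data with Latin-1 bytes for 'é' and 'Ì', and the session is assumed to run in a UTF-8 locale.

```r
# Hypothetical sample data: 0xE9 is "é" and 0xCC is "Ì" in Latin-1.
path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("company,city\nCaf"), as.raw(0xE9), charToRaw(" Labs,Paris\n"),
    as.raw(0xCC), charToRaw("ris Tech,Oslo\n")),
  path
)

# Declare the file's encoding as Latin-1 so R re-encodes it while reading.
companies <- read.csv(path, fileEncoding = "latin1")

nrow(companies)     # number of data rows that were read
companies$company   # company names, including the non-ASCII characters
```

Every byte value is a valid character in Latin-1, which would explain why the read no longer stops at 'Ì'. Whether Latin-1 is really the file's encoding is a separate question, so it is worth checking that the special characters look right afterwards.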

Brian Fabian Crain answered Nov 10 '22 03:11
