Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R read data set which has unequal column

Tags:

r

csv

readr

I have a .csv data set which is separated by "," and has about 5,000 rows and "5" columns.

However, for some columns, the content contains also ",", for example:

2660,11-01-2016,70.75,05-06-2013,I,,,

4080,26-02-2016,59.36,,D

Thus, when I tried to read it with read_delim(), it will throw me warnings, but the result shall be fine, for example:

Warning: 7 parsing failures.

row # A tibble: 5 x 5 col row col expected actual file expected actual 1 309 5 columns 8 columns 'data/my_data.csv' file 2 523 5 columns 7 columns 'data/my_data.csv' row 3 588 5 columns 8 columns 'data/my_data.csv' col 4 1661 5 columns 9 columns 'data/my_data.csv' expected 5 1877 5 columns 7 columns 'data/my_data.csv'

Is there any way for me to tackle this problem?

I guess I could use read_Lines() and process it one by one and then turn them into a data frame.

Do you have any other ways to deal with such a situation?

like image 275
rz.He Avatar asked Oct 19 '25 10:10

rz.He


1 Answers

1) read.table with fill=TRUE Using fill=TRUE with read.table results in no warnings:

Lines <- "2660,11-01-2016,70.75,05-06-2013,I,,,
4080,26-02-2016,59.36,,D"

# replace text = Lines with your filename    
read.table(text = Lines, sep = ",", fill = TRUE)

giving:

    V1         V2    V3         V4 V5 V6 V7 V8
1 2660 11-01-2016 70.75 05-06-2013  I NA NA NA
2 4080 26-02-2016 59.36             D NA NA NA

2) replace 1st 4 commas with semicolon Another approach would be:

# replace textConnection(Lines) with your filename
L <- readLines(textConnection(Lines))
for(i in 1:4) L <- sub(",", ";", L)
read.table(text = L, sep = ";")

giving:

    V1         V2    V3         V4   V5
1 2660 11-01-2016 70.75 05-06-2013 I,,,
2 4080 26-02-2016 59.36               D

3) remove commas at end of lines Another possibility is to remove commas at the end of lines. (If you are on Windows then sed is in the Rtools distribution.)

read.table(pipe("sed -e s/,*$// readtest.csv"), sep = ",")

giving:

    V1         V2    V3         V4 V5
1 2660 11-01-2016 70.75 05-06-2013  I
2 4080 26-02-2016 59.36             D

3a) similar to (3) but without sed

# replace textConnection(Lines) with your filename
L <- readLines(textConnection(Lines))
read.table(text = sub(",*$", "", L), sep = ",")
like image 54
G. Grothendieck Avatar answered Oct 22 '25 00:10

G. Grothendieck



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!