 

fread segfault on sparse 132MB file with 613 columns of survey data

I've been learning data.table recently. However, when I use fread to read the data from "http://dl.dropbox.com/u/20498362/GSS.csv", R crashes with a segfault. How can I investigate this further? To reproduce, just download the file and run:

fread("GSS.csv")

The file has many NA values, and the first column has no column name. However, it still crashes if I add "rownames=TRUE".

Thanks!

David Wang asked Oct 02 '22 21:10


1 Answer

Update: now fixed in v1.9.4 on CRAN.
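
To check whether an installed copy predates that fix, something like the following may help. It is only a sketch: `needs_upgrade` is a hypothetical helper written for this example, not a data.table function, and it compares versions with base R's `package_version()` so that the comparison is numeric per component rather than a plain string comparison.

```r
# Hypothetical helper (not part of data.table): TRUE if the given version
# predates the release that the update above says contains the fix.
needs_upgrade <- function(installed, fixed_in = "1.9.4") {
  package_version(installed) < package_version(fixed_in)
}

needs_upgrade("1.8.10")  # the version loaded in the session shown below
needs_upgrade("1.9.4")   # the release named in the update

# On a machine with data.table installed, the same check would be:
# needs_upgrade(as.character(packageVersion("data.table")))
```

If the check returns TRUE, updating from CRAN with `install.packages("data.table")` should pick up a release that includes the fix.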


Previous answer ...

Many thanks for the reproducible example! I also see the crash. Fantastic!!

Let's turn on verbose=TRUE to get more clues...

$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> require(data.table)
Loading required package: data.table
data.table 1.8.10  For help type: help("data.table")

> fread("GSS.csv", verbose=TRUE)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 613 columns
First row with 613 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 55088
Subtracted 1 for last eol and any trailing empty lines, leaving 55087 data rows
Type codes: 3002000030033030000033003000000033000300330000000030000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003330003330000000000000000000000000000000000000000000000000003330000000000000003000303000000000000000000000000000000000033000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000303 (first 5 rows)
Type codes: 3002000030033030000033003330000033032300333300000033000033330000000000000000000000000000000000000000000000000003300003333333330000000000000000000000300030000000000000000000000000000000000000000000000000000000000000000003333300003330000000033000000000000000000000000000000000000000000000000000000000000000000000000333000000000000300000003333333330000000000000000000000000000000000000000000000000003332000000000000003303333000000000000000003330000003000000333333333333333333333333300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000030033333330300000000000333 (+middle 5 rows)
Type codes: 3002000033033033000033003333000033032300333300000033000033333330000000000000000000000000000000000000000000000003300003333333330000000000000000000000300030000000000000300000000000000000000000000000000000000000000000000003333300003330000000033000000000000000000000000000000000000000000000000000000000000000000000000333000000000000300000003333333330000000000000000000000000000000000000000000000000003332200000300033003303333000000000000000003330003333000000333333333333333333333333300000000000000000000000030030000000000000000000000000000000000000000000000000000000000000000000000000000000030033333330300000000000333 (+last 5 rows)
Bumping column 39 from INT to INT64 on data row 1614, field contains '"working class"'
Bumping column 39 from INT64 to REAL on data row 1614, field contains '"working class"'
Bumping column 39 from REAL to STR on data row 1614, field contains '"working class"'
Bumping column 225 from INT to INT64 on data row 1614, field contains '"disagree"'
Bumping column 225 from INT64 to REAL on data row 1614, field contains '"disagree"'
Bumping column 225 from REAL to STR on data row 1614, field contains '"disagree"'
Bumping column 226 from INT to INT64 on data row 1614, field contains '"disagree"'
Bumping column 226 from INT64 to REAL on data row 1614, field contains '"disagree"'
Bumping column 226 from REAL to STR on data row 1614, field contains '"disagree"'
Bumping column 227 from INT to INT64 on data row 1614, field contains '"disagree"'
Bumping column 227 from INT64 to REAL on data row 1614, field contains '"disagree"'
Bumping column 227 from REAL to STR on data row 1614, field contains '"disagree"'
Bumping column 228 from INT to INT64 on data row 1614, field contains '"disagree"'
Bumping column 228 from INT64 to REAL on data row 1614, field contains '"disagree"'
Bumping column 228 from REAL to STR on data row 1614, field contains '"disagree"'
Bumping column 232 from INT to INT64 on data row 1614, field contains '"agree"'
Bumping column 232 from INT64 to REAL on data row 1614, field contains '"agree"'
Bumping column 232 from REAL to STR on data row 1614, field contains '"agree"'
Bumping column 233 from INT to INT64 on data row 1614, field contains '"agree"'
Bumping column 233 from INT64 to REAL on data row 1614, field contains '"agree"'
Bumping column 233 from REAL to STR on data row 1614, field contains '"agree"'
Bumping column 307 from INT to INT64 on data row 1614, field contains '"no"'
Bumping column 307 from INT64 to REAL on data row 1614, field contains '"no"'
Bumping column 307 from REAL to STR on data row 1614, field contains '"no"'
Bumping column 308 from INT to INT64 on data row 1614, field contains '"no"'
Bumping column 308 from INT64 to REAL on data row 1614, field contains '"no"'
Bumping column 308 from REAL to STR on data row 1614, field contains '"no"'
Bumping column 309 from INT to INT64 on data row 1614, field contains '"no"'
Bumping column 309 from INT64 to REAL on data row 1614, field contains '"no"'
Bumping column 309 from REAL to STR on data row 1614, field contains '"no"'
Bumping column 310 from INT to INT64 on data row 1614, field contains '"no"'
Bumping column 310 from INT64 to REAL on data row 1614, field contains '"no"'
Bumping column 310 from REAL to STR on data row 1614, field contains '"no"'
Bumping column 311 from INT to INT64 on data row 1614, field contains '"no"'
Bumping column 311 from INT64 to REAL on data row 1614, field contains '"no"'
Bumping column 311 from REAL to STR on data row 1614, field contains '"no"'
Bumping column 3 from INT to INT64 on data row 9121, field contains '2.54999995231628'
Bumping column 3 from INT64 to REAL on data row 9121, field contains '2.54999995231628'
Bumping column 234 from INT to INT64 on data row 9121, field contains '"not feel"'
Bumping column 234 from INT64 to REAL on data row 9121, field contains '"not feel"'
Bumping column 234 from REAL to STR on data row 9121, field contains '"not feel"'
Bumping column 235 from INT to INT64 on data row 9121, field contains '"feel"'
Bumping column 235 from INT64 to REAL on data row 9121, field contains '"feel"'
Bumping column 235 from REAL to STR on data row 9121, field contains '"feel"'
Bumping column 236 from INT to INT64 on data row 9121, field contains '"feel"'
Bumping column 236 from INT64 to REAL on data row 9121, field contains '"feel"'
Bumping column 236 from REAL to STR on data row 9121, field contains '"feel"'
Bumping column 237 from INT to INT64 on data row 9121, field contains '"not feel"'
Bumping column 237 from INT64 to REAL on data row 9121, field contains '"not feel"'
Bumping column 237 from REAL to STR on data row 9121, field contains '"not feel"'
Bumping column 238 from INT to INT64 on data row 9121, field contains '"feel"'
Bumping column 238 from INT64 to REAL on data row 9121, field contains '"feel"'
Bumping column 238 from REAL to STR on data row 9121, field contains '"feel"'
Bumping column 239 from INT to INT64 on data row 9121, field contains '"feel"'
Bumping column 239 from INT64 to REAL on data row 9121, field contains '"feel"'
Bumping column 239 from REAL to STR on data row 9121, field contains '"feel"'
Bumping column 2 from INT to INT64 on data row 12121, field contains '1.23500001430511'
Bumping column 2 from INT64 to REAL on data row 12121, field contains '1.23500001430511'
Bumping column 49 from INT to INT64 on data row 12121, field contains '"now and then"'
Bumping column 49 from INT64 to REAL on data row 12121, field contains '"now and then"'
Bumping column 49 from REAL to STR on data row 12121, field contains '"now and then"'
Bumping column 330 from INT to INT64 on data row 12121, field contains '"worst kind"'
Bumping column 330 from INT64 to REAL on data row 12121, field contains '"worst kind"'
Bumping column 330 from REAL to STR on data row 12121, field contains '"worst kind"'
Bumping column 609 from INT to INT64 on data row 12121, field contains '"good purpose"'
Bumping column 609 from INT64 to REAL on data row 12121, field contains '"good purpose"'
Bumping column 609 from REAL to STR on data row 12121, field contains '"good purpose"'
Bumping column 610 from INT to INT64 on data row 12121, field contains '"most of the time"'
Bumping column 610 from INT64 to REAL on data row 12121, field contains '"most of the time"'
Bumping column 610 from REAL to STR on data row 12121, field contains '"most of the time"'
Bumping column 98 from INT to INT64 on data row 15580, field contains '"somewhat agree"'
Bumping column 98 from INT64 to REAL on data row 15580, field contains '"somewhat agree"'
Bumping column 98 from REAL to STR on data row 15580, field contains '"somewhat agree"'
Bumping column 99 from INT to INT64 on data row 15580, field contains '"somewhat agree"'
Bumping column 99 from INT64 to REAL on data row 15580, field contains '"somewhat agree"'
Bumping column 99 from REAL to STR on data row 15580, field contains '"somewhat agree"'
Bumping column 100 from INT to INT64 on data row 15580, field contains '"strongly agree"'
Bumping column 100 from INT64 to REAL on data row 15580, field contains '"strongly agree"'
Bumping column 100 from REAL to STR on data row 15580, field contains '"strongly agree"'
Bumping column 101 from INT to INT64 on data row 15580, field contains '"somewht disagree"'
Bumping column 101 from INT64 to REAL on data row 15580, field contains '"somewht disagree"'
Bumping column 101 from REAL to STR on data row 15580, field contains '"somewht disagree"'
Bumping column 102 from INT to INT64 on data row 15580, field contains '"strongly agree"'
Bumping column 102 from INT64 to REAL on data row 15580, field contains '"strongly agree"'
Bumping column 102 from REAL to STR on data row 15580, field contains '"strongly agree"'
Bumping column 103 from INT to INT64 on data row 15580, field contains '"strongly agree"'
Bumping column 103 from INT64 to REAL on data row 15580, field contains '"strongly agree"'
Bumping column 103 from REAL to STR on data row 15580, field contains '"strongly agree"'
Bumping column 104 from INT to INT64 on data row 15580, field contains '"somewhat agree"'
Bumping column 104 from INT64 to REAL on data row 15580, field contains '"somewhat agree"'
Bumping column 104 from REAL to STR on data row 15580, field contains '"somewhat agree"'
Bumping column 250 from INT to INT64 on data row 15580, field contains '"somewht disagree"'
Bumping column 250 from INT64 to REAL on data row 15580, field contains '"somewht disagree"'
Bumping column 250 from REAL to STR on data row 15580, field contains '"somewht disagree"'
Bumping column 251 from INT to INT64 on data row 15580, field contains '"somewhat agree"'
Bumping column 251 from INT64 to REAL on data row 15580, field contains '"somewhat agree"'
Bumping column 251 from REAL to STR on data row 15580, field contains '"somewhat agree"'
Bumping column 252 from INT to INT64 on data row 15580, field contains '"somewht disagree"'
Bumping column 252 from INT64 to REAL on data row 15580, field contains '"somewht disagree"'
Bumping column 252 from REAL to STR on data row 15580, field contains '"somewht disagree"'
Bumping column 254 from INT to INT64 on data row 15580, field contains '"somewht disagree"'
Bumping column 254 from INT64 to REAL on data row 15580, field contains '"somewht disagree"'
Bumping column 254 from REAL to STR on data row 15580, field contains '"somewht disagree"'
Bumping column 256 from INT to INT64 on data row 15580, field contains '"somewhat agree"'
Bumping column 256 from INT64 to REAL on data row 15580, field contains '"somewhat agree"'
Bumping column 256 from REAL to STR on data row 15580, field contains '"somewhat agree"'
Bumping column 257 from INT to INT64 on data row 15580, field contains '"somewhat agree"'
Bumping column 257 from INT64 to REAL on data row 15580, field contains '"somewhat agree"'
Bumping column 257 from REAL to STR on data row 15580, field contains '"somewhat agree"'
Bumping column 105 from INT to INT64 on data row 15581, field contains '"somewhat agree"'
Bumping column 105 from INT64 to REAL on data row 15581, field contains '"somewhat agree"'
Bumping column 105 from REAL to STR on data row 15581, field contains '"somewhat agree"'
Bumping column 253 from INT to INT64 on data row 15581, field contains '"strngly disagree"'
Bumping column 253 from INT64 to REAL on data row 15581, field contains '"strngly disagree"'
Bumping column 253 from REAL to STR on data row 15581, field contains '"strngly disagree"'
Bumping column 255 from INT to INT64 on data row 15581, field contains '"strngly disagree"'
Bumping column 255 from INT64 to REAL on data row 15581, field contains '"strngly disagree"'
Bumping column 255 from REAL to STR on data row 15581, field contains '"strngly disagree"'
Bumping column 64 from INT to INT64 on data row 15584, field contains '"too little"'
Bumping column 64 from INT64 to REAL on data row 15584, field contains '"too little"'
Bumping column 64 from REAL to STR on data row 15584, field contains '"too little"'
Bumping column 65 from INT to INT64 on data row 15584, field contains '"too little"'
Bumping column 65 from INT64 to REAL on data row 15584, field contains '"too little"'
Bumping column 65 from REAL to STR on data row 15584, field contains '"too little"'
Bumping column 66 from INT to INT64 on data row 15584, field contains '"too little"'
Bumping column 66 from INT64 to REAL on data row 15584, field contains '"too little"'
Bumping column 66 from REAL to STR on data row 15584, field contains '"too little"'
Bumping column 67 from INT to INT64 on data row 15584, field contains '"too little"'
Bumping column 67 from INT64 to REAL on data row 15584, field contains '"too little"'
Bumping column 67 from REAL to STR on data row 15584, field contains '"too little"'
Bumping column 71 from INT to INT64 on data row 17053, field contains '"pay more"'
Bumping column 71 from INT64 to REAL on data row 17053, field contains '"pay more"'
Bumping column 71 from REAL to STR on data row 17053, field contains '"pay more"'
Bumping column 72 from INT to INT64 on data row 17053, field contains '"neither"'
Bumping column 72 from INT64 to REAL on data row 17053, field contains '"neither"'
Bumping column 72 from REAL to STR on data row 17053, field contains '"neither"'
Bumping column 73 from INT to INT64 on data row 17053, field contains '"neither"'
Bumping column 73 from INT64 to REAL on data row 17053, field contains '"neither"'
Bumping column 73 from REAL to STR on data row 17053, field contains '"neither"'
Bumping column 74 from INT to INT64 on data row 17053, field contains '"neither"'
Bumping column 74 from INT64 to REAL on data row 17053, field contains '"neither"'
Bumping column 74 from REAL to STR on data row 17053, field contains '"neither"'
Bumping column 75 from INT to INT64 on data row 17053, field contains '"neither"'
Bumping column 75 from INT64 to REAL on data row 17053, field contains '"neither"'
Bumping column 75 from REAL to STR on data row 17053, field contains '"neither"'
Bumping column 76 from INT to INT64 on data row 17053, field contains '"in favor"'
Bumping column 76 from INT64 to REAL on data row 17053, field contains '"in favor"'
Bumping column 76 from REAL to STR on data row 17053, field contains '"in favor"'
Bumping column 77 from INT to INT64 on data row 17053, field contains '"neither"'
Bumping column 77 from INT64 to REAL on data row 17053, field contains '"neither"'
Bumping column 77 from REAL to STR on data row 17053, field contains '"neither"'
Bumping column 78 from INT to INT64 on data row 17053, field contains '"neither"'
Bumping column 78 from INT64 to REAL on data row 17053, field contains '"neither"'
Bumping column 78 from REAL to STR on data row 17053, field contains '"neither"'
Bumping column 79 from INT to INT64 on data row 17053, field contains '"spend same"'
Bumping column 79 from INT64 to REAL on data row 17053, field contains '"spend same"'
Bumping column 79 from REAL to STR on data row 17053, field contains '"spend same"'
Bumping column 80 from INT to INT64 on data row 17053, field contains '"spend more"'
Bumping column 80 from INT64 to REAL on data row 17053, field contains '"spend more"'
Bumping column 80 from REAL to STR on data row 17053, field contains '"spend more"'
Bumping column 81 from INT to INT64 on data row 17053, field contains '"spend same"'
Bumping column 81 from INT64 to REAL on data row 17053, field contains '"spend same"'
Bumping column 81 from REAL to STR on data row 17053, field contains '"spend same"'
Bumping column 82 from INT to INT64 on data row 17053, field contains '"spend more"'
Bumping column 82 from INT64 to REAL on data row 17053, field contains '"spend more"'
Bumping column 82 from REAL to STR on data row 17053, field contains '"spend more"'
Bumping column 83 from INT to INT64 on data row 17053, field contains '"spend less"'
Bumping column 83 from INT64 to REAL on data row 17053, field contains '"spend less"'
Bumping column 83 from REAL to STR on data row 17053, field contains '"spend less"'
Bumping column 84 from INT to INT64 on data row 17053, field contains '"spend same"'
Bumping column 84 from INT64 to REAL on data row 17053, field contains '"spend same"'
Bumping column 84 from REAL to STR on data row 17053, field contains '"spend same"'
Bumping column 85 from INT to INT64 on data row 17053, field contains '"spend same"'

 *** caught segfault ***
address 0x56a24, cause 'memory not mapped'

Traceback:
 1: fread("GSS.csv", verbose = TRUE)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 
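
The verbose output above is long, so it can help to condense it before reading it. Below is a minimal base R sketch, assuming the output has been captured into a character vector (`log_lines` is a hypothetical name, filled here with a few lines copied from above). The regular expression is an assumption based on the lines shown, not a documented format.

```r
# A few lines copied from the verbose output above.
log_lines <- c(
  "Bumping column 39 from INT to INT64 on data row 1614, field contains '\"working class\"'",
  "Bumping column 39 from INT64 to REAL on data row 1614, field contains '\"working class\"'",
  "Bumping column 39 from REAL to STR on data row 1614, field contains '\"working class\"'",
  "Bumping column 3 from INT to INT64 on data row 9121, field contains '2.54999995231628'",
  "Bumping column 3 from INT64 to REAL on data row 9121, field contains '2.54999995231628'"
)

# Capture groups: column number, old type, new type, data row.
pattern <- "^Bumping column ([0-9]+) from ([A-Z0-9]+) to ([A-Z0-9]+) on data row ([0-9]+),.*$"
hits <- grepl(pattern, log_lines)

bumps <- data.frame(
  column = as.integer(sub(pattern, "\\1", log_lines[hits])),
  to     = sub(pattern, "\\3", log_lines[hits]),
  row    = as.integer(sub(pattern, "\\4", log_lines[hits])),
  stringsAsFactors = FALSE
)

# Keep only the last bump per column, i.e. the type each column ended up with.
final <- bumps[!duplicated(bumps$column, fromLast = TRUE), ]
final
```

With these five lines, column 39 ends up as STR and column 3 stops at REAL, which matches what the full log shows for those two columns.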

What appears to be happening is that the 132MB file is quite sparse (many blank fields). There are 613 columns and 55087 rows. Owing to the sparseness, the first 5, middle 5 and last 5 rows aren't enough to detect that those columns are character. When fread reaches the first populated field of such a column, it correctly bumps the column type. It does this for a lot of columns here, which normally works fine. Then it crashes.
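
To make the sparseness point concrete, here is a toy illustration in base R. It is not fread's actual implementation: `guess_type` is a hypothetical stand-in for the type detection, and the column is made up to resemble column 39 in the log (blank everywhere except data row 1614).

```r
# Toy column: blank almost everywhere, text appears only at data row 1614.
n <- 55087
col <- rep("", n)
col[1614] <- "working class"

# Hypothetical type guesser: picks the lowest type that fits every
# non-blank value it is shown. Blank-only input stays at the lowest type.
guess_type <- function(x) {
  filled <- x[nzchar(x)]
  if (length(filled) == 0L) return("integer")
  if (all(grepl("^-?[0-9]+$", filled))) return("integer")
  if (all(!is.na(suppressWarnings(as.numeric(filled))))) return("numeric")
  "character"
}

# The first 5, middle 5 and last 5 rows, as described above.
mid <- n %/% 2L
sampled <- c(1:5, (mid - 2L):(mid + 2L), (n - 4L):n)

guess_type(col[sampled])  # "integer": the 15 sampled rows are all blank
guess_type(col)           # "character": only seen when every row is read
which(nzchar(col))[1]     # 1614: the first row that forces a type bump
```

The sampled rows never see the text value, so the column starts at the lowest type and has to be promoted once the full read reaches row 1614. In the real file that situation applies to many columns at once.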

Thank you very much! I've filed a bug report here:

#493: Reproducible crash in fread

Matt Dowle answered Oct 05 '22 12:10