Reading big data with fixed width


How can I read a large file formatted with fixed-width fields? I read this question and tried some of the tips, but all the answers are for delimited data (such as .csv), which is not my case. The file is 558 MB, and I don't know how many lines it has.

I'm using:

    dados <- read.fwf('TS_MATRICULA_RS.txt',
        width=c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2,
                9, 9, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1),
        stringsAsFactors=FALSE, comment.char='',
        colClasses=c('integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'character', 'character', 'character',
                     'integer', 'integer', 'character', 'integer', 'integer', 'character', 'integer', 'character', 'character', 'character', 'character', 'character', 'character',
                     'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character',
                     'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'integer',
                     'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'character', 'integer', 'integer', 'character', 'character', 'character',
                     'character', 'integer', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character'),
        buffersize=180000)

But it takes 30 minutes (and counting...) to read the data. Any new suggestions?


asked Sep 10 '13 13:09

Rcoster

1 Answer

Without enough details about your data, it's hard to give a concrete answer, but here are some ideas to get you started:

First, if you're on a Unix system, you can get some information about your file by using the wc command. For example, wc -l TS_MATRICULA_RS.txt reports how many lines there are in your file, and wc -L TS_MATRICULA_RS.txt reports the length of the longest line. Similarly, head and tail let you inspect the first and last 10 lines of your text file.
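If those command-line tools are not available (on Windows, for instance), roughly the same information can be gathered from R itself. The following is only a sketch: the helper name `count_lines` and the chunk size are illustrative assumptions, and the "longest line" is measured in bytes. It reads the file in chunks so that the whole file never has to sit in memory at once.

```r
## Hypothetical helper: count lines and track the longest line (in bytes),
## reading the file chunk by chunk through an open connection
count_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n_lines <- 0L
  longest <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break          # end of file reached
    n_lines <- n_lines + length(chunk)
    longest <- max(longest, nchar(chunk, type = "bytes"))
  }
  list(lines = n_lines, longest = longest)
}

## Small self-contained demonstration on a temporary file
tmp <- tempfile()
writeLines(c("123456", "98765432", "abc"), tmp)
info <- count_lines(tmp)
unlink(tmp)
```

For the real file, `count_lines("TS_MATRICULA_RS.txt")` would stand in for the two wc calls, and `readLines("TS_MATRICULA_RS.txt", n = 10)` plays the role of head.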

Second, some suggestions: Since it appears that you know the widths of each field, I would recommend one of two approaches.

Option 1: `csvkit` + your favorite method to quickly read large data

csvkit is a set of Python tools for working with CSV files. One of the tools is in2csv, which takes a fixed-width-format file combined with a "schema" file to create a proper CSV that can be used with other programs.

The schema file is, itself, a CSV file with three columns: (1) variable name, (2) start position, and (3) width. An example (from the in2csv man page) is:

    column,start,length
    name,0,30
    birthday,30,10
    age,40,3
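Since the widths are already known in R, the schema file does not need to be typed by hand. A minimal sketch, assuming the zero-based start positions used in the example above; the column names, widths, and output file name here are placeholders, not the real layout of TS_MATRICULA_RS.txt:

```r
## Illustrative widths and column names (assumptions, not the real layout)
widths <- c(30, 10, 3)
schema <- data.frame(
  column = c("name", "birthday", "age"),
  start  = cumsum(widths) - widths,   # zero-based start of each field
  length = widths
)

## Write the schema as a plain CSV (the file name is hypothetical)
schema_file <- file.path(tempdir(), "schemafile.csv")
write.csv(schema, schema_file, row.names = FALSE, quote = FALSE)
```

With the real data, `widths` would be the vector already passed to read.fwf, and `column` could be generated with something like `paste0("V", seq_along(widths))`.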

Once you have created that file, you should be able to use something like:

in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt > TS_MATRICULA_RS.csv

From there, I would suggest looking into reading the data with fread from "data.table" or using sqldf.

Option 2: `sqldf` using `substr`

Using sqldf on a large-ish data file like yours should actually be pretty quick, and you get the benefit of being able to specify exactly what you want to read in using substr.

Again, this will expect that you have a schema file available, like the one described above. Once you have your schema file, you can do the following:

    temp <- read.csv("mySchemaFile.csv")

    ## Construct your "substr" command
    ## Note: SQLite's substr() counts from 1, so `start` must be 1-based here
    GetMe <- paste("select",
                   paste("substr(V1, ", temp$start, ", ",
                         temp$length, ") `", temp$column, "`",
                         sep = "", collapse = ", "),
                   "from fixed", sep = " ")

    ## Load "sqldf"
    library(sqldf)

    ## Connect to your file
    fixed <- file("TS_MATRICULA_RS.txt")
    myDF <- sqldf(GetMe, file.format = list(sep = "_"))

Since you know the widths, you might be able to skip the generation of the schema file. From the widths, it's just a little bit of work with cumsum. Here's a basic example, building on the first example from read.fwf:

    ff <- tempfile()
    cat(file = ff, "123456", "987654", sep = "\n")
    read.fwf(ff, widths = c(1, 2, 3))

    widths <- c(1, 2, 3)
    length <- cumsum(widths)
    start <- length - widths + 1
    column <- paste("V", seq_along(length), sep = "")

    GetMe <- paste("select",
                   paste("substr(V1, ", start, ", ",
                         widths, ") `", column, "`",
                         sep = "", collapse = ", "),
                   "from fixed", sep = " ")

    library(sqldf)

    ## Connect to your file
    fixed <- file(ff)
    myDF <- sqldf(GetMe, file.format = list(sep = "_"))
    myDF
    unlink(ff)
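The same start/width arithmetic can also be tried without sqldf. The sketch below is a swapped-in base R variant, not the sqldf approach itself: it reads the raw lines with readLines and cuts them with the vectorised substring. It uses the same toy data as above; the names `end`, `fields`, and `myDF2` are illustrative, and every column comes back as character.

```r
## Same toy data as in the read.fwf example
ff <- tempfile()
writeLines(c("123456", "987654"), ff)

widths <- c(1, 2, 3)
end    <- cumsum(widths)        # last position of each field (1-based)
start  <- end - widths + 1      # first position of each field (1-based)

## Read the raw lines, then cut every line at the same positions
lines  <- readLines(ff)
fields <- lapply(seq_along(widths),
                 function(i) substring(lines, start[i], end[i]))
names(fields) <- paste0("V", seq_along(widths))

## All columns are character; convert with as.integer() where needed
myDF2 <- as.data.frame(fields, stringsAsFactors = FALSE)
unlink(ff)
```

For a file of several hundred megabytes, the readLines call could be wrapped in a loop over an open connection with a fixed `n`, so that only one chunk of lines is held in memory at a time.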


answered Oct 12 '22 23:10

A5C1D2H2I1M1N2O1R2T1
