
Handle ASCII in R

I have some files of microdata from a population census, stored as .txt and encoded in ASCII. When I open them in a text editor I get something like: 1100015110001500100100003624008705865085282310200600101011022022 14 444231 etc.

Since I have no experience with tabulating ASCII data, I would like to know whether this can be done with R and/or what supplementary software I need.

First I would like to have a "normal" look at my data, that is, to see it as a table if possible (the file sizes vary between 40 MB and 500 MB). Then I would like to make some simple calculations and store the results as a .csv file to use in other contexts.

Can anyone give me some advice?

Joschi asked Dec 27 '22 12:12


2 Answers

this Brazilian census website provides a SAS importation script. the quickest way to import an ASCII data set when all you have is a SAS importation script is to use the SAScii package. you can find the SAS importation script inside this zipped file -- it's INPUT.txt. notice that the INPUT block of those SAS importation instructions doesn't start until the fourth line, so your beginline parameter will be 4. first check that you're reading the SAS script correctly with ?parse.SAScii

library(SAScii)
parse.SAScii( "INPUT.txt" , beginline = 4 )

once you see that that's printed the column names and widths correctly, you can use the ?read.SAScii function to directly read your text file into an R data frame

x <- read.SAScii( "filename.txt" , "INPUT.txt" , beginline = 4 )
head( x )
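for intuition about what the SAS script contributes: it is essentially a list of column names and widths, which is the same information base R's read.fwf() needs. a minimal sketch with a made-up three-column layout (the names uf, age and income and their widths are hypothetical, not the real census layout):

```r
# Two made-up fixed-width records: 2-char state code, 3-char age, 5-char income
lines <- c("1103401200",
           "3502800950")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Hypothetical layout; for the real files this comes from the SAS script
layout <- data.frame(varname = c("uf", "age", "income"),
                     width   = c(2, 3, 5),
                     stringsAsFactors = FALSE)

# Split each line at the given widths and name the resulting columns
toy <- read.fwf(tmp, widths = layout$width, col.names = layout$varname)
print(toy)
```

read.fwf() is convenient for a quick look but tends to be slow on files in the hundreds of megabytes.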

if your file is too big to read entirely into RAM, you can instead read it into a SQLite database. use the read.SAScii.sqlite() function, which is not in the SAScii package but in my github account here -- it's just a slight variation of the read.SAScii() function, but it doesn't overload RAM. you can see an example of its usage in the download script on this United States government survey data set website.
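if a database is not an option, another way to keep memory use low is to process the file in chunks. the sketch below is not read.SAScii.sqlite(); it is a base R illustration, with a made-up two-column layout, that reads a few lines at a time from a connection and appends each parsed chunk to a CSV file:

```r
# Made-up fixed-width file: 2-char state code followed by 3-char age
src <- tempfile(fileext = ".txt")
writeLines(c("11034", "35028", "11051", "35007", "11019"), src)
out <- tempfile(fileext = ".csv")

starts <- c(1, 3)   # hypothetical first character of each field
ends   <- c(2, 5)   # hypothetical last character of each field
chunk_size <- 2     # tiny for the demo; something like 100000 for real files

con <- file(src, open = "r")
first <- TRUE
repeat {
  chunk <- readLines(con, n = chunk_size)
  if (length(chunk) == 0) break
  parsed <- data.frame(uf  = as.integer(substr(chunk, starts[1], ends[1])),
                       age = as.integer(substr(chunk, starts[2], ends[2])))
  # Write the header only with the first chunk, then append
  write.table(parsed, out, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  first <- FALSE
}
close(con)

result <- read.csv(out)
```

only one chunk is held in memory at a time, so the chunk size rather than the file size bounds RAM use.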

for more detail about the SAScii package, check out this overview
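
once the data is in a data frame, the simple calculations and CSV export mentioned in the question need only base R. a sketch on a toy data frame standing in for x (the column names uf and income are hypothetical):

```r
# Toy stand-in for the data frame returned by read.SAScii()
x <- data.frame(uf     = c(11, 11, 35, 35),
                income = c(1200, 800, 950, 1050))

# Mean income per state code
means <- aggregate(income ~ uf, data = x, FUN = mean)

# Store the result as a CSV file for use elsewhere
csv_path <- file.path(tempdir(), "mean_income.csv")
write.csv(means, csv_path, row.names = FALSE)
```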

Anthony Damico answered Jan 10 '23 18:01


A good alternative is the readr package, a very fast way to read fixed-width data. More info on readr here.

So instead of read.SAScii, you can use a faster option based on readr, like this:

# Load packages
  library(readr)
  library(SAScii)
  library(data.table)


# Parse the SAS input script into a dictionary of column names and widths
  dic_pes2013 <- parse.SAScii("INPUT.txt")

  setDT(dic_pes2013) # convert to data.table

# Read the fixed-width file into a data frame
# (the dput() calls were dropped: they only printed the vectors to the console)
  pesdata2 <- read_fwf("./Dados/PES2013.txt",
                       fwf_widths(dic_pes2013[, width],
                                  col_names = dic_pes2013[, varname]),
                       progress = interactive())

I've just read 2.4 million records with 243 variables in 1.2 minutes (file Amostra_Pessoas_35_outras.txt).

P.S. If you don't have the INPUT.txt files, here is a short script on how to create them.

Note that some variables have decimals, which the solutions in the answers posted here do not account for (at least so far). To take this into account, I would recommend this R script here, which will help you download the 2010 Brazilian Census data sets, read them into data frames and save them as .csv files.
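
To illustrate what handling decimals involves: fixed-width files usually store such values without a decimal point, so each affected column has to be rescaled after reading. A minimal sketch, assuming a dictionary with one divisor per variable (as far as I know, parse.SAScii() reports one in a divisor column, but the dictionary and values below are made up):

```r
# Hypothetical dictionary: income is stored with 2 implied decimal places
dic <- data.frame(varname = c("age", "income"),
                  divisor = c(1, 0.01),
                  stringsAsFactors = FALSE)

# Raw values as read from the fixed-width file (no decimal point in the text)
raw <- data.frame(age = c(34, 28), income = c(120050, 95025))

# Rescale each column by the divisor listed for it in the dictionary
scaled <- raw
for (i in seq_len(nrow(dic))) {
  v <- dic$varname[i]
  scaled[[v]] <- raw[[v]] * dic$divisor[i]
}
print(scaled)
```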

rafa.pereira answered Jan 10 '23 18:01