I have some microdata files from a population census stored as .txt and coded in ASCII.
When I open them in a text editor I get something like:
1100015110001500100100003624008705865085282310200600101011022022 14 444231
etc.
Since I have no experience with tabulating ASCII data, I would like to know whether this can be done with R and/or what kind of supplementary software I need.
At first I would simply like a "normal" look at my data, that is, to view it as a table if possible (the file sizes vary between 40 MB and 500 MB). Then I would like to make some simple calculations and store the results as a .csv file to use in other contexts.
Can anyone give me some advice?
This Brazilian census website provides a SAS importation script. The quickest way to import an ASCII data set when all you have is a SAS importation script is the SAScii package. You can find the SAS importation script inside the zipped file -- it's INPUT.txt. Notice that the INPUT block of those SAS importation instructions doesn't start until the fourth line, so your beginline parameter will be 4. First test that you're reading the SAS script correctly with ?parse.SAScii:
library(SAScii)
parse.SAScii( "INPUT.txt" , beginline = 4 )
Once you see that it has printed the column names and widths correctly, you can use the ?read.SAScii function to read your text file directly into an R data frame:
x <- read.SAScii( "filename.txt" , "INPUT.txt" , beginline = 4 )
head( x )
If your file is too big to read entirely into RAM, you can instead read it into a SQLite database. Use the read.SAScii.sqlite() function, found not in the SAScii package but in my GitHub account here -- it's just a slight variation of the read.SAScii() function, but it doesn't overload RAM. You can see an example of its usage in the download script on this United States government survey data set website.
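For illustration only, a call might look like the sketch below; the tablename and dbname argument names are my assumptions based on the function being described as a read.SAScii() variant that writes to a SQLite database, so check the GitHub source for the actual signature.
library(SAScii)
library(RSQLite)
# hypothetical usage -- the tablename and dbname arguments are assumptions:
# read the fixed-width file into a table of an on-disk SQLite database
read.SAScii.sqlite(
	"filename.txt" ,
	"INPUT.txt" ,
	beginline = 4 ,
	tablename = "census" ,
	dbname = "census.db"
)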
For more detail about the SAScii package, check out this overview.
A good alternative is the readr package, an extremely fast solution for reading fixed-column-width data. More info on readr here.
So instead of read.SAScii, you can use a faster option based on readr, like this:
# Load packages
library(readr)
library(SAScii)
library(data.table)

# Parse the SAS importation script to get column names and widths
dic_pes2013 <- parse.SAScii("INPUT.txt")
setDT(dic_pes2013) # convert to data.table

# Read the fixed-width file into a data frame
pesdata2 <- read_fwf("./Dados/PES2013.txt",
                     fwf_widths(dic_pes2013[, width],
                                col_names = dic_pes2013[, varname]),
                     progress = interactive())
I've just read 2.4 million records with 243 variables in 1.2 minutes (file Amostra_Pessoas_35_outras.txt).
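Since the original question also mentions storing results as a .csv, here is a minimal follow-up sketch (the output file name is just a placeholder) using data.table::fwrite, which is already loaded above:
# a minimal sketch -- the output file name is a placeholder
setDT(pesdata2)                  # convert the readr tibble to a data.table by reference
fwrite(pesdata2, "PES2013.csv")  # fast csv export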
P.S. If you don't have the input.txt files, here is a short script on how to create them.
Note that some variables have implied decimals, something not handled by the solutions in the answers posted here (at least so far). To take this into account, I would recommend this R script here, which will help you download the 2010 Brazilian Census data sets, read them into data frames and save them as .csv files.
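If you prefer to stay with the read_fwf approach from the earlier answer, one way to apply the decimals yourself is the sketch below; it assumes the pesdata2 data frame from that answer, and that the divisor column returned by parse.SAScii() holds the multiplier for implied decimals (e.g. 0.01 for two decimal places), as far as I can tell from how read.SAScii() handles them.
# a sketch: restore implied decimals using parse.SAScii()'s divisor column
dic <- parse.SAScii("INPUT.txt", beginline = 4)
# pick numeric columns whose divisor is not 1 (NA varnames are gap fillers)
decimals <- which(!is.na(dic$varname) & !dic$char & dic$divisor != 1)
for (i in decimals) {
	v <- dic$varname[i]
	pesdata2[[v]] <- as.numeric(pesdata2[[v]]) * dic$divisor[i]
}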