Consider the following few lines from a Stata .dct file, which defines for Stata how to read a fixed-width ASCII file (the file can be decompressed with any ZIP software on any platform):
start type varname width description
_column(24) long rfv1 %5f Patient's Reason for Visit #1
_column(29) long rfv2 %5f Patient's Reason for Visit #2
_column(34) long rfv3 %5f Patient's Reason for Visit #3
_column(24) long rfv13d %4f Patient's Reason for Visit #1 - broad
_column(29) long rfv23d %4f Patient's Reason for Visit #2 - broad
_column(34) long rfv33d %4f Patient's Reason for Visit #3 - broad
Basically, the 24th through 38th characters in every row of this ASCII file look like this:
AAAAaBBBBbCCCCc
where the first broad code is AAAA, the narrower code for that same reason is AAAAa, and so on. In other words, because the codes themselves have a hierarchical structure, the same characters in every row are read twice to create two different variables.
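To make the layout concrete, here is a small R sketch (using a made-up 15-character slice of a row) of how the broad and narrow codes come from the same characters:

```r
## Columns 24-38 of a hypothetical record, as a single string.
row <- "AAAAaBBBBbCCCCc"

## The broad code reads the first 4 characters; the narrow code
## re-reads those same 4 characters plus a 5th.
rfv13d <- substr(row, 1, 4)  # broad code:  "AAAA"
rfv1   <- substr(row, 1, 5)  # narrow code: "AAAAa"
```

This overlap is exactly what a plain widths-based reader cannot express.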
read.fwf, by contrast, just takes a widths argument, which precludes this type of double-reading. Is there a standard way of handling this, without reinventing the wheel by scanning in the entire file and parsing it by hand?
The background here is that I'm writing a function to parse these .DCT files, in the style of SAScii, and my job would be much simpler if I could specify (start, width) pairs for every variable rather than just widths.
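For illustration, here is a minimal sketch of such a reader (the name read_fw_pairs and its interface are hypothetical, not part of SAScii or any package): it reads the raw lines and extracts each variable by its (start, width) pair, so overlapping fields pose no problem:

```r
## Hypothetical sketch: extract each variable from the raw lines by
## (start, width), allowing the same characters to feed two variables.
read_fw_pairs <- function(file, start, width, names) {
  lines <- readLines(file)
  out <- lapply(seq_along(start), function(i) {
    substring(lines, start[i], start[i] + width[i] - 1)
  })
  names(out) <- names
  as.data.frame(out, stringsAsFactors = FALSE)
}
```

Type conversion (e.g. as.numeric on the numeric columns) would still need to be applied afterwards.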
I had started working on a .DCT parser but lost steam. My intended usage scenario was simply to parse the file and create a csvkit schema file, allowing me to use csvkit to convert the file from fixed width to CSV. To that end, the package was successful, but it is very unrefined and only minimally tested.
A couple of problems to look out for: (1) not all DCT files have the same columns; and (2) some DCT files have instructions for implicit decimal places, and I never got around to coming up with a method for dealing with those types of files.
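For problem (2), one possible approach (not implemented in the package, just a sketch) would be to rescale the parsed field by the implied number of decimal places:

```r
## A field stored as "12345" with 2 implied decimal places
## represents 123.45; divide by 10^d to recover the value.
apply_implied_decimals <- function(x, d) as.numeric(x) / 10^d

apply_implied_decimals("12345", 2)  # 123.45
```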
You can find the initial work on the package here.
The main functions are:

dct.parser -- Does what you would expect. There is a "preview" argument that reads in the first few lines to let you determine whether the DCT file has all the columns you expect or not.

csvkit.schema -- Using the info extracted by dct.parser, a csv file is created with the relevant columns required by in2csv from csvkit.

csvkit.fwf2csv -- Basically a system call to csvkit. Can also be done outside of R.

For your particular example, I successfully read it using:
## The extracted data file and the DCT file are in my downloads directory
setwd("~/Downloads/")
dct.parser("ed02.dct", preview=TRUE) ## It seems that everything is there
temp <- dct.parser("ed02.dct") ## Can be used as a lookup table later
## The next line automatically creates a csv schema file in your
## working directory named, by default, "your-dct-filename.csv"
csvkit.schema(temp)
csvkit.fwf2csv(datafile = "ED02", schema="ed02.dct.csv", output="ED02.csv")
## I haven't set up any mechanism to check on progress...
## Just check the directory and see when the file stops growing :)
ED02 <- read.csv("ED02.csv")
Another alternative that I had intended to work on (but never did) was to use paste to construct substr commands that could be used by sqldf to read in data where the columns contain overlapping data. See this blog post for an example to get started.
sqldf example

As mentioned above, sqldf can make good use of the output of dct.parser and read in your data using substr. Here's an example of how you would do this:
## The extracted data file and the DCT file are in my downloads directory
setwd("~/Downloads/")
temp <- dct.parser("ed02.dct") ## Can be used as a lookup table later
## Construct your "substr" command
GetMe <- paste("select",
paste("substr(V1, ", temp$StartPos, ", ",
temp$ColWidth, ") `", temp$ColName, "`",
sep = "", collapse = ", "),
"from fixed", sep = " ")
## Load "sqldf"
library(sqldf)
fixed <- file("ED02")
ED02 <- sqldf(GetMe, file.format = list(sep = "_"))
dim(ED02)
# [1] 37337 260
As can be seen, a little modification was necessary in the sqldf line. In particular, since sqldf uses read.csv.sql, it would treat any comma characters in your data as delimiters. You can just change the separator to a character you don't expect in the data.