I am currently working in SAS , but I have used R for a very long time. I have some fixed width text files to read. Those read easily in SAS but I am literally experiencing hell to do the same in R. The File looks some what like this :
DP JAMES SILVA REY
2014
6
0
1723713652
2
0
DP ALEJANDRA NARVAEZ
2014
6
0
1723713456
6
0
DP NANYER PICHARDO
2014
6
0
1723713991
1
0
DP GABRIELA ANASI CASTILLO
2014
6
0
1723713240
3
0
It is not clear here , I have attached it , please find.
It reads easily in SAS using infile input.
SAS Code:
infile "filename.txt" lrecl=32767 ;
input
@001 park_cd $5.
@006 Title $15.
@021 first_name $25.
@046 middle_name $25.
@071 last_name $25.
@096 suffix $15.
@111 ADDRESS_1 $60.
@171 ADDRESS_2 $60.
@231 ADDRESS_3 $60.
@261 CITY $30.
@291 STATE_PROVINCE $2.
@293 ZIP $9.
@302 Ticket_Year $11.
@314 product_id $12.
@327 UNIT_PRICE $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;
run;
Now to do the same in R , I have been trying various things:
1)Atfirst read.fwf , Code:
dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header = FALSE, sep = "\t",fill = TRUE,
skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")
But it returns just NA values for most of the fields and only some values that too in wrong positions.
Head(dat1) gives output:
park_cd Title first_name middle_name
1 DP JAMES SILVA
2
3 <NA>
4 <NA> <NA> <NA>
5 <NA> <NA>
6 2014 <NA> <NA>
last_name suffix
1 REY
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
ADDRESS_1.
1
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
ADDRESS_2 ADDRESS_3 CITY
1 NA NA
2 <NA> NA NA
3 <NA> NA NA
4 <NA> NA NA
5 <NA> NA NA
6 <NA> NA NA
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA
TOTAL_PURCHASE_AMT.
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
Output:
2) Now I use the Sascii package to call the SAS code in R. Code:
sas_imp <- "input
@001 park_cd $5.
@006 Title $15.
@021 first_name $25.
@046 middle_name $25.
@071 last_name $25.
@096 suffix $15.
@111 ADDRESS_1 $60.
@171 ADDRESS_2 $60.
@231 ADDRESS_3 $60.
@261 CITY $30.
@291 STATE_PROVINCE $2.
@293 ZIP $9.
@302 Ticket_Year $11.
@314 product_id $12.
@327 UNIT_PRICE $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines (sas_imp , con = sas_imp.tf )
parse.SAScii( sas_imp.tf )
read.SAScii( "filename.txt" , sas_imp.tf )
It too gives the same useless output as above.
3) Now I use the Laf package and the laf_open_fwf command like :
library(LaF)
data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
column_types=rep("character",18),
column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))
Then I convert it like :
library(ffbase)
my.data <- laf_to_ffdf(data)
head(as.data.frame(my.data))
But it gives output:
park_cd Title first_name middle_name last_name
1 DP JAMES SILVA REY
2 \r\n \r\n
3 JANDR A NARVAEZ
4 \r\n \r \n \r\n \r\n 20
5 PICHARDO
6 \r\n \r\n \r\n \r\n 2014\r\n 6\r\n
suffix
1
2 \r\n \r\n
3
4 14\r\n
5
6 0\r\n
ADDRESS_1.
1
2 2014\r\n 6\r\n 0\r\n 172
3
4 6\r\n 0\r\n 1723713456\r\n 6\r\n
5
6 1723713991\r\n 1\r\n 0\r\nDP
ADDRESS_2 ADDRESS_3 CITY
1 \r *\003
2 3713652\r\n 2\r\n 0\r\nDP A L *\003
3 \r\n *\003
4 0\r\nDP NANYER *\003
5 \r\n *\003
6 GABRIELA ANASI *\003
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
2 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
3 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
4 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
5 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
6 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
TOTAL_PURCHASE_AMT.
1 \001
2 \001
3 \001
4 \001
5 \001
6 \001
4) Lastly read.table.ffdf like
library(ff)
library(stringr)
my.data1 <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
FUN="read.fwf",
widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header=F, VERBOSE=TRUE,
col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
fileEncoding = "UTF-8",
transFUN=function(x){
z <- sapply(x, function(y) {
y <- str_trim(y)
y[y==""] <- NA
factor(y)})
as.data.frame(z)
} )
But result is same. The last solution I found in this page [http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html][1].
What am I doing wrong , am I putting the widths wrong ? Or I am getting the ideas wrong altogether? I have worked with many things in R and just cant believe that a so easy thing in SAS is so tough in R. I must be missing some easy things. If you have any idea on these type please help me folks .Thanks in advance.
Please see here for what I use nowadays for this problem:
Faster way to read fixed-width files
for posterity, original answer retained below as a how-to guide for bootstrapping solutions while desperate 😅
Here's the FW -> .csv converter I created in Python to destroy these awful files:
It also includes the checkLength
function that can help get at what @RobertLong suggested, which is that your underlying file might be faulty. If that's the case, you may be in trouble if it's pervasive & unpredictable (i.e. there are no consistent mistake patterns in your file that you can ctrl+H
to fix).
Please note dictfile
must be formatted correctly (I wrote this for myself, not necessarily to be as robust as possible)
import os
import csv
#Set correct directory
os.chdir('/home/michael/...') #match format of your OS
def checkLength(ffile):
"""
Used to check that all lines in file have the same length (and so don't cause any issues below)
"""
with open(ffile,'r') as ff:
firstrow=1
troubles=0
for rows in ff:
if firstrow:
length=len(rows)
firstrow=0
elif len(rows) != length:
print rows
print len(rows)
troubles=1
return troubles
def fixed2csv(infile,outfile,dictfile):
"""
This function takes a file name for a fixed-width dataset as input and
converts it to .csv format according to slices and column names specified in dictfile
Parameters
==========
infile: string of input file name from which fixed-width data is to be read
e.g. 'fixed_width.dat'
outfile: string of output file name to which comma-separated data is to be saved
e.g. 'comma_separated.csv'
dictfile: .csv-formatted dictionary file name from which to read the following:
* widths: field widths
* column names: names of columns to be written to the output .csv
* types: object types (character, integer, etc)
column order must be: col_names,slices,types
"""
with open(dictfile,'r') as dictf:
fieldnames = ("col_names","widths","types") #types used in R later
ddict = csv.DictReader(dictf,fieldnames)
slices=[]
colNames=[]
wwidths=[]
for rows in ddict:
wwidths.append(int(rows['widths'])) #Python 0-based, must subtract 1
colNames.append(rows['col_names'])
offset = 0
for w in wwidths:
slices.append(slice(offset,offset+w))
offset+=w
with open(infile,'r') as fixedf:
with open(outfile,'w') as csvf:
csvfile=csv.writer(csvf)
csvfile.writerow(colNames)
for rows in fixedf:
csvfile.writerow([rows[s] for s in slices])
Good luck, and curses be on whoever it is that is proliferating these FW format data files.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With