Reading a fixed length text file in R

Question

I am currently working in SAS , but I have used R for a very long time. I have some fixed width text files to read. Those read easily in SAS but I am literally experiencing hell to do the same in R. The File looks some what like this :

    DP                  JAMES                    SILVA                    REY                                                                                                                                                             




       2014
           6
            0
                1723713652
           2
             0
DP                  ALEJANDRA                                         NARVAEZ                                                                                                                                                         




       2014
           6
            0
                1723713456
           6
             0
DP                  NANYER                                            PICHARDO                                                                                                                                                        




       2014
           6
            0
                1723713991
           1
             0
DP                  GABRIELA                 ANASI                    CASTILLO                                                                                                                                                        




       2014
           6
            0
                1723713240
           3
             0

It is not clear here , I have attached it , please find.

It reads easily in SAS using infile input.

SAS Code:

infile  "filename.txt" lrecl=32767 ;

input

@001 park_cd      $5.

@006 Title $15.

@021 first_name  $25.

@046 middle_name $25.

@071 last_name   $25.

@096 suffix      $15.

@111 ADDRESS_1   $60.

@171 ADDRESS_2   $60.

@231 ADDRESS_3   $60.

@261 CITY  $30.

@291 STATE_PROVINCE    $2.

@293  ZIP    $9.

@302 Ticket_Year  $11.

@314 product_id  $12.

@327 UNIT_PRICE  $13.

@340 PURCHASE_DT $26.

@366 PURCHASE_QTY $12.

@378 TOTAL_PURCHASE_AMT $14. ;

run;

Now to do the same in R , I have been trying various things:

1)Atfirst read.fwf , Code:

dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt", 
 widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14), 
header = FALSE, sep = "	",fill = TRUE,
 skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
   "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
    " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
       "PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")

But it returns just NA values for most of the fields and only some values that too in wrong positions.

Head(dat1) gives output:

  park_cd           Title                first_name               middle_name
1   DP                    JAMES                     SILVA                    
2                                                                            
3                                                                        <NA>
4                    <NA>                      <NA>                      <NA>
5                                              <NA>                      <NA>
6                    2014                      <NA>                      <NA>
                  last_name          suffix
1 REY                                      
2                      <NA>            <NA>
3                      <NA>            <NA>
4                      <NA>            <NA>
5                      <NA>            <NA>
6                      <NA>            <NA>
                                                    ADDRESS_1.
1                                                             
2                                                         <NA>
3                                                         <NA>
4                                                         <NA>
5                                                         <NA>
6                                                         <NA>
                                                     ADDRESS_2 ADDRESS_3 CITY
1                                                                     NA   NA
2                                                         <NA>        NA   NA
3                                                         <NA>        NA   NA
4                                                         <NA>        NA   NA
5                                                         <NA>        NA   NA
6                                                         <NA>        NA   NA
  STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1             NA    NA          NA         NA         NA          NA           NA
2             NA    NA          NA         NA         NA          NA           NA
3             NA    NA          NA         NA         NA          NA           NA
4             NA    NA          NA         NA         NA          NA           NA
5             NA    NA          NA         NA         NA          NA           NA
6             NA    NA          NA         NA         NA          NA           NA
  TOTAL_PURCHASE_AMT.
1                  NA
2                  NA
3                  NA
4                  NA
5                  NA
6                  NA

Output:

2) Now I use the Sascii package to call the SAS code in R. Code:

sas_imp <- "input
@001 park_cd      $5.
@006 Title $15.
@021 first_name  $25.
@046 middle_name $25.
@071 last_name   $25.
@096 suffix      $15.
@111 ADDRESS_1   $60.
@171 ADDRESS_2   $60.
@231 ADDRESS_3   $60.
@261 CITY  $30.
@291 STATE_PROVINCE    $2.
@293  ZIP    $9.
@302 Ticket_Year  $11.
@314 product_id  $12.
@327 UNIT_PRICE  $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines (sas_imp , con = sas_imp.tf )
parse.SAScii( sas_imp.tf )
read.SAScii( "filename.txt" , sas_imp.tf )

It too gives the same useless output as above.

3) Now I use the Laf package and the laf_open_fwf command like :

library(LaF)

data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
                     column_types=rep("character",18),
  column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
                  "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
                  " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
                  "PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
                     column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))

Then I convert it like :

library(ffbase)
my.data <- laf_to_ffdf(data) 
head(as.data.frame(my.data))

But it gives output:

park_cd Title first_name             middle_name                      last_name
1      DP            JAMES                   SILVA                            REY
2                                             
                           

3   JANDR     A                            NARVAEZ                               
4                     
                      
 
  
         
       20
5                 PICHARDO                                                       
6          
            
  
         
       2014
           6

             suffix
1                  
2 
         

3                  
4            14

5                  
6             0

                                                            ADDRESS_1.
1                                                                     
2         2014
           6
            0
                172
3                                                                     
4 6
            0
                1723713456
           6

5                                                                     
6                   1723713991
           1
             0
DP
                                                           ADDRESS_2 ADDRESS_3  CITY
1                                                                           
 *\003
2 3713652
           2
             0
DP                  A         L *\003
3                                                               
           *\003
4                                    0
DP                  NANYER           *\003
5                                                               
           *\003
6                                     GABRIELA                 ANASI           *\003
  STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1             ÐÆ *\003       "ADDR     ,"\001      *\003          
           <N
2             ÐÆ *\003       "ADDR     ,"\001      *\003          
           <N
3             ÐÆ *\003       "ADDR     ,"\001      *\003          
           <N
4             ÐÆ *\003       "ADDR     ,"\001      *\003          
           <N
5             ÐÆ *\003       "ADDR     ,"\001      *\003          
           <N
6             ÐÆ *\003       "ADDR     ,"\001      *\003          
           <N
  TOTAL_PURCHASE_AMT.
1                \001
2                \001
3                \001
4                \001
5                \001
6                \001

4) Lastly read.table.ffdf like

library(ff) 
library(stringr) 
my.data1  <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt", 
                            FUN="read.fwf", 
                            widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14), 
                            header=F, VERBOSE=TRUE, 
                            col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
                                          "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
                                          " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
                                          "PURCHASE_QTY","TOTAL_PURCHASE_AMT "), 
                            fileEncoding = "UTF-8", 
                            transFUN=function(x){ 
                              z <- sapply(x, function(y) { 
                                y <- str_trim(y) 
                                y[y==""] <- NA 
                                factor(y)}) 
                              as.data.frame(z) 
                            } )

But result is same. The last solution I found in this page [http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html][1].

What am I doing wrong , am I putting the widths wrong ? Or I am getting the ideas wrong altogether? I have worked with many things in R and just cant believe that a so easy thing in SAS is so tough in R. I must be missing some easy things. If you have any idea on these type please help me folks .Thanks in advance.

MichaelChirico · Accepted Answer

Update

Please see here for what I use nowadays for this problem:

Faster way to read fixed-width files

for posterity, original answer retained below as a how-to guide for bootstrapping solutions while desperate 😅

Here's the FW -> .csv converter I created in Python to destroy these awful files:

It also includes the checkLength function that can help get at what @RobertLong suggested, which is that your underlying file might be faulty. If that's the case, you may be in trouble if it's pervasive & unpredictable (i.e. there are no consistent mistake patterns in your file that you can ctrl+H to fix).

Please note dictfile must be formatted correctly (I wrote this for myself, not necessarily to be as robust as possible)

import os
import csv
#Set correct directory
os.chdir('/home/michael/...') #match format of your OS

def checkLength(ffile):
    """
    Used to check that all lines in file have the same length (and so don't cause any issues below)
    """
    with open(ffile,'r') as ff:
        firstrow=1
        troubles=0
        for rows in ff:
            if firstrow:
                length=len(rows)
                firstrow=0
            elif len(rows) != length:
                print rows
                print len(rows)
                troubles=1
    return troubles

def fixed2csv(infile,outfile,dictfile):
    """
    This function takes a file name for a fixed-width dataset as input and 
    converts it to .csv format according to slices and column names specified in dictfile

    Parameters
    ==========
        infile: string of input file name from which fixed-width data is to be read
                e.g. 'fixed_width.dat'
        outfile: string of output file name to which comma-separated data is to be saved
                 e.g. 'comma_separated.csv'
        dictfile: .csv-formatted dictionary file name from which to read the following:
                     * widths: field widths
                     * column names: names of columns to be written to the output .csv
                     * types: object types (character, integer, etc)
                  column order must be: col_names,slices,types
    """
    with open(dictfile,'r') as dictf:
        fieldnames = ("col_names","widths","types") #types used in R later
        ddict = csv.DictReader(dictf,fieldnames)
        slices=[]
        colNames=[]
        wwidths=[]
        for rows in ddict:
            wwidths.append(int(rows['widths'])) #Python 0-based, must subtract 1
            colNames.append(rows['col_names'])
        offset = 0
        for w in wwidths:
            slices.append(slice(offset,offset+w))
            offset+=w
    with open(infile,'r') as fixedf:
        with open(outfile,'w') as csvf:
            csvfile=csv.writer(csvf)
            csvfile.writerow(colNames)
            for rows in fixedf:
                csvfile.writerow([rows[s] for s in slices])

Good luck, and curses be on whoever it is that is proliferating these FW format data files.

Reading a fixed length text file in R

Tags:

r

Sayak

1 Answers

Update

MichaelChirico

Recent Activity

Donate For Us

Reading a fixed length text file in R

Tags:

r

Sayak

1 Answers

Update

MichaelChirico

Related questions

Recent Activity

Donate For Us