Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading a fixed length text file in R

Tags:

r

I am currently working in SAS , but I have used R for a very long time. I have some fixed width text files to read. Those read easily in SAS but I am literally experiencing hell to do the same in R. The File looks some what like this :

    DP                  JAMES                    SILVA                    REY                                                                                                                                                             




       2014
           6
            0
                1723713652
           2
             0
DP                  ALEJANDRA                                         NARVAEZ                                                                                                                                                         




       2014
           6
            0
                1723713456
           6
             0
DP                  NANYER                                            PICHARDO                                                                                                                                                        




       2014
           6
            0
                1723713991
           1
             0
DP                  GABRIELA                 ANASI                    CASTILLO                                                                                                                                                        




       2014
           6
            0
                1723713240
           3
             0

It is not clear here , I have attached it , please find.

It reads easily in SAS using infile input.

SAS Code:

infile  "filename.txt" lrecl=32767 ;

input

@001 park_cd      $5.

@006 Title $15.

@021 first_name  $25.

@046 middle_name $25.

@071 last_name   $25.

@096 suffix      $15.

@111 ADDRESS_1   $60.

@171 ADDRESS_2   $60.

@231 ADDRESS_3   $60.

@261 CITY  $30.

@291 STATE_PROVINCE    $2.

@293  ZIP    $9.

@302 Ticket_Year  $11.

@314 product_id  $12.

@327 UNIT_PRICE  $13.

@340 PURCHASE_DT $26.

@366 PURCHASE_QTY $12.

@378 TOTAL_PURCHASE_AMT $14. ;

run;

Now to do the same in R , I have been trying various things:

1)Atfirst read.fwf , Code:

dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt", 
 widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14), 
header = FALSE, sep = "\t",fill = TRUE,
 skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
   "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
    " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
       "PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")

But it returns just NA values for most of the fields and only some values that too in wrong positions.

Head(dat1) gives output:

  park_cd           Title                first_name               middle_name
1   DP                    JAMES                     SILVA                    
2                                                                            
3                                                                        <NA>
4                    <NA>                      <NA>                      <NA>
5                                              <NA>                      <NA>
6                    2014                      <NA>                      <NA>
                  last_name          suffix
1 REY                                      
2                      <NA>            <NA>
3                      <NA>            <NA>
4                      <NA>            <NA>
5                      <NA>            <NA>
6                      <NA>            <NA>
                                                    ADDRESS_1.
1                                                             
2                                                         <NA>
3                                                         <NA>
4                                                         <NA>
5                                                         <NA>
6                                                         <NA>
                                                     ADDRESS_2 ADDRESS_3 CITY
1                                                                     NA   NA
2                                                         <NA>        NA   NA
3                                                         <NA>        NA   NA
4                                                         <NA>        NA   NA
5                                                         <NA>        NA   NA
6                                                         <NA>        NA   NA
  STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1             NA    NA          NA         NA         NA          NA           NA
2             NA    NA          NA         NA         NA          NA           NA
3             NA    NA          NA         NA         NA          NA           NA
4             NA    NA          NA         NA         NA          NA           NA
5             NA    NA          NA         NA         NA          NA           NA
6             NA    NA          NA         NA         NA          NA           NA
  TOTAL_PURCHASE_AMT.
1                  NA
2                  NA
3                  NA
4                  NA
5                  NA
6                  NA

Output:

2) Now I use the Sascii package to call the SAS code in R. Code:

sas_imp <- "input
@001 park_cd      $5.
@006 Title $15.
@021 first_name  $25.
@046 middle_name $25.
@071 last_name   $25.
@096 suffix      $15.
@111 ADDRESS_1   $60.
@171 ADDRESS_2   $60.
@231 ADDRESS_3   $60.
@261 CITY  $30.
@291 STATE_PROVINCE    $2.
@293  ZIP    $9.
@302 Ticket_Year  $11.
@314 product_id  $12.
@327 UNIT_PRICE  $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines (sas_imp , con = sas_imp.tf )
parse.SAScii( sas_imp.tf )
read.SAScii( "filename.txt" , sas_imp.tf ) 

It too gives the same useless output as above.

3) Now I use the Laf package and the laf_open_fwf command like :

library(LaF)

data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
                     column_types=rep("character",18),
  column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
                  "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
                  " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
                  "PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
                     column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))

Then I convert it like :

library(ffbase)
my.data <- laf_to_ffdf(data) 
head(as.data.frame(my.data))

But it gives output:

park_cd Title first_name             middle_name                      last_name
1      DP            JAMES                   SILVA                            REY
2                                             \r\n                           \r\n
3   JANDR     A                            NARVAEZ                               
4                     \r\n                      \r \n  \r\n         \r\n       20
5                 PICHARDO                                                       
6          \r\n            \r\n  \r\n         \r\n       2014\r\n           6\r\n
             suffix
1                  
2 \r\n         \r\n
3                  
4            14\r\n
5                  
6             0\r\n
                                                            ADDRESS_1.
1                                                                     
2         2014\r\n           6\r\n            0\r\n                172
3                                                                     
4 6\r\n            0\r\n                1723713456\r\n           6\r\n
5                                                                     
6                   1723713991\r\n           1\r\n             0\r\nDP
                                                           ADDRESS_2 ADDRESS_3  CITY
1                                                                           \r *\003
2 3713652\r\n           2\r\n             0\r\nDP                  A         L *\003
3                                                               \r\n           *\003
4                                    0\r\nDP                  NANYER           *\003
5                                                               \r\n           *\003
6                                     GABRIELA                 ANASI           *\003
  STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
2             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
3             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
4             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
5             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
6             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
  TOTAL_PURCHASE_AMT.
1                \001
2                \001
3                \001
4                \001
5                \001
6                \001

4) Lastly read.table.ffdf like

library(ff) 
library(stringr) 
my.data1  <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt", 
                            FUN="read.fwf", 
                            widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14), 
                            header=F, VERBOSE=TRUE, 
                            col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
                                          "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
                                          " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
                                          "PURCHASE_QTY","TOTAL_PURCHASE_AMT "), 
                            fileEncoding = "UTF-8", 
                            transFUN=function(x){ 
                              z <- sapply(x, function(y) { 
                                y <- str_trim(y) 
                                y[y==""] <- NA 
                                factor(y)}) 
                              as.data.frame(z) 
                            } )

But result is same. The last solution I found in this page [http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html][1].

What am I doing wrong , am I putting the widths wrong ? Or I am getting the ideas wrong altogether? I have worked with many things in R and just cant believe that a so easy thing in SAS is so tough in R. I must be missing some easy things. If you have any idea on these type please help me folks .Thanks in advance.

like image 932
Sayak Avatar asked Oct 14 '25 10:10

Sayak


1 Answers

Update

Please see here for what I use nowadays for this problem:

Faster way to read fixed-width files

for posterity, original answer retained below as a how-to guide for bootstrapping solutions while desperate 😅


Here's the FW -> .csv converter I created in Python to destroy these awful files:

It also includes the checkLength function that can help get at what @RobertLong suggested, which is that your underlying file might be faulty. If that's the case, you may be in trouble if it's pervasive & unpredictable (i.e. there are no consistent mistake patterns in your file that you can ctrl+H to fix).

Please note dictfile must be formatted correctly (I wrote this for myself, not necessarily to be as robust as possible)

import os
import csv
#Set correct directory
os.chdir('/home/michael/...') #match format of your OS

def checkLength(ffile):
    """
    Used to check that all lines in file have the same length (and so don't cause any issues below)
    """
    with open(ffile,'r') as ff:
        firstrow=1
        troubles=0
        for rows in ff:
            if firstrow:
                length=len(rows)
                firstrow=0
            elif len(rows) != length:
                print rows
                print len(rows)
                troubles=1
    return troubles

def fixed2csv(infile,outfile,dictfile):
    """
    This function takes a file name for a fixed-width dataset as input and 
    converts it to .csv format according to slices and column names specified in dictfile

    Parameters
    ==========
        infile: string of input file name from which fixed-width data is to be read
                e.g. 'fixed_width.dat'
        outfile: string of output file name to which comma-separated data is to be saved
                 e.g. 'comma_separated.csv'
        dictfile: .csv-formatted dictionary file name from which to read the following:
                     * widths: field widths
                     * column names: names of columns to be written to the output .csv
                     * types: object types (character, integer, etc)
                  column order must be: col_names,slices,types
    """
    with open(dictfile,'r') as dictf:
        fieldnames = ("col_names","widths","types") #types used in R later
        ddict = csv.DictReader(dictf,fieldnames)
        slices=[]
        colNames=[]
        wwidths=[]
        for rows in ddict:
            wwidths.append(int(rows['widths'])) #Python 0-based, must subtract 1
            colNames.append(rows['col_names'])
        offset = 0
        for w in wwidths:
            slices.append(slice(offset,offset+w))
            offset+=w
    with open(infile,'r') as fixedf:
        with open(outfile,'w') as csvf:
            csvfile=csv.writer(csvf)
            csvfile.writerow(colNames)
            for rows in fixedf:
                csvfile.writerow([rows[s] for s in slices])

Good luck, and curses be on whoever it is that is proliferating these FW format data files.

like image 165
MichaelChirico Avatar answered Oct 17 '25 01:10

MichaelChirico