
Pandas read_fwf not Loading Entire Content of File

I have a rather large fixed-width file (~30M rows, 4 GB). When I tried to create a DataFrame from it with pandas read_fwf(), only a portion of the file was loaded. Has anyone had a similar issue with this parser not reading the entire contents of a file?

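import pandas as pd

file_name = r"C:\....\file.txt"
fwidths = [3, 7, 9, 11, 51, 51]

# read_fwf must be called through the pd namespace, and column names must be strings
df = pd.read_fwf(file_name, widths=fwidths,
                 names=['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])
print(df.shape)  # fewer than 30M rows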

If I naively read the file into 1 column using read_csv(), all of the file is read to memory and there is no data loss.

import pandas as pd

file_name = r"C:\....\file.txt"

# arbitrary delimiter (the file doesn't include pipes), so each line lands in one column
df = pd.read_csv(file_name, delimiter="|", names=['col0'])
print(df.shape)  # ~30M rows
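
Since the single-column read does pick up every row, one possible workaround is to slice that column into fields by position. The following is only a minimal sketch under stated assumptions: `split_fixed_width` is a hypothetical helper (not part of pandas), the sample data is invented, and `quoting=csv.QUOTE_NONE` is an extra setting assumed here so that quote characters are kept as plain text.

```python
import csv
import io

import pandas as pd


def split_fixed_width(lines, widths, names):
    """Slice a Series of raw text lines into fixed-width string columns."""
    columns = {}
    start = 0
    for name, width in zip(names, widths):
        # Take characters [start, start + width) and drop the padding.
        columns[name] = lines.str.slice(start, start + width).str.strip()
        start += width
    return pd.DataFrame(columns, columns=names)


# Tiny in-memory stand-in for the real file (hypothetical data).
sample = io.StringIO("abc   1234\nde    5678\n")

raw = pd.read_csv(
    sample,
    sep="|",                 # a character assumed absent from the file
    names=["raw"],
    dtype=str,
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)

demo = split_fixed_width(raw["raw"], widths=[3, 7], names=["col0", "col1"])
print(demo.shape)
```

All resulting columns are strings, so numeric fields would still need an explicit conversion (for example with `pd.to_numeric`) afterwards.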

Of course, without seeing the contents or format of the file, this could be related to something on my end, but I wanted to see if anyone else has run into this in the past. As a sanity check I tested a couple of rows deep in the file, and they all seem to be formatted correctly (further verified when I was able to pull the file into an Oracle DB with Talend using the same specs).
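
Spot-checking a few rows can miss an isolated bad line, so a full pass over the file may be worth the time. The sketch below counts every line and reports the first few whose length differs from the expected record width (for the widths above that would be 3+7+9+11+51+51 = 132). `scan_line_lengths` is a hypothetical helper, the demo file is invented, and lengths are measured in bytes, which assumes a single-byte encoding.

```python
import os
import tempfile


def scan_line_lengths(path, expected_width, max_report=5):
    """Return (total_lines, offenders).

    offenders holds up to max_report (line_number, length) pairs for
    lines whose length differs from expected_width.
    """
    total = 0
    offenders = []
    # Binary mode: no newline translation and no decoding step.
    with open(path, "rb") as f:
        for total, line in enumerate(f, start=1):
            length = len(line.rstrip(b"\r\n"))
            if length != expected_width and len(offenders) < max_report:
                offenders.append((total, length))
    return total, offenders


# Hypothetical three-line demo file; the second line is too short.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("aaaaa\nbbb\nccccc\n")
    demo_path = tmp.name

line_count, bad_lines = scan_line_lengths(demo_path, expected_width=5)
os.remove(demo_path)
print(line_count, bad_lines)
```

If the line count matches the single-column read but differs from the read_fwf row count, the reported line numbers are a reasonable place to start looking.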

Let me know if anyone has any ideas. It would be great to run everything via Python and not go back and forth when I begin to develop analytics.

eroma934 asked Dec 11 '14 05:12


1 Answer

A few lines of the input file would help to show what the data looks like. Nevertheless, I generated a random file in what I think is a similar format to yours and ran pd.read_fwf on it. This is the code that generates the file and reads it back:

from random import random

import pandas as pd


file_name = r"/tmp/file.txt"

lines_no = int(30e6)

with open(file_name, 'w') as f:
    for i in range(lines_no):
        if i%int(1e5) == 0:
            print("Writing progress: {:0.1f}%"
                    .format(float(i) / float(lines_no)*100), end='\r')
        f.write(" ".join(["{:<10.8f}".format(random()*10) for v in range(6)])+"\n")


print("File created. Now read it using pd.read_fwf ...")

fwidths = [11,11,11,11,11,11]

df = pd.read_fwf(file_name, widths = fwidths,
               names = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])


#print(df)

print(df.shape)  # (30000000, 6) in this test

So in this case it seems to work fine. I used Python 3.4, Ubuntu 14.04 x64 and pandas 0.15.1. Creating the file and reading it with pd.read_fwf takes a while, but it works, at least on my setup.

The result is: (30000000, 6)

Example file created:

7.83905215 9.64128377 9.64105762 8.25477816 7.31239330 2.23281189
8.55574419 9.08541874 9.43144800 5.18010536 9.06135038 2.02270145
7.09596172 7.17842495 9.95050576 4.98381816 1.36314390 5.47905083
6.63270922 4.42571036 2.54911162 4.81059164 2.31962024 0.85531626
2.01521946 6.50660619 8.85352934 0.54010559 7.28895079 7.69120905
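
Because reading the whole file in one call takes a while, it may help to read it in pieces and keep a running row count, which also shows roughly how far the parser gets. This is a minimal sketch, assuming the `chunksize` option of pd.read_fwf is available in the installed pandas version; the five-row file, the two-column layout and the chunk size are invented for the demo.

```python
import os
import tempfile

import pandas as pd

names = ["col0", "col1"]
widths = [11, 10]  # 10-character value plus separator, then the last value

# Small stand-in for the generated file (5 rows, hypothetical values).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(5):
        tmp.write("{:<10.8f} {:<10.8f}\n".format(i + 0.5, i + 0.25))
    path = tmp.name

total_rows = 0
# With chunksize set, read_fwf yields DataFrames of at most that many rows.
for chunk in pd.read_fwf(path, widths=widths, names=names,
                         header=None, chunksize=2):
    total_rows += len(chunk)

os.remove(path)
print(total_rows)
```

For a 30M-row file a much larger chunk size (for example 1e6 rows) would be more practical; the total can then be compared with the line count of the file.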
Marcin answered Oct 18 '22 20:10