Reading fixed-width text file from zipfiles into Pandas dataframe

I'm trying to read text files into Pandas dataframes from inside a zipped archive. The files are formatted like this:

System Time       hh:mm:ss           PPS     Zsec(sec)         Hex Message

Yr=17  Mn= 3 Dy= 3

19:22:59.894      19:22:16        52         69736        7E 32 02 4F 02 00 0C 7F 97 68 10 01 00 11 03 03 13 16 10 34 00 00 00 05 02 00 80 00 83 B1 7E
19:24:12.130      19:23:10       106         69790        7E 32 02 4F 02 00 0C 7F 97 9E 10 01 00 11 03 03 13 17 0A 6A 00 00 00 05 12 00 BA 00 47 DF 7E
19:24:13.241      19:23:11       107         69791        7E 32 02 4F 02 00 0C 7F 97 9F 10 01 00 11 03 03 13 17 0B 6B 00 00 00 05 05 00 BC 00 F3 AC 7E

If the file is extracted outside the archive, I can read it:

data = '../data/test1/heartbeat.txt'
df = pd.read_csv(data, sep=r'\s{2,}', engine='python', skiprows=4, encoding='utf8',
                 names=['System Time','hh:mm:ss','PPS','Zsec(sec)', 'Hex Message'])

But that approach fails if I try to access it inside the zipfile:

zf = zipfile.ZipFile('../data.zip', 'r')
data = zf.open('data/test1/heartbeat.txt')
df = pd.read_csv(data, sep=r'\s{2,}', engine='python', skiprows=4, encoding='utf8',
                 names=['System Time','hh:mm:ss','PPS','Zsec(sec)', 'Hex Message'])

I see TypeError: cannot use a string pattern on a bytes-like object

If I use delim_whitespace instead of \s{2,} it reads the file. So it seems like I'm using zipfile successfully. However, the 'Hex Message' column contains single spaces, which get broken into many columns in the dataframe.

I've also tried using fixed-width column reading, read_fwf, which also works with the extracted file:

data = '../data/test1/heartbeat.txt'
widths = [13,14,10,13,100]
df = pd.read_fwf(data,widths=widths,skiprows=4,
                 names = ['System Time', 'hh:mm:ss', 'PPS', 'Zsec(sec)','Hex Message'])

But that also fails when the file is inside the zip archive: TypeError: a bytes-like object is required, not 'str'

I'm not sure how to translate these bytes-like objects from the zipfile into something the Pandas reader can parse.

asked Mar 04 '17 16:03 by nlsn


1 Answer

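This works for me: read the member's raw bytes from the archive, decode them to a string, and wrap the result in io.StringIO so that Pandas receives text instead of a bytes-like object:

import io
import zipfile
import pandas as pd

zf = zipfile.ZipFile('../data.zip', 'r')
data = io.StringIO(zf.read('data/test1/heartbeat.txt').decode('utf_8'))
df = pd.read_csv(data, sep=r'\s{2,}', engine='python', skiprows=4, encoding='utf8',
                 names=['System Time','hh:mm:ss','PPS','Zsec(sec)', 'Hex Message'])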
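
A possible variation, offered as a sketch and not as a tested replacement: instead of reading and decoding the whole member up front, the binary stream returned by ZipFile.open can be wrapped in io.TextIOWrapper, which decodes it as it is read. The helper name open_zip_member_as_text, the in-memory archive, and the shortened sample rows are all made up for illustration; only the standard library is used, so the block runs without any files on disk.

```python
import io
import zipfile


def open_zip_member_as_text(zf, member, encoding="utf-8"):
    """Return a text-mode stream for one member of an open ZipFile.

    Hypothetical helper: ZipFile.open() yields a binary stream, and
    io.TextIOWrapper decodes it to str lazily as it is consumed.
    """
    return io.TextIOWrapper(zf.open(member), encoding=encoding)


# Build a tiny in-memory archive (illustrative data, not the real file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "data/test1/heartbeat.txt",
        "System Time       hh:mm:ss           PPS\n"
        "19:22:59.894      19:22:16        52\n",
    )

# Re-open the archive for reading and pull the member out as text lines.
with zipfile.ZipFile(buffer, "r") as zf:
    with open_zip_member_as_text(zf, "data/test1/heartbeat.txt") as text_stream:
        lines = text_stream.read().splitlines()

# Each line is now a str, so string regex patterns can be applied to it.
print(lines[1].split())
```

Because the wrapper is an ordinary text file object, it should be usable wherever the extracted file path was passed before, for example as the first argument to pd.read_csv with sep=r'\s{2,}' or to pd.read_fwf with the widths list from the question.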
answered Sep 29 '22 07:09 by nlsn