 

Writing a formatted binary file from a Pandas DataFrame

I've seen some ways to read a formatted binary file into Pandas. Specifically, I'm using the code below, which reads the file with NumPy's fromfile, parsing it with a structure given by a dtype.

import numpy as np
import pandas as pd

input_file_name = 'test.hst'

dt_header = np.dtype([('version', 'i4'),
                      ('copyright', 'S64'),
                      ('symbol', 'S12'),
                      ('period', 'i4'),
                      ('digits', 'i4'),
                      ('timesign', 'i4'),
                      ('last_sync', 'i4')])

dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])

with open(input_file_name, 'rb') as input_file:
    # Parse the fixed-size header (96 bytes == dt_header.itemsize).
    # np.fromstring is deprecated, so use np.frombuffer on the raw bytes.
    header = np.frombuffer(input_file.read(dt_header.itemsize), dt_header)
    # Read the remaining fixed-width records straight from the file.
    records = np.fromfile(input_file, dt_records)

df_records = pd.DataFrame(records)
# Now, do some changes in the individual values of df_records
# and then write it back to a binary file

Now, my issue is how to write this back to a new file. I can't find any function in NumPy (or in Pandas) that lets me specify exactly how many bytes to use for each field when writing.

asked Oct 13 '14 by jbssm

1 Answer

Pandas now offers a wide variety of formats that are more stable than tofile(). tofile() is best suited to quick scratch storage, where you do not expect the file to be used on a different machine whose data may have a different endianness (big- vs. little-endian).
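That said, tofile() does answer the original question of writing the records back with their exact field widths, provided the DataFrame is first cast back to the original structured dtype. A minimal sketch, reusing df_records, dt_records, and header from the question (test_out.hst is a placeholder output name):

records_out = df_records.to_records(index=False).astype(dt_records)  # restore the exact field widths

with open('test_out.hst', 'wb') as output_file:
    output_file.write(header.tobytes())  # write the 96-byte header back unchanged
    records_out.tofile(output_file)      # then the fixed-width records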

Format Type Data Description     Reader         Writer
text        CSV                  read_csv       to_csv
text        JSON                 read_json      to_json
text        HTML                 read_html      to_html
text        Local clipboard      read_clipboard to_clipboard
binary      MS Excel             read_excel     to_excel
binary      HDF5 Format          read_hdf       to_hdf
binary      Feather Format       read_feather   to_feather
binary      Parquet Format       read_parquet   to_parquet
binary      Msgpack              read_msgpack   to_msgpack
binary      Stata                read_stata     to_stata
binary      SAS                  read_sas    
binary      Python Pickle Format read_pickle    to_pickle
SQL         SQL                  read_sql       to_sql
SQL         Google Big Query     read_gbq       to_gbq

For small to medium-sized files, I prefer CSV: properly formatted CSV can store arbitrary string data, is human-readable, and is about as simple as a format can be while still achieving those two goals.
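A minimal CSV round trip (records.csv is a placeholder name):

df_records.to_csv('records.csv', index=False)  # write without the index column
df2 = pd.read_csv('records.csv')               # read it back; dtypes are inferred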

At one time, I used HDF5, but if I were on Amazon, I would consider using Parquet.
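A minimal Parquet sketch, assuming a Parquet engine such as pyarrow is installed (records.parquet is a placeholder name):

df_records.to_parquet('records.parquet')
df2 = pd.read_parquet('records.parquet')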

Example of using to_hdf:

df.to_hdf('tmp.hdf', 'df', mode='w')
df2 = pd.read_hdf('tmp.hdf', 'df')

I no longer favor the HDF5 format. It carries serious risks for long-term archiving because it is fairly complex: the specification runs to about 150 pages, and there is only a single C implementation, roughly 300,000 lines long.

In contrast, as long as you are working exclusively in Python, the pickle format claims long-term stability:

The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.

However, pickle allows arbitrary code execution, so care should be exercised with pickles of unknown origin.
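The pickle round trip follows the same pattern (records.pkl is a placeholder name):

df_records.to_pickle('records.pkl')
df2 = pd.read_pickle('records.pkl')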

answered Nov 14 '22 by Josiah Yoder