I've seen some ways to read a formatted binary file in Python into pandas. Specifically, I'm using this code, which reads the file with NumPy's fromfile using a layout given by a dtype.
import numpy as np
import pandas as pd

input_file_name = 'test.hst'

# Header layout: 4 + 64 + 12 + 4 + 4 + 4 + 4 = 96 bytes
dt_header = np.dtype([('version', 'i4'),
                      ('copyright', 'S64'),
                      ('symbol', 'S12'),
                      ('period', 'i4'),
                      ('digits', 'i4'),
                      ('timesign', 'i4'),
                      ('last_sync', 'i4')])

# One record per bar: timestamp plus OHLCV as doubles
dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])

with open(input_file_name, 'rb') as input_file:
    # np.frombuffer replaces the deprecated np.fromstring
    header = np.frombuffer(input_file.read(96), dt_header)
    records = np.fromfile(input_file, dt_records)

df_records = pd.DataFrame(records)
# Now, do some changes in the individual values of df_records
# and then write it back to a binary file
Now, my issue is how to write this back to a new file. I can't find any function in NumPy (or in pandas) that lets me specify exactly the bytes to use for each field when writing.
Pandas now offers a wide variety of formats that are more stable than tofile(). tofile() is best suited to quick scratch storage, where you do not expect the file to be read on a different machine whose endianness (big- vs. little-endian) may differ.
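For the round trip the question asks about, a minimal sketch (not from the original answer; it reuses the header, dt_records, and df_records variables defined in the question) converts the DataFrame back into a structured array and writes the raw bytes with tofile():

# Minimal sketch: cast back to the on-disk dtype, then dump raw bytes.
# Assumes `header`, `dt_records`, and `df_records` from the question above.
out_records = df_records.to_records(index=False).astype(dt_records)
with open('out.hst', 'wb') as output_file:
    output_file.write(header.tobytes())  # re-emit the original 96-byte header
    out_records.tofile(output_file)

For longer-term storage, these are the readers and writers pandas provides: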
Format Type   Data Description       Reader          Writer
text          CSV                    read_csv        to_csv
text          JSON                   read_json       to_json
text          HTML                   read_html       to_html
text          Local clipboard        read_clipboard  to_clipboard
binary        MS Excel               read_excel      to_excel
binary        HDF5 Format            read_hdf        to_hdf
binary        Feather Format         read_feather    to_feather
binary        Parquet Format         read_parquet    to_parquet
binary        Msgpack                read_msgpack    to_msgpack
binary        Stata                  read_stata      to_stata
binary        SAS                    read_sas
binary        Python Pickle Format   read_pickle     to_pickle
SQL           SQL                    read_sql        to_sql
SQL           Google Big Query       read_gbq        to_gbq
For small to medium sized files, I prefer CSV, as properly-formatted CSV can store arbitrary string data, is human readable, and is as dirt-simple as any format can be while achieving the previous two goals.
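A CSV round trip is one line each way (a minimal sketch; the file name is illustrative):

df.to_csv('tmp.csv', index=False)  # index=False keeps the row index out of the file
df2 = pd.read_csv('tmp.csv')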
At one time, I used HDF5, but if I were on Amazon, I would consider using parquet.
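A parquet round trip is just as short (a minimal sketch; it assumes a parquet engine such as pyarrow or fastparquet is installed):

df.to_parquet('tmp.parquet')          # needs pyarrow or fastparquet
df2 = pd.read_parquet('tmp.parquet')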
Example of using to_hdf:
df.to_hdf('tmp.hdf', key='df', mode='w')  # requires the PyTables package
df2 = pd.read_hdf('tmp.hdf', 'df')
I no longer favor the HDF5 format. It carries serious risks for long-term archival because it is fairly complex: it has a 150-page specification and only a single ~300,000-line C implementation.
In contrast, as long as you are working exclusively in Python, the pickle format claims long-term stability:
The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.
However, pickles allow arbitrary code execution so care should be exercised with pickles of unknown origin.
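The pandas pickle round trip is equally brief (a minimal sketch; only ever unpickle files you created or trust):

df.to_pickle('tmp.pkl')          # serialize with the pickle protocol
df2 = pd.read_pickle('tmp.pkl')  # never unpickle data of unknown origin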