The traditional way to save a numpy object to Parquet is to use Pandas as an intermediate. However, I am working with a lot of data that doesn't fit in Pandas without crashing my environment, because as a Pandas object the data takes up far more RAM.
I need to save to Parquet because I am working with variable-length arrays in numpy, and for that Parquet actually takes less disk space than .npy or .hdf5.
The following code is a minimal example that downloads a small chunk of my data, converts between Pandas objects and numpy objects to measure how much RAM they consume, and saves to .npy and Parquet files to see how much disk space they take.
# Download a sample file, about 10 MB
from sys import getsizeof
import requests
import numpy as np
import pandas as pd
import os

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    # Google Drive returns a confirmation token in a cookie for large files
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')
sampleDF = pd.read_pickle('sample.pkl')

sampleDF.to_parquet('test1.pqt', compression='brotli', index=False)

# The Parquet file takes up little disk space
os.path.getsize('test1.pqt')
# 6594712

getsizeof(sampleDF)
# 22827172

sampleDF['totalCites2'] = sampleDF['totalCites2'].apply(lambda x: np.array(x))

# RAM is reduced if the variable-length batches are numpy arrays
getsizeof(sampleDF)
# 22401764

# Appears to take far less RAM as a numpy object -- but note that getsizeof
# only counts the array wrapper here, not the row data it references
sampleNumpy = sampleDF.values
getsizeof(sampleNumpy)
# 112

# Much more disk space in .npy form
np.save('test2.npy', sampleNumpy)
os.path.getsize('test2.npy')
# 20825382

# numpy savez_compressed: better, but still not as good as Parquet
# (numpy appends .npz to the given filename)
np.savez_compressed('test3.npy', sampleNumpy)
os.path.getsize('test3.npy.npz')
# 9873964
CSV to Parquet using PyArrow: internally, Pandas' to_parquet() uses the pyarrow module, so you can do the conversion from CSV to Parquet directly in pyarrow with pyarrow.parquet.write_table(). This removes one level of indirection, making it slightly more efficient.
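A minimal sketch of that direct CSV-to-Parquet path (the file names here are placeholders; pyarrow.csv.read_csv loads the CSV into an Arrow table without going through Pandas):

import pyarrow.csv as pc
import pyarrow.parquet as pq

# Read the CSV straight into an Arrow table, no Pandas involved,
# then write it back out as Parquet
table = pc.read_csv("input.csv")          # placeholder file name
pq.write_table(table, "output.parquet")   # placeholder file name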
NumPy is widely used for numerical calculations, while Pandas provides support for working with tabular data (CSV, Excel, etc.). NumPy natively supports data in the form of arrays and matrices; Pandas Series and DataFrames cannot always be fed directly into toolkits that expect plain arrays.
You can save your NumPy arrays to CSV files using the savetxt() function. It takes a filename and an array as arguments and saves the array in CSV format. You must also specify the delimiter, the character used to separate each value in the file, most commonly a comma.
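For example (a small sketch; data.csv is a placeholder name):

import numpy as np

arr = np.array([[1.5, 2.0, 3.25],
                [4.0, 5.5, 6.75]])

# delimiter=',' makes the output comma-separated (the default is a space)
np.savetxt("data.csv", arr, delimiter=",")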
You can read/write numpy arrays to Parquet directly using Apache Arrow (pyarrow), which is also the backend Pandas uses for Parquet. Note that Parquet is a tabular format, so creating some table is still necessary.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq  # the parquet submodule must be imported explicitly

np_arr = np.array([1.3, 4.22, -5], dtype=np.float32)
pa_table = pa.table({"data": np_arr})
pq.write_table(pa_table, "test.parquet")
refs: numpy to pyarrow, pyarrow.parquet.write_table
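For completeness, the column can be read back into numpy without Pandas (a minimal sketch, reading the test.parquet file written above):

import pyarrow.parquet as pq

table = pq.read_table("test.parquet")
np_arr = table.column("data").to_numpy()  # ChunkedArray back to a numpy array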
Parquet format can be written using pyarrow; the correct import syntax is import pyarrow.parquet as pq, so you can use pq.write_table. Otherwise, using import pyarrow as pa and calling pa.parquet.write_table will raise: AttributeError: module 'pyarrow' has no attribute 'parquet'.
Pyarrow requires the data to be organized column-wise, which means that in the case of multidimensional numpy arrays, you need to assign each column of the array to its own field in the Parquet table.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
ndarray = np.array(
    [
        [4.96266477e05, 4.55342071e06, -1.03240000e02, -3.70000000e01, 2.15592864e01],
        [4.96258372e05, 4.55344875e06, -1.03400000e02, -3.85000000e01, 2.40120775e01],
        [4.96249387e05, 4.55347732e06, -1.03330000e02, -3.47500000e01, 2.70718535e01],
    ]
)

# Each column of the 2-D array becomes one named field of the table
ndarray_table = pa.table(
    {
        "X": ndarray[:, 0],
        "Y": ndarray[:, 1],
        "Z": ndarray[:, 2],
        "Amp": ndarray[:, 3],
        "Ang": ndarray[:, 4],
    }
)

pq.write_table(ndarray_table, "ndarray.parquet")
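The question's variable-length rows fit this model too: Arrow has a native list type, so ragged numpy arrays can go into a single Parquet column without passing through Pandas. A minimal sketch under that assumption (the column name totalCites2 is borrowed from the question's example; the data and file name here are made up):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Made-up variable-length rows standing in for the 'totalCites2' column
ragged = [np.array([1, 2, 3]), np.array([4, 5]), np.array([6])]

# A list-typed Arrow column holds one variable-length array per row
table = pa.table({"totalCites2": pa.array([row.tolist() for row in ragged])})
pq.write_table(table, "ragged.parquet", compression="brotli")

# Reading back: each row comes out as a Python list, convertible per row
rows = pq.read_table("ragged.parquet").column("totalCites2").to_pylist()
arrays = [np.array(row) for row in rows]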