How to convert Numpy to Parquet without using Pandas?

Tags: python, pandas, numpy, parquet

The traditional way to save a NumPy object to Parquet is to use Pandas as an intermediate. However, I am working with a lot of data, and it doesn't fit in Pandas without crashing my environment, because the data takes up a lot of RAM in Pandas.

I need to save to Parquet because I am working with variable-length arrays in NumPy, and for those Parquet actually takes up less disk space than .npy or .hdf5.

The following code is a minimal example that downloads a small chunk of my data, converts between Pandas and NumPy objects to measure how much RAM they consume, and saves to .npy and Parquet files to see how much disk space they take.

# Download sample file, about 10 MB

from sys import getsizeof
import requests
import pickle
import numpy as np
import pandas as pd
import os

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')

sampleDF = pd.read_pickle('sample.pkl')

sampleDF.to_parquet( 'test1.pqt', compression = 'brotli', index = False )

# Parquet file takes up little space 
os.path.getsize('test1.pqt')

6594712

getsizeof(sampleDF)

22827172

sampleDF['totalCites2'] = sampleDF['totalCites2'].apply(lambda x: np.array(x))

#RAM reduced if the variable length batches are in numpy
getsizeof(sampleDF)

22401764

#Much less RAM as a numpy object 
sampleNumpy = sampleDF.values
getsizeof(sampleNumpy)

112

# Much more space in .npy form 
np.save( 'test2.npy', sampleNumpy) 
os.path.getsize('test2.npy')

20825382

# Numpy savez. Not as good as parquet 
np.savez_compressed( 'test3.npy', sampleNumpy )
os.path.getsize('test3.npy.npz')

9873964

asked Aug 27 '19 23:08

SantoshGupta7

2 Answers

You can read/write numpy arrays to parquet directly using Apache Arrow (pyarrow), which is also the underlying backend to parquet in pandas. Note that parquet is a tabular format, so creating some table is still necessary.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq  # the parquet submodule must be imported explicitly

np_arr = np.array([1.3, 4.22, -5], dtype=np.float32)
pa_table = pa.table({"data": np_arr})
pq.write_table(pa_table, "test.parquet")

refs: numpy to pyarrow, pyarrow.parquet.write_table

answered Sep 30 '22 23:09

TalP

Parquet files can be written with pyarrow. The correct import is:

import pyarrow.parquet as pq, so that you can call pq.write_table. If you only use import pyarrow as pa, then pa.parquet.write_table will raise: AttributeError: module 'pyarrow' has no attribute 'parquet'.

Pyarrow requires the data to be organized column-wise, which means that for a multidimensional NumPy array you need to assign each column of the array to a named field in the table.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


ndarray = np.array(
    [
        [4.96266477e05, 4.55342071e06, -1.03240000e02, -3.70000000e01, 2.15592864e01],
        [4.96258372e05, 4.55344875e06, -1.03400000e02, -3.85000000e01, 2.40120775e01],
        [4.96249387e05, 4.55347732e06, -1.03330000e02, -3.47500000e01, 2.70718535e01],
    ]
)

ndarray_table = pa.table(
    {
        "X": ndarray[:, 0],
        "Y": ndarray[:, 1],
        "Z": ndarray[:, 2],
        "Amp": ndarray[:, 3],
        "Ang": ndarray[:, 4],
    }
)

pq.write_table(ndarray_table, "ndarray.parquet")

answered Sep 30 '22 22:09

epifanio
