I'm looking for fast ways to store and retrieve a numpy array using pyarrow. I'm pretty satisfied with retrieval: it takes less than 1 second to extract columns from my .arrow file that contains 1,000,000,000 integers of dtype = np.uint16.
import pyarrow as pa
import numpy as np

def write(arr, name):
    # build one Arrow array per row of the 2-D numpy array; each becomes a column
    arrays = [pa.array(col) for col in arr]
    names = [str(i) for i in range(len(arrays))]
    batch = pa.RecordBatch.from_arrays(arrays, names=names)
    with pa.OSFile(name, 'wb') as sink:
        with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
            writer.write_batch(batch)

def read(name):
    # memory-map the file and yield each column back as a numpy array
    source = pa.memory_map(name, 'r')
    table = pa.ipc.RecordBatchStreamReader(source).read_all()
    for i in range(table.num_columns):
        yield table.column(str(i)).to_numpy()
arr = np.random.randint(65535, size=(250, 4000000), dtype=np.uint16)
%%timeit -r 1 -n 1
write(arr, 'test.arrow')
>>> 25.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -r 1 -n 1
for n in read('test.arrow'): n
>>> 901 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Can the efficiency of writing to the .arrow format be improved? In addition, I tested np.save:
%%timeit -r 1 -n 1
np.save('test.npy', arr)
>>> 18.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
It looks a little bit faster. Can writing to the .arrow format with Apache Arrow be optimised further?
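For reference, one variant I have not benchmarked yet is serializing the whole 2-D array as a single Arrow Tensor instead of 250 separate columns. The helpers below are just a sketch (the names are arbitrary), and I don't know whether this ends up faster on this hardware:
def write_arrow_tensor(arr, name):
    # serialize the whole ndarray as one Arrow Tensor message instead of per-column arrays
    tensor = pa.Tensor.from_numpy(arr)
    with pa.OSFile(name, 'wb') as sink:
        pa.ipc.write_tensor(tensor, sink)

def read_arrow_tensor(name):
    # memory-map the file and reconstruct the ndarray from the Tensor message
    source = pa.memory_map(name, 'r')
    return pa.ipc.read_tensor(source).to_numpy()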
It may be the case that the performance issue is mainly due to IO/disk speed. In this case, there isn't much you can improve.
I ran a few tests on my device. The numbers I get are different from yours, but the bottom line is the same: writing is slower than reading.
The resulting file is 1.9 GB (2000023184 bytes):
$ ls test.arrow -l
-rw-rw-r-- 1 0x26res 0x26res 2000023184 Nov 15 10:01 test.arrow
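That is consistent with the raw payload: 250 columns × 4,000,000 values × 2 bytes per uint16 = 2,000,000,000 bytes, so the Arrow IPC schema/metadata only adds about 23 kB on top.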
In the code below I generate 1.9 GB of random bytes and save them, then compare that to the time it took to save with arrow:
import secrets
data = b"\x00" + secrets.token_bytes(2000023184) + b"\x00"
def write_bytes(data, name):
    with open(name, 'wb') as fp:
        fp.write(data)
%%timeit -r 1 -n 1
write_bytes(data, 'test.bytes')
>>> 2.29 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -r 1 -n 1
write(arr, 'test.arrow')
>>> 2.52 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
On my device, it takes 2.52 seconds to write the data using arrow. If I try to write the same amount of random bytes it takes 2.29 seconds. That means the overhead of arrow is about 10% of the write time, so there isn't much that can be done to speed it up.
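At these numbers the plain-bytes baseline already works out to roughly 2.0 GB / 2.29 s ≈ 0.87 GB/s, which is why the remaining room for improvement on the Arrow side is small.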
Indeed, it appears to be some kind of limitation of my RAM/IO/disk, and a very silent one... Writing slows down by a factor of 3 to 8 once arr exceeds 200M items, which is why I'm seeing the write time grow from 2.5 seconds to 20. I would be glad to know if this could be resolved in pyarrow.
def pyarrow_write_arrow_Batch(arr, name):
    arrays = [pa.array(col) for col in arr]
    names = [str(i) for i in range(len(arrays))]
    batch = pa.RecordBatch.from_arrays(arrays, names=names)
    with pa.OSFile(name, 'wb') as sink:
        with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
            writer.write_batch(batch)
%matplotlib notebook
import benchit
benchit.setparams(environ='notebook')
benchit.setparams(rep=5)
arr = np.random.randint(65535, size=(int(1e9),), dtype=np.uint16)
size = [4, 8, 12, 20, 32, 48, 64, 100, 160, 256, 400, 600, 1000]
def pwa_Batch_10000(arr, name): return pyarrow_write_arrow_Batch(arr.reshape(-1, 10000), name)
def pwa_Batch_100000(arr, name): return pyarrow_write_arrow_Batch(arr.reshape(-1, 100000), name)
def pwa_Batch_1000000(arr, name): return pyarrow_write_arrow_Batch(arr.reshape(-1, 1000000), name)
def pwa_Batch_4000000(arr, name): return pyarrow_write_arrow_Batch(arr.reshape(-1, 4000000), name)
fns = [pwa_Batch_10000, pwa_Batch_100000, pwa_Batch_1000000, pwa_Batch_4000000]
in_ = {s: (arr[:s*int(1e6)], 'test.arrow') for s in size}
t = benchit.timings(fns, in_, multivar=True, input_name='Millions of items')
t.plot(logx=True, figsize=(8,4), fontsize=10)
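One more variation that might be worth trying, sketched below on the same RecordBatchStreamWriter API used above: stream the data as several smaller record batches instead of one giant batch, so that less has to be materialized per write call. The rows_per_batch value is an arbitrary choice, and I have not verified that this avoids the RAM/IO slowdown.
def pyarrow_write_arrow_chunked(arr, name, rows_per_batch=500000):
    # same columns as before, but written as a sequence of smaller record batches
    names = [str(i) for i in range(arr.shape[0])]
    schema = pa.schema([(n, pa.uint16()) for n in names])
    with pa.OSFile(name, 'wb') as sink:
        with pa.RecordBatchStreamWriter(sink, schema) as writer:
            for start in range(0, arr.shape[1], rows_per_batch):
                chunk = arr[:, start:start + rows_per_batch]
                batch = pa.RecordBatch.from_arrays(
                    [pa.array(col) for col in chunk], schema=schema)
                writer.write_batch(batch)
The read() function from the question still works on such a file; each column simply comes back as a chunked array that to_numpy() concatenates.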