 

Fastest way to construct pyarrow table row by row

I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in the final pyarrow table. I do know the schema ahead of time. The keys also need to be stored as a column. I have a method below to construct the table row by row - is there another method that is faster? For context, I want to parse a large dictionary into a pyarrow table to write out to a parquet file. RAM usage is less of a concern than CPU time. I'd prefer not to drop down to the arrow C++ API.

import pyarrow as pa
import random
import string 
import time

large_dict = dict()

for i in range(int(1e6)):
    large_dict[i] = (random.randint(0, 5), random.choice(string.ascii_letters))


schema = pa.schema({
        "key"  : pa.uint32(),
        "col1" : pa.uint8(),
        "col2" : pa.string()
   })

start = time.time()

tables = []
for key, item in large_dict.items():
    val1, val2 = item
    tables.append(
        pa.Table.from_pydict({
            "key"  : [key],
            "col1" : [val1],
            "col2" : [val2]
        }, schema = schema)
    )

table = pa.concat_tables(tables)
end = time.time()
print(end - start) # 22.6 seconds on my machine

Josh W. asked Sep 14 '19


2 Answers

Since the schema is known ahead of time, you can build a list for each column and then construct a dictionary mapping each column name to its values.

%%timeit -r 10
import pyarrow as pa
import random
import string 
import time

large_dict = dict()

for i in range(int(1e6)):
    large_dict[i] = (random.randint(0, 5), random.choice(string.ascii_letters))


schema = pa.schema({
        "key"  : pa.uint32(),
        "col1" : pa.uint8(),
        "col2" : pa.string()
  })

keys = []
val1 = []
val2 = []
for k, (v1, v2) in large_dict.items():
    keys.append(k)
    val1.append(v1)
    val2.append(v2)

table = pa.Table.from_pydict(
    dict(
        zip(schema.names, (keys, val1, val2))
    ),
    schema=schema
)

2.92 s ± 236 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
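
Since the stated goal in the question is a parquet file, the assembled table can then be written out with pyarrow.parquet. A minimal sketch (the output path out.parquet is just a placeholder):

import pyarrow.parquet as pq

# Write the assembled table to a parquet file; "out.parquet" is a placeholder path.
pq.write_table(table, "out.parquet")

If memory ever does become a concern, pyarrow.parquet.ParquetWriter can also write the data as several smaller tables instead of one large one.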

Oluwafemi Sule answered Nov 14 '22


I have been playing with pyarrow as well. In your code, the data-preparation stage (random generation, etc.) seems to be the most time-consuming part, so it may help to first convert the data into a dict of arrays and then feed those to the Arrow Table.

Below is an example based on your data, %%timeit-ing only the table population stage, but doing it with RecordBatch.from_arrays() and an array of three arrays.

# get_data (defined in the sample notebook linked below) builds the three
# column arrays (keys, col1 values, col2 values) for batch i.
batches = (
    pa.RecordBatch.from_arrays(get_data(l0, l1_0, l2, i), schema=schema)
    for i in range(1000)
)

T1 = pa.Table.from_batches(batches, schema=schema)

With a static data set of 1000 rows batched 1000 times, the table is populated in an incredible 15 ms :) (maybe due to caching). With the 1000 rows modified each batch (col1 multiplied by an integer), it takes 33.3 ms, which also looks nice.

My sample notebook is here

PS. I was wondering whether numba's jit could be helpful, but it only seems to make the timing worse here.
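
Since get_data is only defined in the linked notebook, here is a minimal self-contained sketch of the same batch-by-batch approach applied to the question's data; the batch size of 1000 and the to_batch/batches helper names are assumptions for illustration, not part of the original answer.

import random
import string

import pyarrow as pa

# Same kind of data as in the question.
large_dict = {i: (random.randint(0, 5), random.choice(string.ascii_letters))
              for i in range(int(1e6))}

schema = pa.schema({
    "key"  : pa.uint32(),
    "col1" : pa.uint8(),
    "col2" : pa.string()
})

def to_batch(keys, col1, col2):
    # Build one RecordBatch from the three column buffers.
    return pa.RecordBatch.from_arrays(
        [pa.array(keys, type=pa.uint32()),
         pa.array(col1, type=pa.uint8()),
         pa.array(col2, type=pa.string())],
        schema=schema)

def batches(items, batch_size=1000):
    # Yield RecordBatches of at most batch_size rows from (key, (v1, v2)) pairs.
    keys, col1, col2 = [], [], []
    for key, (v1, v2) in items:
        keys.append(key)
        col1.append(v1)
        col2.append(v2)
        if len(keys) == batch_size:
            yield to_batch(keys, col1, col2)
            keys, col1, col2 = [], [], []
    if keys:  # flush the final partial batch
        yield to_batch(keys, col1, col2)

table = pa.Table.from_batches(batches(large_dict.items()), schema=schema)

Building fixed-size RecordBatches keeps the Python-list buffers small and lets Table.from_batches assemble the columns directly, instead of creating a one-row Table per entry as in the original code.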

Dima Fomin answered Nov 14 '22