I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types that need to be unpacked and stored in separate columns of the final pyarrow table, and the keys also need to be stored as a column. I do know the schema ahead of time. I have a method below that constructs the table row by row - is there a faster way?
For context, I want to parse a large dictionary into a pyarrow table and write it out to a parquet file. RAM usage is less of a concern than CPU time. I'd prefer not to drop down to the Arrow C++ API.
import pyarrow as pa
import random
import string
import time

# Build a large dictionary whose values are (int, str) tuples
large_dict = dict()
for i in range(int(1e6)):
    large_dict[i] = (random.randint(0, 5), random.choice(string.ascii_letters))

schema = pa.schema({
    "key": pa.uint32(),
    "col1": pa.uint8(),
    "col2": pa.string()
})

start = time.time()

# Current approach: build a one-row table per dictionary entry, then concatenate
tables = []
for key, item in large_dict.items():
    val1, val2 = item
    tables.append(
        pa.Table.from_pydict({
            "key": [key],
            "col1": [val1],
            "col2": [val2]
        }, schema=schema)
    )
table = pa.concat_tables(tables)

end = time.time()
print(end - start)  # 22.6 seconds on my machine
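For completeness, the end goal is to write the resulting table out to parquet, roughly like this (the file name is just illustrative):

import pyarrow.parquet as pq

# Write the assembled table to a parquet file
pq.write_table(table, "large_dict.parquet")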
Since the schema is known ahead of time, you can build one Python list per column and then construct the table with a single pa.Table.from_pydict call, passing a dictionary that maps column names to those lists.
%%timeit -r 10
import pyarrow as pa
import random
import string

large_dict = dict()
for i in range(int(1e6)):
    large_dict[i] = (random.randint(0, 5), random.choice(string.ascii_letters))

schema = pa.schema({
    "key": pa.uint32(),
    "col1": pa.uint8(),
    "col2": pa.string()
})

# One list per column, filled in a single pass over the dictionary
keys = []
val1 = []
val2 = []
for k, (v1, v2) in large_dict.items():
    keys.append(k)
    val1.append(v1)
    val2.append(v2)

# Build the table once from the column lists
table = pa.Table.from_pydict(
    dict(zip(schema.names, (keys, val1, val2))),
    schema=schema
)
2.92 s ± 236 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
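As a possible further refinement (not benchmarked here), the explicit loop can be replaced by letting zip() do the unpacking; whether this is actually faster depends on the data, so treat it as a sketch using the same large_dict and schema as above:

# Sketch: unpack the value tuples with zip() instead of an explicit loop
keys = list(large_dict.keys())
col1, col2 = zip(*large_dict.values())

table = pa.Table.from_pydict(
    {"key": keys, "col1": list(col1), "col2": list(col2)},
    schema=schema
)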
I am playing with pyarrow as well. In your code, the data-preparation stage (the random generation, etc.) appears to be the most time-consuming part, so it may help to first convert the data into a dict of arrays and then feed those to an Arrow Table.
Here is an example based on your data, %%timeit-ing only the table-population stage, but doing it with RecordBatch.from_arrays() and a list of three arrays.
# get_data(l0, l1_0, l2, i) returns the three column arrays for batch i
# (it is defined in the linked notebook)
I = iter(
    pa.RecordBatch.from_arrays(
        get_data(l0, l1_0, l2, i),
        schema=schema
    )
    for i in range(1000)
)
T1 = pa.Table.from_batches(I, schema=schema)
With a static data set of 1000 rows batched 1000 times, the table is populated in an impressive 15 ms :) (maybe due to caching). With the 1000 rows modified on each batch (e.g. multiplying col1 by an integer) and batched 1000 times, it takes 33.3 ms, which also looks nice.
My sample notebook is here
PS: I wondered whether Numba's JIT could help, but it only seems to make the timing worse here.
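For readers without access to the notebook, here is a minimal self-contained sketch of the same batching idea applied to the question's dictionary. The names batch_size and make_batch are illustrative choices of mine, not part of the notebook, and I have not benchmarked this against the single from_pydict call above.

import pyarrow as pa

# Assumes large_dict and schema are defined as in the question
batch_size = 1000
items = list(large_dict.items())

def make_batch(chunk):
    # Unpack a slice of (key, (col1, col2)) items into three typed arrays
    keys, values = zip(*chunk)
    col1, col2 = zip(*values)
    return pa.RecordBatch.from_arrays(
        [pa.array(keys, type=pa.uint32()),
         pa.array(col1, type=pa.uint8()),
         pa.array(col2, type=pa.string())],
        schema=schema
    )

# Lazily produce one RecordBatch per slice, then assemble them into a Table
batches = (
    make_batch(items[i:i + batch_size])
    for i in range(0, len(items), batch_size)
)
table = pa.Table.from_batches(batches, schema=schema)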