I am trying to load data from a CSV file into a Parquet file using pyarrow. I use ConvertOptions to set each column to its proper data type, and the timestamp_parsers option to dictate how the timestamp data should be interpreted. Please see my "csv" below:
time,data
01-11-19 10:11:56.132,xxx
Please see my code sample below.
import pyarrow as pa
from pyarrow import csv
from pyarrow import parquet
convert_dict = {
    'time': pa.timestamp('us', None),
    'data': pa.string()
}
convert_options = csv.ConvertOptions(
    column_types=convert_dict,
    strings_can_be_null=True,
    quoted_strings_can_be_null=True,
    timestamp_parsers=['%d-%m-%y %H:%M:%S.%f']
)
table = csv.read_csv('test.csv', convert_options=convert_options)
print(table)
parquet.write_table(table, 'test.parquet')
Basically, pyarrow doesn't like some strptime values. Specifically in this case, it does not like "%f" which is for fractional seconds (https://www.geeksforgeeks.org/python-datetime-strptime-function/). Any help to get pyarrow to do what I need would be appreciated.
Just to be clear, I can get the code to run if I edit the data to remove the fractional seconds and then remove the "%f" from the timestamp_parsers option. However, I need to maintain the integrity of the data, so this is not an option. To me it seems like either a bug in pyarrow or I'm an idiot and missing something obvious. I'm open to both options; I just want to know which it is.
%f is not supported in pyarrow and most likely won't be, as it's a Python-specific flag. See the discussion here: https://issues.apache.org/jira/browse/ARROW-15883 . PRs are of course always welcome!
As a workaround, you could first read the timestamps as strings, then slice off the fractional part, parse the remainder with strptime, and add the fraction back to the parsed timestamps as a pa.duration:
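import pyarrow as pa
import pyarrow.compute as pc
ts = pa.array(["1970-01-01T00:00:59.123456789", "2000-02-29T23:23:23.999999999"], pa.string())
# Parse the first 19 characters (date and time down to whole seconds)
ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 19), format="%Y-%m-%dT%H:%M:%S", unit="ns")
# Treat the 9 digits after the decimal point as a count of nanoseconds
d = pc.utf8_slice_codeunits(ts, 20, 99).cast(pa.int64()).cast(pa.duration("ns"))
# Add the fraction back to get the full-precision timestamps
result = pc.add(ts2, d)
print(result)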