I have a problem with filetypes when converting a parquet file to a dataframe.
I do
bucket = 's3://some_bucket/test/usages'
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
read_pq = pq.ParquetDataset(bucket, filesystem=s3).read_pandas()
When I do read_pq, I get
pyarrow.Table
_COL_0: decimal(9, 0)
_COL_1: decimal(9, 0)
_COL_2: decimal(9, 0)
_COL_3: decimal(9, 0)
When I do df = read_pq.to_pandas(); df.dtypes, I get
_COL_0 object
_COL_1 object
_COL_2 object
_COL_3 object
dtype: object
The original data are all integers. When I operate on the objects in the pandas dataframe, the operations are very slow.
Should I use pd.to_numeric or similar? Can pyarrow convert decimal(9, 0) itself? Or is it best to convert on the pandas dataframe directly?
I tried read_pq.column('_COL_0').cast('int32'), which throws an error like
No cast implemented from decimal(9, 0) to int32
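Pandas is funny about integers and such. It does have integer dtypes such as int64, but the default NumPy-backed integer columns cannot hold missing values, so a column containing any NaN is stored as float64 instead. Your columns come back as object because each value is a Python decimal.Decimal, and operating on Python objects one at a time is what makes things slow.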
In this situation I would go ahead and use astype to start working with your data like this:
df['_COL_0'] = df['_COL_0'].astype(float)
If they are truly all integers then you should be able to use this simple for loop to cast all the pandas series (columns) to float values like so:
for col in df.columns:
    df[col] = df[col].astype(float)
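Here is a minimal, self-contained sketch of the same idea. It assumes the columns hold decimal.Decimal values with no missing entries, which is what to_pandas() produces for decimal(9, 0) data; the sample values are made up for illustration. If the data really are all integers, casting to int64 instead of float should also work under that assumption.

```python
from decimal import Decimal

import pandas as pd

# Stand-in for read_pq.to_pandas(): decimal(9, 0) columns arrive as
# object columns holding decimal.Decimal values (made-up sample data).
df = pd.DataFrame(
    {
        "_COL_0": [Decimal("1"), Decimal("2"), Decimal("3")],
        "_COL_1": [Decimal("10"), Decimal("20"), Decimal("30")],
    },
    dtype=object,
)

# Same effect as the loop above, applied to every column in one call.
as_float = df.astype(float)

# Assumption: no missing values, so a plain integer dtype is safe.
as_int = df.astype("int64")

print(as_float.dtypes)
print(as_int.dtypes)
```

If some values can be missing, the int64 cast will fail, and float (as above) is the safer choice.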
Let me know if this works for you. I just ran a test in my Jupyter notebook and it seemed to work out.