Datatype issue when converting Parquet data to a pandas DataFrame

I have a problem with datatypes when converting a Parquet file to a DataFrame.

I do

bucket = 's3://some_bucket/test/usages'

import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()

read_pq = pq.ParquetDataset(bucket, filesystem=s3).read_pandas()

When I do read_pq, I get

pyarrow.Table
_COL_0: decimal(9, 0)
_COL_1: decimal(9, 0)
_COL_2: decimal(9, 0)
_COL_3: decimal(9, 0)

When I do df = read_pq.to_pandas(); df.dtypes, I get

_COL_0    object
_COL_1    object
_COL_2    object
_COL_3    object
dtype: object

The original data are all integers. When I operate on the objects in the pandas dataframe, the operations are very slow.

  • How can I convert the parquet columns to a format that will be read as an int or as a float in pandas?
  • Or is it best to operate on the pandas dataframe as above and use pd.to_numeric or similar?
  • Or is there an issue with the original data format, decimal(9, 0)?

Or is it best to convert on the pandas dataframe directly?

I tried read_pq.column('_COL_0').cast('int32'), but it throws an error like

No cast implemented from decimal(9, 0) to int32

clog14 asked Feb 25 '19 12:02


1 Answer

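Pandas is funny about integers and such. It does have separate int and float dtypes, but an integer column that contains missing values is upcast to float by default, and pyarrow converts decimal columns into Python decimal.Decimal objects, which pandas can only store with dtype object. That object dtype is why your operations are slow.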

In this situation I would go ahead and use astype to start working with your data like this:

df['_COL_0'] = df['_COL_0'].astype(float)

If they are truly all integers, then you should be able to use this simple for loop to cast all the pandas Series (columns) to float values like so:

for col in df.columns:
    df[col] = df[col].astype(float)
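
If you want integers rather than floats, here is a minimal sketch of one way to get there. It does not read from S3: the small frame below is a made-up stand-in for what to_pandas() gives you for decimal(9, 0) columns, and the names df_float and df_int are mine. It assumes the columns contain no nulls, since casting NaN to int64 raises.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical stand-in for the frame returned by to_pandas():
# every cell is a Python Decimal, so pandas reports dtype object.
df = pd.DataFrame(
    {
        "_COL_0": [Decimal("1"), Decimal("2"), Decimal("3")],
        "_COL_1": [Decimal("10"), Decimal("20"), Decimal("30")],
    }
)

# Cast every column in one call instead of looping over df.columns.
df_float = df.astype(float)

# decimal(9, 0) holds at most 9 digits, so each value is represented
# exactly as a float and fits in int64. Assumes there are no nulls.
df_int = df_float.astype("int64")

print(df.dtypes)
print(df_int.dtypes)
```

Going through float first is just a cautious choice on my part; once the columns are int64 or float64, arithmetic runs on numeric arrays instead of on Python objects.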

Let me know if this works for you. I just ran a test in my Jupyter notebook and it seemed to work.

git_rekt answered Nov 04 '22 03:11