I am converting large CSV files into Parquet files for further analysis. I read the CSV data into Pandas and specify the column dtypes as follows:
_dtype = {"column_1": "float64",
"column_2": "category",
"column_3": "int64",
"column_4": "int64"}
df = pd.read_csv("data.csv", dtype=_dtype)
I then do some more data cleaning and write the data out into Parquet for downstream use.
_parquet_kwargs = {"engine": "pyarrow",
"compression": "snappy",
"index": False}
df.to_parquet("data.parquet", **_parquet_kwargs)
But when I read the data back into Pandas for further analysis using read_parquet, I cannot seem to recover the category dtypes. The following
df = pd.read_parquet("data.parquet")
results in a DataFrame with object dtypes in place of the desired category dtype.
The following seems to work as expected
import pyarrow.parquet as pq

_table = (pq.ParquetFile("data.parquet")
          .read(use_pandas_metadata=True))
df = _table.to_pandas(strings_to_categorical=True)
However, I would like to know how this can be done using pd.read_parquet.
This is fixed in Arrow 0.15; the following code now keeps the columns as categories (and performance is significantly faster):
import pandas

df = pandas.DataFrame({'foo': list('aabbcc'),
                       'bar': list('xxxyyy')}).astype('category')
df.to_parquet('my_file.parquet')
df = pandas.read_parquet('my_file.parquet')
df.dtypes
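With pyarrow 0.15 or newer, both columns should round-trip as categoricals, so df.dtypes should report something like:

foo    category
bar    category
dtype: object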