I am converting large CSV files into Parquet files for further analysis. I read the CSV data into Pandas and specify the column dtypes as follows:
_dtype = {"column_1": "float64",
"column_2": "category",
"column_3": "int64",
"column_4": "int64"}
df = pd.read_csv("data.csv", dtype=_dtype)
I then do some more data cleaning and write the data out into Parquet for downstream use.
_parquet_kwargs = {"engine": "pyarrow",
"compression": "snappy",
"index": False}
df.to_parquet("data.parquet", **_parquet_kwargs)
But when I read the data back into Pandas for further analysis using read_parquet, I cannot seem to recover the category dtypes. The following
df = pd.read_parquet("data.parquet")
results in a DataFrame with object dtypes in place of the desired category dtype.
The following seems to work as expected
import pyarrow.parquet as pq

_table = (pq.ParquetFile("data.parquet")
          .read(use_pandas_metadata=True))
df = _table.to_pandas(strings_to_categorical=True)
However, I would like to know how this can be done using pd.read_parquet.
This is fixed in Arrow 0.15; the following code now keeps the columns as categories (and performance is significantly faster):
import pandas

df = pandas.DataFrame({'foo': list('aabbcc'),
                       'bar': list('xxxyyy')}).astype('category')
df.to_parquet('my_file.parquet')
df = pandas.read_parquet('my_file.parquet')
df.dtypes
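With pyarrow 0.15 or newer, both columns should round-trip as categoricals, so df.dtypes should report something like:

foo    category
bar    category
dtype: object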