
Pandas cannot read parquet files created in PySpark

I am writing a parquet file from a Spark DataFrame the following way:

df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip")

This creates a folder with multiple files in it.

When I try to read this into pandas, I get the following errors, depending on which parser I use:

import pandas as pd
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")

PyArrow:

File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status

ArrowIOError: Invalid parquet file. Corrupt footer.

fastparquet:

File "C:\Program Files\Anaconda3\lib\site-packages\fastparquet\util.py", line 38, in default_open return open(f, mode)

PermissionError: [Errno 13] Permission denied: 'path/myfile.parquet'

I am using the following versions:

  • Spark 2.4.0
  • Pandas 0.23.4
  • pyarrow 0.10.0
  • fastparquet 0.2.1

I tried gzip as well as snappy compression; neither works. I have, of course, made sure that the file is in a location where Python has permission to read and write.

It would already help if somebody were able to reproduce this error.

asked Jan 15 '19 by Thomas



3 Answers

The problem is that Spark partitions the output because of its distributed nature: each executor writes its own part file inside a directory that gets the name you passed in. Pandas does not support this out of the box; it expects a single file, not a directory.
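To make that concrete, here is a small sketch of what such a Spark output directory typically contains (the path is the one from the question; the part-file names are illustrative, the exact names vary):

import os

# "path/myfile.parquet" is a directory, not a file: Spark writes one part file
# per task plus a _SUCCESS marker.
print(sorted(os.listdir("path/myfile.parquet")))
# e.g. ['_SUCCESS', 'part-00000-....gz.parquet', 'part-00001-....gz.parquet']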

You can circumvent this issue in different ways:

  • Reading the file with an alternative utility, such as pyarrow.parquet.ParquetDataset, and then converting the result to pandas (I did not test this code).

      import pyarrow.parquet

      arrow_dataset = pyarrow.parquet.ParquetDataset('path/myfile.parquet')
      arrow_table = arrow_dataset.read()
      pandas_df = arrow_table.to_pandas()
    
  • Another way is to read the separate fragments individually and then concatenate them, as this answer suggests: Read multiple parquet files in a folder and write to single csv file using python

answered Oct 19 '22 by martinarroyo


Since this still seems to be an issue even with newer pandas versions, I wrote some functions to circumvent this as part of a larger pyspark helpers library:

import pandas as pd
import datetime
import os

def read_parquet_folder_as_pandas(path, verbosity=1):
  """Read all part files of a Spark-written parquet folder into one pandas DataFrame."""
  # Collect every part file in the folder that carries the parquet extension
  files = [f for f in os.listdir(path) if f.endswith("parquet")]

  if verbosity > 0:
    print("{} parquet files found. Beginning reading...".format(len(files)), end="")
    start = datetime.datetime.now()

  # Read each part file individually and concatenate into a single DataFrame
  df_list = [pd.read_parquet(os.path.join(path, f)) for f in files]
  df = pd.concat(df_list, ignore_index=True)

  if verbosity > 0:
    end = datetime.datetime.now()
    print(" Finished. Took {}".format(end - start))
  return df


def read_parquet_as_pandas(path, verbosity=1):
  """Workaround for pandas not being able to read folder-style parquet files.
  """
  if os.path.isdir(path):
    if verbosity > 1: print("Parquet file is actually a folder.")
    return read_parquet_folder_as_pandas(path, verbosity)
  else:
    return pd.read_parquet(path)

This assumes that the relevant files in the parquet "file", which is actually a folder, end with ".parquet". This works for parquet files exported by Databricks and might work with others as well (untested, happy about feedback in the comments).

The function read_parquet_as_pandas() can be used if it is not known beforehand whether it is a folder or not.
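A minimal usage sketch (the path is the illustrative one from the question):

# Works whether "path/myfile.parquet" is a single file or a Spark-written folder
df = read_parquet_as_pandas("path/myfile.parquet")
print(df.shape)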

answered Oct 19 '22 by Thomas


If the parquet file has been created with Spark (so it is actually a directory), to import it to pandas use

from pyarrow.parquet import ParquetDataset

dataset = ParquetDataset("file.parquet")
table = dataset.read()
df = table.to_pandas()
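As a side note not covered by this answer: newer pandas/pyarrow releases can usually read such a Spark-written directory directly, so after upgrading the libraries the plain call from the question may already work:

import pandas as pd

# With a sufficiently recent pyarrow, the directory path Spark produced can be
# passed directly; pyarrow discovers and concatenates the part files itself.
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")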
answered Oct 19 '22 by Galuoises