
Pandas cannot read parquet files created in PySpark

I am writing a parquet file from a Spark DataFrame the following way:

df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip")

This creates a folder with multiple files in it.

When I try to read this into pandas, I get the following errors, depending on which parser I use:

import pandas as pd
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")

PyArrow:

File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status

ArrowIOError: Invalid parquet file. Corrupt footer.

fastparquet:

File "C:\Program Files\Anaconda3\lib\site-packages\fastparquet\util.py", line 38, in default_open return open(f, mode)

PermissionError: [Errno 13] Permission denied: 'path/myfile.parquet'

I am using the following versions:

  • Spark 2.4.0
  • Pandas 0.23.4
  • pyarrow 0.10.0
  • fastparquet 0.2.1

I tried gzip as well as snappy compression; neither works. I have, of course, made sure that the file is in a location where Python has permission to read and write.

It would already help if somebody were able to reproduce this error.

asked Jan 15 '19 by Thomas



3 Answers

The problem is that Spark partitions the output because of its distributed nature: each executor writes its own part file inside a directory that gets the name you passed in. Pandas does not support this out of the box; it expects a single file, not a directory.
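To make that concrete, here is a small sketch of what such a Spark output directory typically contains (the path is the one from the question; the part-file names are illustrative, the exact names vary):

import os

# "path/myfile.parquet" is a directory, not a file: Spark writes one part file
# per task plus a _SUCCESS marker.
print(sorted(os.listdir("path/myfile.parquet")))
# e.g. ['_SUCCESS', 'part-00000-....gz.parquet', 'part-00001-....gz.parquet']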

You can circumvent this issue in different ways:

  • Reading the file with an alternative utility, such as pyarrow.parquet.ParquetDataset, and then converting the result to pandas (I did not test this code).

      import pyarrow.parquet

      arrow_dataset = pyarrow.parquet.ParquetDataset('path/myfile.parquet')
      arrow_table = arrow_dataset.read()
      pandas_df = arrow_table.to_pandas()
    
  • Another way is to read the separate fragments individually and then concatenate them, as this answer suggests: Read multiple parquet files in a folder and write to single csv file using python

answered Oct 19 '22 by martinarroyo


Since this still seems to be an issue even with newer pandas versions, I wrote some functions to circumvent this as part of a larger pyspark helpers library:

import pandas as pd
import datetime
import os

def read_parquet_folder_as_pandas(path, verbosity=1):
  """Read all part files of a Spark-written parquet folder into one pandas DataFrame."""
  # Collect every part file in the folder that carries the parquet extension
  files = [f for f in os.listdir(path) if f.endswith("parquet")]

  if verbosity > 0:
    print("{} parquet files found. Beginning reading...".format(len(files)), end="")
    start = datetime.datetime.now()

  # Read each part file individually and concatenate into a single DataFrame
  df_list = [pd.read_parquet(os.path.join(path, f)) for f in files]
  df = pd.concat(df_list, ignore_index=True)

  if verbosity > 0:
    end = datetime.datetime.now()
    print(" Finished. Took {}".format(end - start))
  return df


def read_parquet_as_pandas(path, verbosity=1):
  """Workaround for pandas not being able to read folder-style parquet files.
  """
  if os.path.isdir(path):
    if verbosity > 1: print("Parquet file is actually a folder.")
    return read_parquet_folder_as_pandas(path, verbosity)
  else:
    return pd.read_parquet(path)

This assumes that the relevant files in the parquet "file", which is actually a folder, end with ".parquet". This works for parquet files exported by Databricks and might work with others as well (untested, happy about feedback in the comments).

The function read_parquet_as_pandas() can be used if it is not known beforehand whether it is a folder or not.
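A minimal usage sketch (the path is the illustrative one from the question):

# Works whether "path/myfile.parquet" is a single file or a Spark-written folder
df = read_parquet_as_pandas("path/myfile.parquet")
print(df.shape)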

answered Oct 19 '22 by Thomas


If the parquet file has been created with Spark (so it is actually a directory), to import it to pandas use

from pyarrow.parquet import ParquetDataset

dataset = ParquetDataset("file.parquet")
table = dataset.read()
df = table.to_pandas()
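As a side note not covered by this answer: newer pandas/pyarrow releases can usually read such a Spark-written directory directly, so after upgrading the libraries the plain call from the question may already work:

import pandas as pd

# With a sufficiently recent pyarrow, the directory path Spark produced can be
# passed directly; pyarrow discovers and concatenates the part files itself.
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")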
answered Oct 19 '22 by Galuoises