I am writing a parquet file from a Spark DataFrame the following way:
df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip")
This creates a folder with multiple files in it.
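For reference, listing that path shows Spark's part files rather than a single parquet file. The snippet below is only an illustration; the exact file names and counts depend on the job:
import os

# List what Spark actually wrote; the names shown are illustrative.
for name in sorted(os.listdir("path/myfile.parquet")):
    print(name)
# Typical output:
# _SUCCESS
# part-00000-<some-uuid>.gz.parquet
# part-00001-<some-uuid>.gz.parquet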
When I try to read this into pandas, I get the following errors, depending on which parser I use:
import pandas as pd
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")
PyArrow:
File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
ArrowIOError: Invalid parquet file. Corrupt footer.
fastparquet:
File "C:\Program Files\Anaconda3\lib\site-packages\fastparquet\util.py", line 38, in default_open return open(f, mode)
PermissionError: [Errno 13] Permission denied: 'path/myfile.parquet'
I am using the following versions:
I tried gzip as well as snappy compression; neither works. I of course made sure that the file is in a location where Python has permission to read and write.
It would already help if somebody was able to reproduce this error.
The problem is that Spark partitions the data due to its distributed nature: each executor writes its own part file inside the directory that gets the name you passed. This layout is not something pandas supports directly; it expects a single file, not a directory.
You can circumvent this issue in different ways:
Read the directory with an alternative utility, such as pyarrow.parquet.ParquetDataset, and then convert the result to pandas (I did not test this code):
import pyarrow.parquet

arrow_dataset = pyarrow.parquet.ParquetDataset('path/myfile.parquet')
arrow_table = arrow_dataset.read()
pandas_df = arrow_table.to_pandas()
Another way is to read the separate fragments individually and then concatenate them, as this answer suggests: Read multiple parquet files in a folder and write to single csv file using python.
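A minimal sketch of that fragment-by-fragment approach, assuming all part files end in ".parquet" and share the same schema:
import glob
import os
import pandas as pd

# Collect the individual part files written by Spark and stack them
# into one DataFrame.
part_files = glob.glob(os.path.join("path/myfile.parquet", "*.parquet"))
df = pd.concat([pd.read_parquet(f) for f in part_files], ignore_index=True)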
Since this still seems to be an issue even with newer pandas versions, I wrote some functions to circumvent this as part of a larger pyspark helpers library:
import pandas as pd
import datetime
import os

def read_parquet_folder_as_pandas(path, verbosity=1):
    """Read all parquet part files in a folder and concatenate them into one DataFrame."""
    files = [f for f in os.listdir(path) if f.endswith("parquet")]

    if verbosity > 0:
        print("{} parquet files found. Beginning reading...".format(len(files)), end="")
        start = datetime.datetime.now()

    df_list = [pd.read_parquet(os.path.join(path, f)) for f in files]
    df = pd.concat(df_list, ignore_index=True)

    if verbosity > 0:
        end = datetime.datetime.now()
        print(" Finished. Took {}".format(end - start))

    return df

def read_parquet_as_pandas(path, verbosity=1):
    """Workaround for pandas not being able to read folder-style parquet files."""
    if os.path.isdir(path):
        if verbosity > 1:
            print("Parquet file is actually folder.")
        return read_parquet_folder_as_pandas(path, verbosity)
    else:
        return pd.read_parquet(path)
This assumes that the relevant files in the parquet "file", which is actually a folder, end with ".parquet". It works for parquet files exported by Databricks and might work with others as well (untested, happy about feedback in the comments).
The function read_parquet_as_pandas() can be used if it is not known beforehand whether the path is a single file or a folder.
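Usage is then the same whether the path points at a plain parquet file or at a Spark-written folder, for example:
# Works for a single .parquet file as well as for a Spark output directory.
df = read_parquet_as_pandas("path/myfile.parquet")
print(df.shape)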
If the parquet file has been created with Spark (so it is a directory), import it into pandas with:
from pyarrow.parquet import ParquetDataset
dataset = ParquetDataset("file.parquet")
table = dataset.read()
df = table.to_pandas()
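If only a subset of columns is needed, ParquetDataset.read also accepts a columns argument, which keeps memory usage down; the column names below are only placeholders:
# Read just the columns you need; "col_a" and "col_b" are hypothetical names.
table = dataset.read(columns=["col_a", "col_b"])
df = table.to_pandas()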