 

Storing multiple dataframes of different widths with Parquet?

Does Parquet support storing multiple data frames of different widths (numbers of columns) in a single file? In HDF5, for example, it is possible to store several such data frames and access them by key. From my reading so far it looks like Parquet does not support this, so the alternative would be storing multiple Parquet files in the file system. I have a rather large number (say 10,000) of relatively small frames (~1-5 MB each) to process, so I'm not sure whether this could become a concern.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Two frames with different column sets (different widths).
df1 = pd.DataFrame(data={"A": [1, 2, 3], "B": [4, 5, 6]},
                   columns=["A", "B"])
df2 = pd.DataFrame(data={"X": [1, 2], "Y": [3, 4], "Z": [5, 6]},
                   columns=["X", "Y", "Z"])
dfs = [df1, df2]

# Workaround: write each frame to its own Parquet file.
for i, df in enumerate(dfs):
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "my_parq_" + str(i) + ".parquet")
asked May 21 '18 by Turo

People also ask

Is Parquet faster than CSV in pandas?

Pandas is widely used for data analysis in Python. Using Parquet files with Apache Arrow gives an impressive speed advantage over CSV files with Pandas when reading the content of large files.
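A rough sketch of such a comparison (the file names are placeholders, and actual timings depend entirely on the data and hardware):

import time
import pandas as pd
import pyarrow.parquet as pq

start = time.perf_counter()
df_csv = pd.read_csv("large_file.csv")  # placeholder file name
csv_seconds = time.perf_counter() - start

start = time.perf_counter()
df_parquet = pq.read_table("large_file.parquet").to_pandas()  # placeholder file name
parquet_seconds = time.perf_counter() - start

print("CSV: %.2fs, Parquet via Arrow: %.2fs" % (csv_seconds, parquet_seconds))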

Is Parquet better than pickle?

On read speeds, pickle was 10x faster than CSV, MessagePack was 4x faster, Parquet was 2-3x faster, and JSON/HDF were about the same as CSV. On write speeds, pickle was 30x faster than CSV, MessagePack and Parquet were 10x faster, and JSON/HDF were about the same as CSV.

Is Parquet better than CSV?

Apache Parquet is column-oriented and designed to provide efficient columnar storage compared to row-based file types such as CSV. Parquet files were designed with complex nested data structures in mind. Apache Parquet is designed to support very efficient compression and encoding schemes.

How does Parquet compression work?

In Parquet, compression is performed column by column, and the format is built to support flexible compression options and extensible encoding schemes per data type – e.g., different encodings can be used for compressing integer and string data.
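A hedged illustration of this with pyarrow, whose write_table accepts a per-column codec mapping (the column names and codec choices here are arbitrary):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"ints": [1, 2, 3], "strings": ["a", "b", "c"]})
table = pa.Table.from_pandas(df)

# One codec per column; snappy and zstd are standard Parquet codecs.
pq.write_table(
    table,
    "per_column_compression.parquet",
    compression={"ints": "snappy", "strings": "zstd"},
)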


1 Answer

No, this is not possible as Parquet files have a single schema. They also normally don't appear as single files but as many files in a directory, all sharing the same schema. This enables tools to read these files as if they were one, either fully into local RAM, distributed over multiple nodes, or by evaluating an (SQL) query on them.
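A minimal sketch of that layout (the directory name and values are made up for illustration): several Parquet files that share one schema can be read back as a single table:

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs("dataset_dir", exist_ok=True)

# Several files, all with the same two-column schema.
for i in range(3):
    part = pd.DataFrame({"A": [i], "B": [i * 10]})
    pq.write_table(pa.Table.from_pandas(part), "dataset_dir/part_%d.parquet" % i)

# The directory can be read back as one logical table because the schemas match.
combined = pq.read_table("dataset_dir").to_pandas()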

Parquet will also store these data frames efficiently even at this small size, so it should be a suitable serialization format for your use case. In contrast to HDF5, Parquet is only a serialization format for tabular data. As mentioned in your question, HDF5 additionally supports file-system-like key-value access. Since you have a large number of frames and this might be problematic for the underlying filesystem, you should look for a replacement for that layer. A possible approach is to first serialize each DataFrame to Parquet in memory and then store the resulting bytes in a key-value container; this could be a simple zip archive or a real key-value store such as LevelDB. A sketch of this idea is shown below.
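A sketch of the in-memory-Parquet-plus-key-value approach, assuming a plain zip archive as the key-value layer (the keys and helper function are made up for illustration):

import zipfile
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def df_to_parquet_bytes(df):
    # Write the table to an in-memory buffer instead of a file on disk.
    sink = pa.BufferOutputStream()
    pq.write_table(pa.Table.from_pandas(df), sink)
    return sink.getvalue().to_pybytes()

frames = {
    "frame_a": pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}),
    "frame_b": pd.DataFrame({"X": [1, 2], "Y": [3, 4], "Z": [5, 6]}),
}

# Store each frame under its own key; the filesystem only sees one archive.
with zipfile.ZipFile("frames.zip", "w") as archive:
    for key, df in frames.items():
        archive.writestr(key + ".parquet", df_to_parquet_bytes(df))

# Read a single frame back by key.
with zipfile.ZipFile("frames.zip") as archive:
    frame_a = pq.read_table(pa.BufferReader(archive.read("frame_a.parquet"))).to_pandas()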

answered Sep 28 '22 by Uwe L. Korn