How to read/write partitioned Apache Arrow or Parquet files into/out of Julia

I am trying to read and write a trivial partitioned dataset in Julia. The dataset is mtcars, taken from R, with an arbitrarily added column bt containing random Boolean values. The file/folder structure (below) was written out using the R arrow package.

The files are laid out as follows:

arr
|-- bt=false
|   `-- part-1.arrow
`-- bt=true
    `-- part-0.arrow
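
For completeness, since I also want to write such datasets out of Julia: below is a minimal sketch of how one might produce this Hive-style layout manually with Arrow.jl. Here df stands for the mtcars table with the added bt column, and the file names are arbitrary. (Unlike the R arrow package, this keeps the bt column inside each file; select(sub, Not(:bt)) would drop it.)

using Arrow, DataFrames

# Write one Arrow file per value of the partition column, Hive-style ("bt=<value>").
for sub in groupby(df, :bt)
    dir = joinpath("arr", "bt=$(first(sub.bt))")
    mkpath(dir)
    Arrow.write(joinpath(dir, "part-0.arrow"), sub)
end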

How can I faithfully reproduce the original table in Julia?

What I've tried so far:

  1. Using the Parquet.jl package. The documentation suggests that it should automatically detect the partitioning folder structure for columns of bool/string/date type. When I read the data in with read_parquet(path; kwargs...), the resulting table does not have the bt column. I also tried explicitly setting the column_generator keyword argument to its default, Parquet.dataset_column_generator, but this did not work.

  2. Using the Arrow.jl package. Unless I have misunderstood the documentation, there is no documented way to directly read in a partitioned data structure; the manual fallback I would like to avoid is sketched below.
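
Here is that fallback (a sketch only, not a documented Arrow.jl feature; read_partitioned_arrow is my own helper name, and it assumes the partition values are Bools as in this dataset):

using Arrow, DataFrames

# Walk the Hive-style partition folders ("bt=false", "bt=true"), read each
# Arrow file, and re-attach the partition column by hand.
function read_partitioned_arrow(root)
    parts = DataFrame[]
    for dir in filter(isdir, readdir(root; join=true))
        col, val = split(basename(dir), "=")      # e.g. ("bt", "false")
        for file in readdir(dir; join=true)
            df = DataFrame(Arrow.Table(file))
            df[!, Symbol(col)] .= parse(Bool, val)
            push!(parts, df)
        end
    end
    return vcat(parts...)
end

tbl = read_partitioned_arrow("arr")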

R does not generate additional metadata files to store the schema. As I understand it, such metadata is optional and not part of the Arrow spec; is that correct?

asked May 17 '21 by tinker


1 Answer

Try this. The Parquet.jl documentation describes the following method:

Partitions in a parquet file or dataset can also be iterated over using an iterator returned by the Tables.partitions method.

using Parquet, Tables, DataFrames

# `path` is the root directory of the partitioned dataset, e.g. "arr"
for partition in Tables.partitions(read_parquet(path))
    df = DataFrame(partition)
    # process each partition's DataFrame here
end
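
If you want the partitions back as a single table, one way (a sketch; it assumes read_parquet re-attaches the bt partition column, as the docs describe) is to concatenate them:

df = reduce(vcat, DataFrame.(Tables.partitions(read_parquet(path))))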

For further reference: https://github.com/JuliaIO/Parquet.jl

answered Oct 24 '22 by Udara Weerasinghe