How to read/write partitioned Apache Arrow or Parquet files into/out of Julia

I am trying to read and write a trivial partitioned dataset in Julia. The dataset is mtcars, taken from R, with an arbitrarily added column bt containing random Boolean values. The file/folder structure (below) was written out using the R arrow package.

The files are laid out as follows:

arr
|-- bt=false
|   `-- part-1.arrow
`-- bt=true
    `-- part-0.arrow
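
For completeness, since I also want to write such datasets out of Julia: below is a minimal sketch of how one might produce this Hive-style layout manually with Arrow.jl. Here df stands for the mtcars table with the added bt column, and the file names are arbitrary. (Unlike the R arrow package, this keeps the bt column inside each file; select(sub, Not(:bt)) would drop it.)

using Arrow, DataFrames

# Write one Arrow file per value of the partition column, Hive-style ("bt=<value>").
for sub in groupby(df, :bt)
    dir = joinpath("arr", "bt=$(first(sub.bt))")
    mkpath(dir)
    Arrow.write(joinpath(dir, "part-0.arrow"), sub)
end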

How can I faithfully reproduce the original table in Julia?

What I've tried so far:

  1. Using the Parquet.jl package. The documentation suggests that it should automatically detect the partitioning folder structure for columns of bool/string/date type. When I read the data in with read_parquet(path; kwargs...), the resulting table does not have the bt column. I also tried explicitly setting the column_generator keyword argument to its default, Parquet.dataset_column_generator, but this did not work.

  2. Using the Arrow.jl package. Unless I have misunderstood the documentation, there is no documented way to directly read in a partitioned data structure; the manual fallback I would like to avoid is sketched below.
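
Here is that fallback (a sketch only, not a documented Arrow.jl feature; read_partitioned_arrow is my own helper name, and it assumes the partition values are Bools as in this dataset):

using Arrow, DataFrames

# Walk the Hive-style partition folders ("bt=false", "bt=true"), read each
# Arrow file, and re-attach the partition column by hand.
function read_partitioned_arrow(root)
    parts = DataFrame[]
    for dir in filter(isdir, readdir(root; join=true))
        col, val = split(basename(dir), "=")      # e.g. ("bt", "false")
        for file in readdir(dir; join=true)
            df = DataFrame(Arrow.Table(file))
            df[!, Symbol(col)] .= parse(Bool, val)
            push!(parts, df)
        end
    end
    return vcat(parts...)
end

tbl = read_partitioned_arrow("arr")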

R does not generate additional metadata files to store the schema. As I understand it, such metadata is optional and not part of the Arrow spec; is that correct?

asked May 17 '21 by tinker


1 Answer

Try this. The Parquet.jl documentation describes the following method:

Partitions in a parquet file or dataset can also be iterated over using an iterator returned by the Tables.partitions method.

using Parquet, Tables, DataFrames

# `path` is the root directory of the partitioned dataset, e.g. "arr"
for partition in Tables.partitions(read_parquet(path))
    df = DataFrame(partition)
    # process each partition's DataFrame here
end
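
If you want the partitions back as a single table, one way (a sketch; it assumes read_parquet re-attaches the bt partition column, as the docs describe) is to concatenate them:

df = reduce(vcat, DataFrame.(Tables.partitions(read_parquet(path))))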

For further reference: https://github.com/JuliaIO/Parquet.jl

answered Oct 24 '22 by Udara Weerasinghe