How do I read a partitioned parquet file into R with arrow (without any Spark)?
The situation
The parquet file structure
The parquet files created by my Spark job consist of several parts:
tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc
How do I read this component_mapping.parquet into R?
What I tried
install.packages("arrow")
library(arrow)
my_df <- read_parquet("component_mapping.parquet")
but this fails with the error
IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory
It works if I just read one file of the directory
install.packages("arrow")
library(arrow)
my_df <- read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")
but I need to load all of the parts in order to query across them.
What I found in the documentation
In the Apache Arrow documentation https://arrow.apache.org/docs/r/reference/read_parquet.html and https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html I found that there are some properties for the read_parquet() command, but I can't get it working and cannot find any examples.
read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)
How do I set the properties correctly to read the full directory?
# presumably one of these methods
$read_dictionary(column_index)
or
$set_read_dictionary(column_index, read_dict)
Help would be much appreciated.
As @neal-richardson alluded to in his answer, more work has been done on this, and with the current arrow package (I'm running 4.0.0 currently) this is possible.
I noticed your files used snappy compression, which requires a special build flag before installation. (Installation documentation here: https://arrow.apache.org/docs/r/articles/install.html)
Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
install.packages("arrow",force = TRUE)
The Dataset API implements the functionality you are looking for, with multi-file datasets. While the documentation does not yet include a wide variety of examples, it does provide a clear starting point: https://arrow.apache.org/docs/r/reference/Dataset.html
The example below shows a minimal way of reading a multi-file dataset from a given directory and converting it to an in-memory R data frame. The API also supports filtering criteria and selecting a subset of columns (a dplyr-based sketch follows the example below), though I'm still trying to figure out the full syntax myself.
library(arrow)
## Define the dataset
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Create a scanner
SO <- Scanner$create(DS)
## Load it as an Arrow Table in memory
AT <- SO$ToTable()
## Convert it to an R data frame
DF <- as.data.frame(AT)
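For the filtering and column selection mentioned above, arrow also understands dplyr verbs on a Dataset. A minimal sketch, assuming hypothetical column names component_id and status (substitute your own schema):
library(arrow)
library(dplyr)
## Define the dataset as before
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Column selection and row filtering are pushed down to the scan,
## so only the needed data is read from disk
DF <- DS %>%
  select(component_id, status) %>%
  filter(status == "active") %>%
  collect()  ## materialize the result as an R data frame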
Solution for: Read partitioned parquet files from local file system into R dataframe with arrow
As I would like to avoid using any Spark or Python on the RShiny server, I can't use the other libraries like sparklyr, SparkR, reticulate, or dplyr as described e.g. in How do I read a Parquet in R and convert it to an R DataFrame?
I solved my task now with your proposal, using arrow together with lapply and rbindlist:
my_df <-data.table::rbindlist(lapply(Sys.glob("component_mapping.parquet/part-*.parquet"), arrow::read_parquet))
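The same one-liner, unpacked into steps (same files, same result):
## List all part files of the partitioned parquet directory
files <- Sys.glob("component_mapping.parquet/part-*.parquet")
## Read each part into its own data frame
parts <- lapply(files, arrow::read_parquet)
## Stack them row-wise into a single data.table
my_df <- data.table::rbindlist(parts)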
Looking forward to when the Apache Arrow functionality is available. Thanks!