I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.
Is an R reader available? Or is work being done on one?
If not, what would be the most expedient way to get there? Note: There are Java and C++ bindings: https://github.com/apache/parquet-mr
'Parquet' is a columnar storage file format. The read_parquet() function in the arrow package enables you to read Parquet files into R.
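A minimal sketch, assuming the arrow package is installed from CRAN and the file path is just a placeholder:

# install.packages("arrow")   # once, from CRAN
library(arrow)

# read a Parquet file into an R data frame (returned as a tibble)
df <- read_parquet("/path/to/data.parquet")
head(df)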
There is also a desktop application for viewing Parquet as well as other binary-format data such as ORC and Avro. It's a pure Java application, so it can run on Linux, Mac, and Windows. See Bigdata File Viewer for details. It supports complex data types like array, map, etc.
Regarding encodings in Parquet: the plain encoding stores values back to back and is used as a last resort when there is no more efficient encoding for the given data. The plain encoding always reserves the same amount of space for a given type; for instance, a 32-bit int is always stored in 4 bytes.
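As a rough illustration only (this is not Parquet's actual writer), serializing R integers to raw bytes shows the fixed 4-byte, back-to-back layout that plain encoding uses for int32 values:

# illustrative only: write three 32-bit integers back to back into a raw vector
raw_bytes <- writeBin(c(1L, 2L, 3L), raw(), size = 4L)
length(raw_bytes)  # 12 bytes: 3 values x 4 bytes each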
If you're using Spark, this is now relatively simple with the release of Spark 1.4; see the sample code below, which uses the SparkR package that is now part of the Apache Spark core framework.
# install the SparkR package
devtools::install_github('apache/spark', ref='master', subdir='R/pkg')

# load the SparkR package
library('SparkR')

# initialize sparkContext which starts a new Spark session
sc <- sparkR.init(master="local")

# initialize sqlContext
sq <- sparkRSQL.init(sc)

# load parquet file into a Spark data frame and coerce into R data frame
df <- collect(parquetFile(sq, "/path/to/filename"))

# terminate Spark session
sparkR.stop()
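For what it's worth, here is a rough equivalent for Spark 2.x and later, where the init-style calls above were superseded by the session-based SparkR API (the path is again a placeholder):

library(SparkR)

# start a Spark session (replaces sparkR.init/sparkRSQL.init in Spark 2.x+)
sparkR.session(master = "local")

# read the parquet file into a SparkDataFrame and collect it as an R data frame
df <- collect(read.parquet("/path/to/filename"))

# stop the session
sparkR.session.stop()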
An expanded example is shown @ https://gist.github.com/andyjudson/6aeff07bbe7e65edc665
I'm not aware of any other package that you could use if you weren't using Spark.