 

How do I read a Parquet in R and convert it to an R DataFrame?

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.

Is an R reader available? Or is work being done on one?

If not, what would be the most expedient way to get there? Note: There are Java and C++ bindings: https://github.com/apache/parquet-mr

asked May 22 '15 by metasim

People also ask

Can you read a Parquet in R?

Yes. 'Parquet' is a columnar storage file format, and R packages such as arrow provide functions that read Parquet files directly into R data frames.
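A minimal sketch using the arrow package (assumes arrow has been installed with install.packages("arrow") and that "data.parquet" is a hypothetical file path):

library(arrow)

# read_parquet() returns a data frame (a tibble, by default)
df <- read_parquet("data.parquet")

# write a data frame back out as Parquet
write_parquet(df, "copy.parquet")

This approach needs no Spark installation, which makes it the lighter-weight option when you only need to read or write files.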

How do you read data from Parquet?

There is a desktop application for viewing Parquet files, as well as other binary formats such as ORC and Avro. It is a pure Java application, so it runs on Linux, Mac, and Windows. See Bigdata File Viewer for details. It supports complex data types such as array, map, etc.

What is Parquet encoding?

Parquet's plain encoding stores values back to back and is used as a last resort when no more efficient encoding exists for the given data. Plain encoding always reserves the same amount of space for a given type; for instance, a 32-bit int is always stored in 4 bytes.


1 Answer

If you're using Spark, this is now relatively simple with the release of Spark 1.4. See the sample code below, which uses the SparkR package, now part of the Apache Spark core framework.

# install the SparkR package
devtools::install_github('apache/spark', ref='master', subdir='R/pkg')

# load the SparkR package
library('SparkR')

# initialize sparkContext, which starts a new Spark session
sc <- sparkR.init(master="local")

# initialize sqlContext
sq <- sparkRSQL.init(sc)

# load the parquet file into a Spark data frame and coerce it into an R data frame
df <- collect(parquetFile(sq, "/path/to/filename"))

# terminate the Spark session
sparkR.stop()

An expanded example is shown at https://gist.github.com/andyjudson/6aeff07bbe7e65edc665
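Note that in Spark 2.x and later the SparkR entry points changed: sparkR.session() replaced sparkR.init()/sparkRSQL.init(), and read.parquet() replaced parquetFile(). A roughly equivalent sketch with the newer API (assuming a local Spark installation; "/path/to/filename" is a placeholder) would be:

library(SparkR)

# a single session object replaces the separate sparkContext/sqlContext
sparkR.session(master = "local")

# read the Parquet file as a Spark DataFrame, then collect() into an R data frame
df <- collect(read.parquet("/path/to/filename"))

# terminate the Spark session
sparkR.stop()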

I'm not aware of any other package that you could use if you weren't using Spark.

answered Sep 27 '22 by Andy Judson