Read Parquet files from Scala without using Spark

Is it possible to read parquet files from Scala without using Apache Spark?

I found a project that lets us read and write Avro files using plain Scala:

https://github.com/sksamuel/avro4s

However, I can't find a way to read and write Parquet files from a plain Scala program without using Spark.

Knows Not Much asked Feb 05 '16 23:02



1 Answer

It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.

Some sample code

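    import org.apache.avro.generic.GenericRecord
    import org.apache.parquet.avro.AvroParquetReader
    import org.apache.parquet.hadoop.ParquetReader

    // path is an org.apache.hadoop.fs.Path pointing at the Parquet file
    val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
    // read returns null at end of file; iter is of type Iterator[GenericRecord]
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
    // if you want a list then...
    val list = iter.toList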

This returns standard Avro GenericRecords. If you want to turn those into Scala case classes, you can use my Avro4s library, which you linked to in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:

    case class Bibble(name: String, location: String)
    val format = RecordFormat[Bibble]
    // then for a given record
    val bibble = format.from(record)

We can obviously combine that with the original iterator in one step:

    val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
    val format = RecordFormat[Bibble]
    // iter is now an Iterator[Bibble]
    val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
    // and list is now a List[Bibble]
    val list = iter.toList
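
The snippets above never close the reader. Here is a minimal sketch of the same Iterator.continually / takeWhile pattern with the close added. So that it runs without any Parquet or Hadoop dependencies, it uses a hypothetical StubReader in place of ParquetReader; StubReader, readAll and NullTerminatedRead are names made up for this illustration and are not part of parquet-mr or Avro4s. The only assumption carried over from the real reader is that read() returns null once the input is exhausted.

```scala
object NullTerminatedRead {

  // Hypothetical stand-in for ParquetReader: read() hands out one item per
  // call and returns null when nothing is left.
  final class StubReader[T >: Null](items: List[T]) extends AutoCloseable {
    private var remaining = items
    var closed = false

    def read(): T = remaining match {
      case head :: tail =>
        remaining = tail
        head
      case Nil => null
    }

    override def close(): Unit = closed = true
  }

  // Drain the reader into a List, then close it, even if read() throws.
  // toList is strict, so every record is consumed before finally runs.
  def readAll[T >: Null](reader: StubReader[T]): List[T] =
    try Iterator.continually(reader.read()).takeWhile(_ != null).toList
    finally reader.close()

  def main(args: Array[String]): Unit = {
    val reader = new StubReader(List("a", "b", "c"))
    println(readAll(reader))
    println(reader.closed)
  }
}
```

The list is materialised inside the try on purpose: returning the lazy iterator instead would close the reader before anything had been read from it.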
sksamuel answered Sep 20 '22 22:09