
Is it possible to read PDF/audio/video files (unstructured data) using Apache Spark?

Is it possible to read PDF, audio, or video files (unstructured data) using Apache Spark? For example, I have thousands of PDF invoices, and I want to extract data from them and run some analytics on it. What steps do I need to take to process unstructured data?

Rahul Kanodiya asked Jul 03 '17 16:07



1 Answer

Yes, it is. Use sparkContext.binaryFiles to load the files in binary form, then use map to convert each value to another format, for example by parsing the bytes with Apache Tika or Apache POI.

Pseudocode:

val rawFile = sparkContext.binaryFiles(...)
val ready = rawFile.map { case (path, stream) => /* parse stream.open() here with another framework */ }

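The important point is that the parsing itself must be done by another framework, such as the ones mentioned above; Spark only delivers the raw bytes. The function passed to map receives a (path, PortableDataStream) pair, and calling open() on that stream returns a DataInputStream you can hand to the parser.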
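
A slightly fuller sketch of the same idea, assuming Apache Tika is on the classpath and does the text extraction. The object name BinaryFilesSketch, the helper names extractText and readTexts, the local[*] master, and the invoices/*.pdf path are placeholders chosen for illustration, not part of the original answer.

```scala
import java.io.InputStream

import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.tika.Tika

object BinaryFilesSketch {

  // Extract plain text from one file. Tika detects the format from the content,
  // so the same function can be pointed at PDFs, Office files, and so on.
  // Note: Tika.parseToString truncates very long documents by default.
  def extractText(stream: PortableDataStream): String = {
    val in: InputStream = stream.open() // DataInputStream over the file's bytes
    try new Tika().parseToString(in)
    finally in.close()
  }

  // binaryFiles yields (path, PortableDataStream) pairs; keep the path as the key
  // and replace the stream with the extracted text.
  def readTexts(sc: SparkContext, path: String): RDD[(String, String)] =
    sc.binaryFiles(path).mapValues(extractText)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for trying the sketch on one machine.
    val conf = new SparkConf().setAppName("binary-files-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input location; replace with your own directory or glob.
      val texts = readTexts(sc, "invoices/*.pdf")
      texts.take(5).foreach { case (file, text) =>
        println(s"$file: ${text.length} characters extracted")
      }
    } finally sc.stop()
  }
}
```

The result is an ordinary RDD of (path, text) pairs, so any further analytics, such as pulling invoice totals out of the text, is regular Spark code from that point on. Each file is read whole by a single task, so this approach suits many small or medium files better than a few very large ones.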

T. Gawęda answered Sep 26 '22 01:09
