 

Parquet vs Database

I am trying to understand which of the two options below would be the better choice, especially in a Spark environment:

  1. Loading the Parquet files directly into a DataFrame and accessing the data there (a table of about 1 TB)
  2. Using a database to store and access the data.

I am working on a data pipeline design and trying to understand which of the two options results in the more optimized solution.
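For concreteness, here is a minimal PySpark sketch of the two access patterns; the paths, JDBC URL, table name, and credentials are placeholders, not details from the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-vs-db").getOrCreate()

    # Option 1: read the Parquet files directly into a DataFrame.
    # Spark scans the file splits in parallel, one task per split.
    df_parquet = spark.read.parquet("hdfs:///data/events")

    # Option 2: read the same table from a database over JDBC.
    # Every row travels through the database connection instead.
    df_jdbc = spark.read.jdbc(
        url="jdbc:oracle:thin:@//db-host:1521/ORCL",  # placeholder URL
        table="events",
        properties={"user": "etl_user", "password": "secret"},
    )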

asked Nov 07 '22 by BlackJack


1 Answer

Loading the Parquet files directly into a DataFrame and accessing the data there is more scalable than reading from an RDBMS such as Oracle through a JDBC connector. I handle more than 10 TB of data, though I prefer the ORC format for better performance. I suggest reading the data directly from files. The reason is data locality: if you run your Spark executors on the same hosts as the HDFS data nodes, they can read the data into memory efficiently, without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.
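A minimal sketch of the file-based approach with ORC, assuming an existing Parquet dataset to convert; the paths and column name are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-demo").getOrCreate()

    # One-off conversion of the Parquet dataset to ORC.
    spark.read.parquet("hdfs:///data/events").write.orc("hdfs:///data/events_orc")

    # Read the ORC files directly; with executors co-located on the HDFS
    # data nodes, most reads are served locally without network overhead.
    df = spark.read.orc("hdfs:///data/events_orc")
    df.groupBy("event_type").count().show()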

answered Nov 15 '22 by Sahil Desai