Parquet VS Database

Question

I am trying to understand which of the following two options is better, particularly in a Spark environment:

Loading the Parquet file directly into a DataFrame and accessing the data (a 1 TB table)
Storing the data in a database and accessing it from there

I am designing a data pipeline and want to know which of these two options will give the more optimized solution.

Sahil Desai · Accepted Answer

Loading the Parquet file directly into a DataFrame is more scalable than reading from an RDBMS such as Oracle through the JDBC connector. I handle more than 10 TB of data this way, although I prefer the ORC format for better performance. I suggest reading the data directly from files. The reason is data locality: if you run your Spark executors on the same hosts as the HDFS data nodes, they can read data into memory without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and "How does Apache Spark know about HDFS data nodes?" for more details.
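To make the comparison concrete, here is a rough PySpark sketch of both read paths. It is only an illustration under assumptions: the path, table name, column names, date literal and JDBC URL are all hypothetical, and the `spark` argument is assumed to be an existing `SparkSession`. The JDBC options used (`url`, `dbtable`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) are standard Spark JDBC data source options.

```python
from typing import Dict


def jdbc_read_options(url: str, table: str, partition_column: str,
                      lower: int, upper: int, num_partitions: int) -> Dict[str, str]:
    """Build Spark JDBC options for a partitioned read.

    Without partitionColumn/lowerBound/upperBound/numPartitions, Spark
    reads the whole table through a single JDBC connection.
    """
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def load_from_parquet(spark, path: str):
    """Read Parquet files directly into a DataFrame.

    Selecting a few columns and filtering early lets Spark skip unneeded
    columns and push the filter down to the Parquet reader.
    Column names and the date literal are hypothetical.
    """
    return (
        spark.read.parquet(path)
        .select("event_id", "event_date", "amount")
        .where("event_date >= '2024-01-01'")
    )


def load_from_jdbc(spark, options: Dict[str, str]):
    """Read the same table from a database over JDBC.

    Every row travels over the network from the database server,
    so there is no data locality to exploit.
    """
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical usage, assuming `spark` is an active SparkSession:
#   df_files = load_from_parquet(spark, "hdfs:///warehouse/events")
#   opts = jdbc_read_options("jdbc:oracle:thin:@//db-host:1521/ORCL",
#                            "EVENTS", "EVENT_ID", 1, 1_000_000, 16)
#   df_db = load_from_jdbc(spark, opts)
```

If you go with ORC instead, `spark.read.orc(path)` takes the place of `spark.read.parquet(path)`. Even with a partitioned JDBC read, throughput is still bounded by what the database server can deliver.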

Tags:

apache-spark

parquet

BlackJack
