I am trying to understand which of the below two would be better option especially in case of Spark environment :
I am working on data pipeline design and trying to understand which of the above two options will result in more optimized solution.
Loading the parquet file directly into a dataframe and access the data is more scalable comparing to reading RDBMS like Oracle through JDBC connector. I handle the data more the 10TB but I prefer ORC format for better performance. I suggest you have to directly read data from files the reason for that is data locality - if your run your Spark executors on the same hosts, where HDFS data nodes located and can effectively read data into memory without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With