I want to use Apache Spark and connect to Vertica via JDBC.
The Vertica database holds 100 million records, and the Spark code runs on another server.
When I run a query in Spark and monitor network usage, the traffic between the two servers is very high.
It seems Spark loads all the data from the target server.
This is my code:
test_df = (spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "pass")
    .load())
test_df.createOrReplaceTempView('tb')
data = spark.sql("select * from tb")
data.show()
When I run this, the result comes back after about 2 minutes of very high network usage.
Does Spark load the entire table from the target database?
JDBC-based databases support push-down queries, so only the relevant rows are read from disk. For example, df.filter("user_id == 2").count first selects only the filtered records in the database and then ships just the count to Spark. So when using JDBC: 1. plan for filters, and 2. partition your table according to your query patterns, then optimise further from the Spark side, e.g.:
val prop = new java.util.Properties
prop.setProperty("driver", "org.postgresql.Driver")
// Split the read across executors on a numeric column:
prop.setProperty("partitionColumn", "user_id")
prop.setProperty("lowerBound", "1")
prop.setProperty("upperBound", "272")
prop.setProperty("numPartitions", "30")
val df = spark.read.jdbc(url, "my_table", prop)
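To see why those bounds matter, here is a minimal, simplified sketch (in Python, not Spark's actual Scala source) of how the JDBC source turns partitionColumn, lowerBound, upperBound, and numPartitions into one WHERE clause per partition, so each executor fetches a disjoint slice of the table:

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Simplified sketch of how Spark's JDBC source splits the
    [lowerBound, upperBound] range into one WHERE clause per partition."""
    stride = upper // num_partitions - lower // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        lower_clause = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper_clause = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower_clause and upper_clause:
            predicates.append(f"{lower_clause} AND {upper_clause}")
        elif lower_clause:
            predicates.append(lower_clause)      # last partition: open-ended
        else:
            # first partition also picks up NULL keys
            predicates.append(f"{upper_clause} or {column} is null")
    return predicates

# With the bounds from the Scala example, 30 partitions each issue their own query:
clauses = partition_predicates("user_id", 1, 272, 30)
print(clauses[0])   # user_id < 10 or user_id is null
print(clauses[-1])  # user_id >= 262
```

Note that rows outside [lowerBound, upperBound] are still read (the first and last clauses are open-ended); the bounds only control how the range is striped across partitions.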
However, most relational DBs are partitioned by specific fields in a tree-like structure, which is not ideal for complex big-data queries. I strongly suggest copying the table from JDBC into a NoSQL store such as Cassandra, MongoDB, or Elasticsearch, or into a file system such as Alluxio or HDFS, to enable scalable, parallel, complex, fast queries. Lastly, you could replace JDBC with AWS Redshift, which should not be hard to implement on the backend/frontend; from the Spark side it is a pain to deal with dependency conflicts, but it will let you run complex queries much faster, since it partitions by column, so you get push-down aggregates on the columns themselves using multiple workers.
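Short of migrating, you can already push a whole aggregate into the database by passing a subquery as the dbtable option, so only the aggregated result crosses the network. A hedged sketch, with a placeholder URL and the table name from the question:

```python
# Hypothetical connection details, mirroring the question's setup.
url = "jdbc:vertica://vertica-host:5433/mydb"

# Instead of loading all 100M rows of my_table and aggregating in Spark,
# hand the database a subquery so only the grouped counts are shipped back:
pushed = "(select user_id, count(*) as cnt from my_table group by user_id) as agg"

options = {
    "url": url,
    "dbtable": pushed,   # Spark treats this as the table; the DB runs the subquery
    "user": "user",
    "password": "pass",
}
# agg_df = spark.read.format("jdbc").options(**options).load()
```

The subquery must be parenthesised and aliased, because Spark embeds the dbtable value into its own generated SELECT.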