I created two clusters on Google Compute Engine, and each cluster reads 100 GB of data.
Cluster I: 1 master (15 GB memory, 250 GB disk), 10 nodes (7.5 GB memory, 200 GB disk each)
Cluster II: 1 master (15 GB memory, 250 GB disk), 150 nodes (1.7 GB memory, 200 GB disk each)
I am using this code to read the file:
val df = spark.read.format("csv")
  .option("inferSchema", true)
  .option("maxColumns", 900000)
  .load("hdfs://master:9000/tmp/test.csv")
The dataset contains 55k rows and 850k columns.
Q1: I did not see a significant increase in reading speed even though I increased the number of machines. What is wrong, and what can I do to make this process faster? Should I increase the number of nodes further?
Q2: For Spark, is it more important to increase the number of machines or the amount of memory? Is there a known performance relationship between nodes, memory, and speed?
Q3: Also, the Hadoop copy and move commands run very slowly, and the data is only 100 GB. How do big companies deal with terabytes of data? I was not able to observe any increase in data reading speed.
Thanks for your answers.
TL;DR Spark SQL (like Spark in general and other projects sharing a similar architecture and design) is primarily designed to handle long and (relatively) narrow data. This is the exact opposite of your data, where the input is wide and (relatively) short.
Remember that, although Spark uses columnar formats for caching, its core processing model handles rows (records) of data. If data is wide but short, this not only limits the ability to distribute the data but, more importantly, leads to the initialization of very large objects. This has a detrimental impact on overall memory management and the garbage collection process (see What is large object for JVM GC).
Using very wide data with Spark SQL causes additional problems: different optimizer components have non-linear complexity in the number of expressions used in a query. This is usually not a problem when data is narrow (< 1K columns), but it can easily become a bottleneck with wider datasets.
Additionally, you're using an input format that is not well suited for high-performance analytics, combined with an expensive reader option (schema inference).
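On the input-format side, the usual remedy is to pay the CSV parsing cost once and keep a binary columnar copy, for example in Parquet. A minimal sketch (the output path is hypothetical, and note that with 850k columns even columnar formats carry heavy per-column metadata, so this helps most after the data has been reshaped):

// One-time conversion: parse the CSV once, then keep a Parquet copy.
// Later reads skip text parsing entirely and can prune unneeded columns.
df.write.mode("overwrite").parquet("hdfs://master:9000/tmp/test.parquet")

// Subsequent jobs read the binary copy instead of the CSV.
val fast = spark.read.parquet("hdfs://master:9000/tmp/test.parquet")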
Depending on what you know about the data and how you plan to process it later, you can try to address some of these issues, for example by converting to long format on read, or by encoding the data directly using some sparse representation (if applicable), as sketched below.
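As an illustration of the long-format idea, here is a minimal sketch. It assumes the first CSV field is a row id, there is no header line, and fields contain no quoted commas (a naive split; all of these are assumptions, not facts about your data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Emit one (id, columnIndex, value) triple per cell instead of building
// 850k-column rows. The result is long and narrow, which Spark handles well.
val longDf = spark.sparkContext
  .textFile("hdfs://master:9000/tmp/test.csv")
  .flatMap { line =>
    val fields = line.split(",", -1)   // naive split: no quoted-field handling
    val id = fields.head
    fields.tail.iterator.zipWithIndex.map { case (v, i) => (id, i, v) }
  }
  .toDF("id", "column", "value")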
Other than that, your best choice is careful memory and GC tuning based on runtime statistics.
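For illustration, a sketch of the kind of knobs involved. The values below are placeholders, not recommendations: in practice you would derive them from the Spark UI and GC logs, and typically pass them to spark-submit rather than set them on the builder.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Placeholder values, to be tuned from runtime statistics:
  .config("spark.executor.memory", "6g")   // leave OS headroom on 7.5 GB nodes
  .config("spark.executor.extraJavaOptions",
          "-XX:+UseG1GC -XX:+PrintGCDetails")  // G1 + GC logging to observe large-object churn
  .getOrCreate()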
Don't use inferSchema; provide a schema manually instead. Spark takes a long time to infer the schema for huge data, because inference requires an extra pass over the input.
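A minimal sketch of providing the schema up front, assuming (hypothetically) that all 850k value columns are doubles named c0 .. c849999; adapt the names and types to the real data:

import org.apache.spark.sql.types._

// Build the StructType programmatically instead of scanning the data.
val schema = StructType(
  (0 until 850000).map(i => StructField(s"c$i", DoubleType, nullable = true))
)

val df = spark.read.format("csv")
  .schema(schema)                 // skips the extra inference pass
  .option("maxColumns", 900000)
  .load("hdfs://master:9000/tmp/test.csv")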