
Apache Spark UI displays incorrect input size of file being ingested

My Java Spark program ingests a 3.7 GB file. When I launch the program and open the Spark UI at localhost:4040, the input size shown for the load stage is 7.3 GB, which is really confusing. Why does the Spark UI report an input size almost double the actual size of the file being ingested?

user836087 asked Oct 21 '18 19:10



1 Answer

The input size:

  • Is estimated.
  • Is not the size of the file you load but the size of the loaded objects, which in general require more memory to store than their serialized form (pointers to the actual objects, overhead of the data structures used to load the data).
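
To make the second point concrete, here is a minimal sketch in plain Java (no Spark dependency) comparing the bytes a line of ASCII text takes in a UTF-8 file with the bytes its characters take when each one is held as a 2-byte UTF-16 `char`. The class name `SizeSketch`, its methods, and the sample record are hypothetical and only for illustration. Whether this is what happens in your job is an assumption: it depends on the JVM version, on how the data is deserialized, and on how Spark arrives at its estimate.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Spark: compares on-disk and in-memory text sizes.
class SizeSketch {

    // Bytes the text occupies when stored in a file encoded as UTF-8.
    static long onDiskBytes(String line) {
        return line.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes for the character data alone when every char is a 2-byte UTF-16 unit.
    // Object headers, references and collection overhead would come on top of this.
    static long utf16CharBytes(String line) {
        return (long) line.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        // Made-up sample record; any ASCII text behaves the same way.
        String line = "2018-10-21,some,ascii,record";
        System.out.println("on disk (UTF-8):   " + onDiskBytes(line) + " bytes");
        System.out.println("as UTF-16 chars:   " + utf16CharBytes(line) + " bytes");
    }
}
```

For ASCII-only text the second number is exactly twice the first, which is in the same ballpark as 3.7 GB versus 7.3 GB, but treat that as a plausible illustration of the overhead rather than a confirmed explanation of the figure in the UI.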
user10553610 answered Nov 15 '22 07:11
