A help for the implementation best practice is needed. The operating environment is as follows:
In a test, I tried to process 160,000 post-processed files by Spark starting with sc.textFile() with glob path, it failed with OutOfMemory exception on the driver process.
What is the best practice to handle this kind of data? Should I use HBase instead of plain files to save post-processed data?
We wrote own loader. It solved our problem with small files in HDFS. It uses Hadoop CombineFileInputFormat. In our case it reduced the number of mappers from 100000 to approx 3000 and made job significantly faster.
https://github.com/RetailRocket/SparkMultiTool
Example:
import ru.retailrocket.spark.multitool.Loaders
val sessions = Loaders.combineTextFile(sc, "file:///test/*")
// or val sessions = Loaders.combineTextFile(sc, conf.weblogs(), size = 256, delim = "\n")
// where size is split size in Megabytes, delim - line break character
println(sessions.count())
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With