I am working with ~120GB of CSV files (from 1GB to 20GB each), on a machine with 220GB of RAM and 36 threads.
I was wondering whether it makes sense to use Spark in standalone mode for this analysis. I really like Spark's natural concurrency, and with PySpark I also get a nice notebook environment to work in.
I want to do joins/aggregation-type work and then run machine learning on the transformed dataset. Python tools like pandas are effectively single-threaded, which seems like a massive waste when all 36 threads could be put to work.
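Roughly the kind of pipeline I have in mind (the file paths, column names, and join key below are just placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Load the CSV files into DataFrames (paths and schemas are placeholders)
orders = spark.read.csv("/data/orders_*.csv", header=True, inferSchema=True)
customers = spark.read.csv("/data/customers.csv", header=True, inferSchema=True)

# Join and aggregate; Spark spreads this work across all available cores
summary = (
    orders.join(customers, on="customer_id", how="inner")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_amount"),
               F.countDistinct("customer_id").alias("n_customers"))
)

# The transformed dataset would then feed into the ML step
summary.show()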
To answer your question: YES, if you only have one node available, especially one as powerful as the machine you describe, it does make sense (as long as it can handle the size of the data).
I would recommend running your application in "local" mode, since you are only using 1 node. When you run ./spark-submit, specify:
--master local[*]
as in:
./spark-submit --master local[*] <your-app> <your-app-args>
This will run the application on the local node using all available cores.
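If you are working from a PySpark notebook rather than spark-submit, the equivalent is to set the master when you build the session; a minimal sketch (the app name is just an example):

from pyspark.sql import SparkSession

# Equivalent of --master local[*]: a single JVM on this machine,
# with one worker thread per available core.
spark = (
    SparkSession.builder
        .master("local[*]")
        .appName("csv-analysis")   # app name is just an example
        .getOrCreate()
)

print(spark.sparkContext.defaultParallelism)   # should report 36 on this machine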
Remember that in your application you must specify the amount of memory you want it to use; by default this is 512m. If you want to take advantage of all of your RAM, you can change this either as a parameter to spark-submit or in your application code when making your SparkConf object. Note that in local mode the executor runs inside the driver JVM, so it is the driver memory setting (--driver-memory or spark.driver.memory) that actually controls the limit.
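For example, a minimal sketch of the in-code option, assuming the session has not been created yet (the 180g figure is only an illustration for a 220GB machine; leave headroom for the OS and the Python worker processes):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# These settings must be in place before the JVM starts, i.e. before the
# first SparkSession is created in this Python process. When launching
# with spark-submit, pass --driver-memory on the command line instead.
conf = (
    SparkConf()
        .setMaster("local[*]")
        .set("spark.driver.memory", "180g")   # illustrative value, not a rule
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()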