 

Does using spark in stand-alone on 1 large computer make sense?

I am working with ~120 GB of CSV files (from 1 GB to 20 GB each) on a machine with 220 GB of RAM and 36 threads.

I was wondering whether it makes sense to use Spark in stand-alone mode for this analysis. I really like Spark's natural concurrency, and with PySpark I also get a nice notebook environment to work in.

I want to do joins/aggregation-type work and then run machine learning on the transformed dataset. Python tools like pandas only use a single thread, which seems like a massive waste when all 36 threads could be put to work.
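A minimal sketch of what that kind of join/aggregation could look like in a PySpark notebook (the file names and columns below are made up purely for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Use every core of the local machine, as discussed in the answer below.
    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Hypothetical CSVs standing in for the real 1-20 GB files.
    orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
    users = spark.read.csv("users.csv", header=True, inferSchema=True)

    # Join, then aggregate; Spark parallelizes this across all local cores.
    summary = (orders.join(users, on="user_id", how="inner")
                     .groupBy("country")
                     .agg(F.sum("amount").alias("total_amount")))
    summary.show()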

asked Mar 16 '23 by anthonybell

1 Answer

To answer your question: yes, if you only have one node available, especially one as powerful as the machine you describe, it does make sense (as long as it can handle the size of the data).

I would recommend running your application in "local" mode, since you are only using 1 node. When you run ./spark-submit, specify:

--master local[*]

as in:

./spark-submit --master local[*] <your-app-name> <your-apps-args>

This will run the application on the local node using all available cores.

Remember that in your application you must specify the amount of executor memory that you want your application to use; by default this is 512m. If you want to take advantage of all of your memory, you can change this either as a parameter to spark-submit or in your application code when creating your SparkConf object.
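As a rough sketch of those two options (the 180g figure is an assumption picked to leave headroom under the 220 GB of RAM mentioned in the question, not a value from this answer):

    # Option 1: pass the memory settings to spark-submit.
    ./spark-submit --master local[*] --driver-memory 180g --executor-memory 180g <your-app-name> <your-apps-args>

    # Option 2: set them on a SparkConf object in the application itself.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .setMaster("local[*]")
            .set("spark.executor.memory", "180g")
            .set("spark.driver.memory", "180g"))
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

One thing to keep in mind: in local mode the work runs inside the driver process, so the driver memory setting is usually the one that matters, and it is safest to pass it on the spark-submit command line rather than in code, since the JVM may already be running by the time the SparkConf is read.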

answered Mar 18 '23 by cnnrznn