I have installed PySpark standalone/locally (on Windows) using
pip install pyspark
I was a bit surprised that I can already run pyspark
on the command line or use it in Jupyter notebooks, and that it does not need a proper Spark installation (e.g., I did not have to do most of the steps in this tutorial: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c ).
Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark as basically a wrapper over Spark. But maybe I am wrong here - can someone explain the relationship between this pip-installed PySpark and a proper Spark installation?
The pip-installed PySpark does bundle Spark itself, so no separate installation is needed for local use. What it does not contain are the tools required to set up your own standalone Spark cluster (for example, it lacks the sbin folder, which contains, e.g., the script to start the history server).

PySpark was released to support collaboration between Apache Spark and Python; it is, in effect, a Python API for Spark, and it lets you interface with Resilient Distributed Datasets (RDDs) from the Python programming language.
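As a minimal sketch of what that looks like in practice (all of the calls below are standard pyspark APIs; the numbers are just illustrative):

```python
# Start a local Spark session and run a tiny RDD computation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()

# Distribute a small Python list as an RDD and square each element.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16]

spark.stop()
```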
Step 1: Install Java. PySpark historically required Java 7 or later and Python 2.6 or later; to run current PySpark applications you need Java 8 or later. Download Java from Oracle, install it on your system, and then set the JAVA_HOME and PATH variables.
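You can set these through the Windows environment variable settings, or, as a minimal sketch, from Python itself before Spark starts (the JDK path below is an assumption; point it at your actual installation):

```python
# Set JAVA_HOME and PATH for the current process before launching Spark.
import os

# Hypothetical JDK location - adjust to wherever Java is installed on your machine.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_201"
os.environ["PATH"] = os.environ["JAVA_HOME"] + r"\bin;" + os.environ["PATH"]
```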
PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you're already familiar with Python and libraries such as Pandas, PySpark is a good tool to learn for creating more scalable analyses and pipelines.
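To make that concrete, here is a minimal sketch of the DataFrame API for someone coming from Pandas (the data and column names are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("pandas-like-demo").getOrCreate()

# A tiny DataFrame; in Pandas terms, think pd.DataFrame(..., columns=["name", "age"]).
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

# Familiar operations: filter rows, then aggregate.
df.filter(df.age > 40).show()
print(df.groupBy().avg("age").first()[0])  # 39.5

spark.stop()
```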
As of Spark 2.2, executing pip install pyspark
will install Spark itself.
If you're going to use PySpark, it's clearly the simplest way to get started.
On my system, Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars.
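If you want to verify the equivalent location in your own environment, a quick check (the printed path will differ per system):

```python
# Print where the pip-installed pyspark keeps its bundled Spark jars.
import os
import pyspark

print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))
```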