 

Can PySpark work without Spark?

I have installed PySpark standalone/locally (on Windows) using

pip install pyspark 

I was a bit surprised I can already run pyspark in command line or use it in Jupyter Notebooks and that it does not need a proper Spark installation (e.g. I did not have to do most of the steps in this tutorial https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c ).
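For example, a minimal local session like this runs straight from the pip install (the app name and the tiny range job are just an arbitrary sanity check; Java still needs to be present on the machine):

from pyspark.sql import SparkSession

# "local[*]" = run Spark inside this Python process on all local cores,
# with no cluster and no separately downloaded Spark distribution
spark = SparkSession.builder.master("local[*]").appName("pip-check").getOrCreate()

print(spark.version)             # version of the Spark bundled with the pip package
print(spark.range(5).collect())  # [Row(id=0), ..., Row(id=4)]

spark.stop()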

Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark being basically a wrapper over Spark. But maybe I am wrong here - can someone explain:

  • what is the exact connection between these two technologies?
  • why is installing PySpark enough to make it run? Does it actually install Spark under the hood? If yes, where?
  • if you install only PySpark, is there something you miss (e.g. I cannot find the sbin folder which contains e.g. script to start history server)
asked Aug 07 '18 by Ferrard


People also ask

Are PySpark and Spark the same?

PySpark was released to support the collaboration of Apache Spark and Python; it is essentially a Python API for Spark. In addition, PySpark helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark from the Python programming language.
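As a rough illustration of that RDD interface (a minimal sketch, not taken from the question or answer), the Python API looks like this:

from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="rdd-demo")

rdd = sc.parallelize([1, 2, 3, 4])         # distribute a local Python list as an RDD
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16]

sc.stop()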

Can I run PySpark without Java?

PySpark requires Java version 7 or later and Python version 2.6 or later.

What is needed to run PySpark?

To run a PySpark application, you need Java 8 or a later version, so download Java from Oracle and install it on your system. After installation, set the JAVA_HOME and PATH variables.
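One way to do that from inside Python rather than system-wide is sketched below; the JDK path shown is a made-up example and has to be replaced with wherever Java is actually installed on your machine:

import os
from pyspark.sql import SparkSession

# Hypothetical JDK location - adjust to your actual Java 8+ install path
os.environ.setdefault("JAVA_HOME", r"C:\Program Files\Java\jdk1.8.0_281")

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)
spark.stop()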

Is PySpark a Spark or Python?

PySpark is the Python API for Apache Spark, an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you're already familiar with Python and libraries such as Pandas, then PySpark is a good language to learn to create more scalable analyses and pipelines.


1 Answer

As of v2.2, executing pip install pyspark will install Spark.

If you're going to use PySpark, it's clearly the simplest way to get started.

On my system, Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars.
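You can confirm what the pip package bundles with a quick inspection (a sketch against whatever your install happens to contain; the sbin check relates to the question's last bullet):

import os
import pyspark

spark_home = os.path.dirname(pyspark.__file__)          # .../site-packages/pyspark
print(spark_home)
print(os.path.isdir(os.path.join(spark_home, "jars")))  # the bundled Spark runtime jars
print(os.path.isdir(os.path.join(spark_home, "sbin")))  # whether the sbin/ scripts are included
print(pyspark.__version__)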

answered Oct 19 '22 by Kirk Broadhurst