I am trying to install PySpark on Google Colab using the code given below, but I am getting the following error.
This code ran successfully once, but it has been throwing this error ever since the notebook restarted. I have even tried running it from a different Google account, but I get the same error.
(Also, is there any way to avoid having to install PySpark every time the notebook restarts?)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
The following line seems to cause the problem, as it cannot find the downloaded file.
!tar xvf spark-2.3.2-bin-hadoop2.7.tgz
I have also tried the following two lines (instead of the two lines above), suggested in a Medium blog post, but it made no difference.
!wget -q http://mirror.its.dal.ca/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xvf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
Any ideas on how to resolve this error and install PySpark on Colab?
I am running PySpark on Colab by just using
!pip install pyspark
and it works fine.
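After the install, a quick sanity check (a minimal sketch; the exact version printed depends on what pip resolved) confirms that PySpark is usable:
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)  # confirm the installed version

# Start a local session and run a trivial query to verify the install
spark = SparkSession.builder.master('local[*]').getOrCreate()
spark.range(5).show()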
Date: 6-09-2020
Step 1: Install PySpark on Google Colab
!pip install pyspark
Step 2: Deal with pandas and Spark DataFrames inside the Spark session
!pip install pyarrow
PyArrow facilitates communication between many components; for example, it lets you read a Parquet file with Python (pandas) and transform it into a Spark DataFrame, or hand data to Falcon Data Visualization or Cassandra, without worrying about conversion (see the sketch after Step 3).
Step 3: Create a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').getOrCreate()
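To tie Steps 2 and 3 together, here is a minimal sketch of the pandas-to-Spark round trip that PyArrow speeds up. The file name data.parquet is a placeholder, and the config key shown is the Spark 3.x name (on Spark 2.x it is spark.sql.execution.arrow.enabled):
import pandas as pd

# Reuse the session from Step 3 and enable Arrow-based transfers
# (Spark 3.x key; on Spark 2.x use 'spark.sql.execution.arrow.enabled')
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')

pdf = pd.read_parquet('data.parquet')  # 'data.parquet' is a placeholder path
sdf = spark.createDataFrame(pdf)       # pandas -> Spark DataFrame, via Arrow
back = sdf.toPandas()                  # Spark -> pandas, via Arrow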
Done ⭐