 

Unable to install PySpark on Google Colab

I am trying to install PySpark on Google Colab using the code given below, but I get the following error:

tar: spark-2.3.2-bin-hadoop2.7.tgz: Cannot open: No such file or directory

tar: Error is not recoverable: exiting now

This code ran successfully once, but it throws this error after the notebook restarts. I have even tried running it from a different Google account, but I get the same error.

(Also, is there any way to avoid installing PySpark every time after a notebook restart?)

Code:

--------------------------------------------------------------------------------------------------------------------------------

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q http://apache.osuosl.org/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz

The following line seems to cause the problem, as it does not find the downloaded file.

!tar xvf spark-2.3.2-bin-hadoop2.7.tgz

I have also tried the following two lines (instead of the two lines above), as suggested in a Medium blog post, but with no better result.

!wget -q http://mirror.its.dal.ca/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

!tar xvf spark-2.4.0-bin-hadoop2.7.tgz

!pip install -q findspark

-------------------------------------------------------------------------------------------------------------------------------

Any ideas how to resolve this error and install PySpark on Colab?

asked Apr 06 '19 by Ankit Sharma


2 Answers

I am running PySpark on Colab just by using

!pip install pyspark

and it works fine.
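A quick way to confirm the install worked is to start a local session and build a tiny DataFrame (a minimal sketch; the app name and sample data are just illustrative):

# After !pip install pyspark, spin up a local SparkSession to verify the install
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("colab-check").getOrCreate()

# Create a small DataFrame and show it to confirm Spark works end to end
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()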

answered Oct 06 '22 by Harmeet


Date: 6-09-2020


Step 1: Install PySpark on Google Colab

!pip install pyspark

Step 2: Work with pandas and Spark DataFrames inside a Spark session

!pip install pyarrow

PyArrow facilitates data exchange between many components, for example reading a Parquet file with Python (pandas) and converting it to a Spark DataFrame, or feeding Falcon Data Visualization or Cassandra, without worrying about conversions.

Step 3: Create a Spark session

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').getOrCreate()
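With pyarrow installed, moving between pandas and Spark DataFrames can look like this (a minimal sketch; the sample data is illustrative, and the Arrow config key shown is the Spark 3.x name):

import pandas as pd

# Enable Arrow-based columnar transfers between pandas and Spark (Spark 3.x config key)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# pandas DataFrame -> Spark DataFrame
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
sdf = spark.createDataFrame(pdf)
sdf.show()

# Spark DataFrame -> pandas DataFrame
pdf_back = sdf.toPandas()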

Done ⭐

answered Oct 06 '22 by Vinay Chaudhari