
Error while installing Spark on Google Colab

I am getting an error while installing Spark on Google Colab. It says:

tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now

These were my steps:

  • !apt-get install openjdk-8-jdk-headless -qq > /dev/null
  • !wget -q http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
  • !tar xf spark-2.2.1-bin-hadoop2.7.tgz
  • !pip install -q findspark


asked Mar 19 '19 by Prasoon Parashar

3 Answers

The problem is the download link you are using to fetch Spark:

http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz

To download Spark reliably, use the Apache archive website (https://archive.apache.org/dist/spark), which keeps every release available; mirror links are removed once a release is superseded.

For example, the following download link from their archive website works fine:

https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
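Since archive URLs follow a predictable pattern, the link for any release can be built programmatically. A minimal sketch (the `spark-<version>-bin-hadoop<hadoop>.tgz` naming scheme is an assumption that holds for recent releases; verify it for the version you want):

```python
def spark_archive_url(spark_version: str, hadoop_version: str) -> str:
    # Apache's archive stores every Spark release at a predictable path:
    # https://archive.apache.org/dist/spark/spark-<v>/spark-<v>-bin-hadoop<h>.tgz
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    return f"https://archive.apache.org/dist/spark/spark-{spark_version}/{name}.tgz"

print(spark_archive_url("3.0.0", "3.2"))
# https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
```

This makes it easy to bump the version in one place instead of editing the URL, the tar command, and SPARK_HOME by hand.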

Here is the complete code to install and set up Java, Spark, and findspark:

# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# extract the Spark archive into the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

# point JAVA_HOME and SPARK_HOME at the installed locations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"


# install findspark using pip
!pip install -q findspark
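Because `wget -q` suppresses all output, a dead link fails silently and the later `tar` step is the first thing to complain, with exactly the "Cannot open: No such file or directory" message from the question. A small sanity check between the two steps catches this earlier (a sketch; the filename matches the archive downloaded above):

```python
import os

def assert_downloaded(path: str) -> None:
    """Fail fast if a quiet download step silently skipped the file."""
    # A missing or zero-byte file means the fetch failed (e.g. a 404
    # from a stale mirror), which tar would otherwise report cryptically.
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        raise FileNotFoundError(f"{path} was not downloaded; check the URL")

# Usage in the notebook, right after the wget cell:
# assert_downloaded("spark-3.0.0-bin-hadoop3.2.tgz")
```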

Python users should also install pyspark:

!pip install pyspark

answered Sep 19 '22 by liedji

This error is caused by the link used in the second line of the code. The following snippet worked for me on Google Colab. Remember to change the Spark version to the latest one and update the SPARK_HOME path accordingly. You can find the latest versions here: https://downloads.apache.org/spark/

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()

answered Sep 18 '22 by ImanB


This is the correct code; I just tested it. (Note that mirror URLs like this one can go stale once the release is superseded, so the archive link from the accepted answer is more durable.)

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://mirrors.viethosting.com/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
answered Sep 18 '22 by Matteo