I am getting an error while installing Spark on Google Colab. It says:

tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
These were my steps
[screenshot of the commands run in the notebook]
The problem is the download link you are using to download Spark:
http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
To download Spark without any problems, you should download it from the Apache archive website (https://archive.apache.org/dist/spark/).
For example, the following download link from their archive website works fine:
https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
Here is the complete code to install and set up Java, Spark, and PySpark:
# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
# extract the Spark archive into the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
# point JAVA_HOME and SPARK_HOME at the installed folders
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
# install findspark using pip
!pip install -q findspark
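The two `os.environ` assignments above matter because findspark (and any Spark process launched later) locates Java and Spark through those variables. A minimal stdlib-only sketch of why this works, showing that a variable set via `os.environ` is inherited by child processes (path taken from the snippet above):

```python
import os
import subprocess
import sys

# Set SPARK_HOME exactly as in the snippet above.
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

# Variables set via os.environ are inherited by child processes,
# which is how findspark and Spark's launcher find the install.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['SPARK_HOME'])"],
    capture_output=True, text=True,
)
print(result.stdout.strip())  # /content/spark-3.0.0-bin-hadoop3.2
```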
Python users should also install PySpark with the following command:
!pip install pyspark
This error is caused by the link used in the second line of the code. The following snippet worked for me on Google Colab. Do not forget to change the Spark version to the latest one and update the SPARK_HOME path accordingly. You can find the latest versions here: https://downloads.apache.org/spark/
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()
This is the correct code. I just tested it.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://mirrors.viethosting.com/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
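As in the earlier answers, the extracted folder still has to be exposed through JAVA_HOME and SPARK_HOME before calling `findspark.init()`. A sketch with paths matching this answer's versions:

```python
import os

# Paths match the Java package and Spark version installed above.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

print(os.environ["SPARK_HOME"])  # /content/spark-2.4.4-bin-hadoop2.7
```

After this, `import findspark; findspark.init()` should pick up the install as shown in the other answers.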