
spark-submit fails to detect modules installed with pip

I have a Python script with the following third-party dependencies:

import boto3
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArchiveLoadFailed
import requests
import botocore
from requests_file import FileAdapter
....

I installed the dependencies using pip and confirmed that they were correctly installed by running pip list. But when I try to submit the job to Spark, I receive the following error:

ImportError: No module named 'boto3'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:395)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

The "No module named" error occurs not only for boto3 but also for other modules.


I tried the following things:

  1. Added SparkContext.addPyFile(".zip files")
  2. Used spark-submit --py-files
  3. Reinstalled pip
  4. Made sure the PYTHONPATH environment variable was set (export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH) and ran pip install py4j
  5. Ran the script with python directly instead of spark-submit
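For reference, the usual shape of the --py-files approach (a sketch; the package names come from the question, and my_job.py is a hypothetical script name) is to install the dependencies into a local directory, zip them, and ship the archive with the job:

```shell
# Install the dependencies into a local directory (not the system
# site-packages) so they can be bundled.
pip install -t deps boto3 warcio requests requests_file

# Zip the directory contents so the archive root contains the packages.
cd deps && zip -r ../deps.zip . && cd ..

# Ship the bundle to the executors along with the job
# (my_job.py is a placeholder for your script).
spark-submit --py-files deps.zip my_job.py
```

This only works for pure-Python packages; wheels with compiled extensions generally cannot be shipped this way.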

Software information:

  • Python version: 3.4.3
  • Spark version: 2.2.0
  • Platform: AWS EMR, Amazon Linux 2017.09
asked Oct 11 '25 by COLD ICE


2 Answers

Before running spark-submit, open a Python shell and try importing the modules there. Also check which Python interpreter (check the python path) opens by default.

If you can successfully import these modules in the Python shell (the same Python version you are using with spark-submit), check the following:

In which mode are you submitting the application? Try standalone mode, or if running on YARN, try client mode. Also try adding export PYSPARK_PYTHON=(your python path).
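A quick way to run that diagnostic (a sketch; the module names are taken from the question) is to print which interpreter is running and probe each module without importing it:

```python
import importlib.util
import sys

# spark-submit may launch a different interpreter than your
# interactive shell -- this shows which one is actually running.
print(sys.executable)

# find_spec() reports whether a module is importable from this
# interpreter without actually importing it.
for name in ["boto3", "warcio", "requests"]:
    spec = importlib.util.find_spec(name)
    print(name, "found" if spec else "MISSING")
```

If a module prints MISSING here but pip list shows it, pip installed it into a different interpreter's site-packages than the one Spark is using.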

answered Oct 14 '25 by joshi.n


All of the checks mentioned above came back fine, but setting PYSPARK_PYTHON is what solved the issue for me.
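For completeness, a minimal sketch of that fix (the interpreter path is an example; use the path of the interpreter whose site-packages actually contains the modules, and my_job.py is a placeholder):

```shell
# Point Spark at the interpreter where pip installed boto3 and the
# other modules (example path -- substitute your own).
export PYSPARK_PYTHON=/usr/bin/python3

# The variable must be set in the environment that launches the job.
spark-submit my_job.py
```

On a cluster, the variable has to reach the executors as well, e.g. via spark-env.sh or the spark.pyspark.python configuration property.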

answered Oct 14 '25 by user2744408


