I am running Spark 1.3 in standalone mode in a Cloudera environment. I can run PySpark from an IPython notebook, but as soon as I add a second worker node my code stops running and returns an error. I am fairly sure this is because modules on my master are not visible to the worker node. I tried importing numpy but it didn't work, even though numpy is installed on my worker through Anaconda. I have Anaconda installed on both master and worker in the same way.
Following Josh Rosen's advice, I made sure the libraries are installed on the worker nodes:
https://groups.google.com/forum/#!topic/spark-users/We_F8vlxvq0
However, I am still running into issues, including the fact that my worker does not recognize the command abs, which is standard in Python 2.6.
The code I am running is from this post:
https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
def isprime(n):
    """
    Check if integer n is a prime.
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts with 3 and only needs to go up to the square root of n
    # for all odd numbers
    for x in range(3, int(n**0.5) + 1, 2):
        if n % x == 0:
            return False
    return True
# Create an RDD of numbers from 0 to 1,000,000
nums = sc.parallelize(xrange(1000000))
# Compute the number of primes in the RDD
print nums.filter(isprime).count()
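To confirm whether numpy is actually visible to the executors, the only rough check I have come up with is to attempt the import inside a small Spark job. This is just a sketch (check_numpy is a throwaway helper of mine, and it assumes the SparkContext sc from the notebook):

def check_numpy(_):
    # Attempt the import on the executor and report the version (or None).
    try:
        import numpy
        return numpy.__version__
    except ImportError:
        return None

# Spread the probe over several partitions so it is likely to hit every worker.
print sc.parallelize(range(8), 8).map(check_numpy).distinct().collect()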
I often use the Anaconda distribution with PySpark as well, and I find it useful to set the PYSPARK_PYTHON variable to point at the Python binary inside the Anaconda distribution. I've found that otherwise I get lots of strange errors. You can check which Python is being used by running rdd.map(lambda x: sys.executable).distinct().collect(). I suspect it's not pointing to the correct location.
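Spelled out, that check might look something like the snippet below. This is only a sketch: it assumes the SparkContext is available as sc, and it simply compares the interpreter the driver is using with whatever the executors report.

import sys

# The Python binary the driver (i.e. the notebook kernel) is running under.
print "driver:   ", sys.executable

# The Python binaries the executors are running under.
rdd = sc.parallelize(range(8), 8)
print "executors:", rdd.map(lambda x: sys.executable).distinct().collect()

# If these differ, or the executor path is not your Anaconda binary, then
# PYSPARK_PYTHON is probably not set (or not set consistently) on the workers.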
In any case, I recommend wrapping the configuration of your path and environment variables in a script. I use the following.
import os
import sys

def configure_spark(spark_home=None, pyspark_python=None):
    spark_home = spark_home or "/path/to/default/spark/home"
    os.environ['SPARK_HOME'] = spark_home

    # Add the PySpark directories to the Python path:
    sys.path.insert(1, os.path.join(spark_home, 'python'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))

    # If PySpark isn't specified, use currently running Python binary:
    pyspark_python = pyspark_python or sys.executable
    os.environ['PYSPARK_PYTHON'] = pyspark_python
When you point to your Anaconda binary, you should also be able to import all the packages installed in its site-packages directory. This technique should work for conda environments as well.
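For example, you might call it at the top of the notebook along these lines. The paths and master URL below are placeholders rather than values from your setup, so substitute wherever Spark and Anaconda actually live on your nodes:

configure_spark(
    spark_home="/opt/spark-1.3.0",              # placeholder Spark home
    pyspark_python="/opt/anaconda/bin/python",  # placeholder Anaconda binary
)

# Import pyspark only after SPARK_HOME and sys.path have been configured.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("spark://your-master:7077").setAppName("primes")
sc = SparkContext(conf=conf)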