 

How to run independent transformations in parallel using PySpark?

I am trying to run two functions that perform completely independent transformations on a single RDD in parallel using PySpark. What are some ways to do this?

from multiprocessing import Process

from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

def doXTransforms(sampleRDD):
    pass  # (X transforms)

def doYTransforms(sampleRDD):
    pass  # (Y transforms)

if __name__ == "__main__":
    sc = SparkContext(appName="parallelTransforms")
    sqlContext = SQLContext(sc)
    hive_context = HiveContext(sc)

    rows_rdd = hive_context.sql("select * from tables.X_table")

    # Attempt to run both transformations in separate processes
    p1 = Process(target=doXTransforms, args=(rows_rdd,))
    p1.start()
    p2 = Process(target=doYTransforms, args=(rows_rdd,))
    p2.start()
    p1.join()
    p2.join()
    sc.stop()

This does not work, and I now understand why it won't (the SparkContext cannot be shared with forked worker processes). But is there an alternative way to make this work? Specifically, are there any Python-Spark specific solutions?

asked Jun 27 '16 by preitam ojha



1 Answer

Just use threads and make sure that the cluster has enough resources to process both tasks at the same time.

from threading import Thread
import time

def process(rdd, f):
    # Artificial per-record delay so both jobs stay on the cluster
    # long enough to visibly overlap.
    def delay(x):
        time.sleep(1)
        return f(x)
    return rdd.map(delay).sum()


# Use half the default parallelism per job so the two jobs can share
# the cluster (assumes an existing SparkContext sc).
rdd = sc.parallelize(range(100), int(sc.defaultParallelism / 2))

t1 = Thread(target=process, args=(rdd, lambda x: x * 2))
t2 = Thread(target=process, args=(rdd, lambda x: x + 1))
t1.start(); t2.start()
t1.join(); t2.join()

Arguably this is not that useful in practice, but otherwise it should work just fine.
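
Note that bare threads discard the return values. If you need the results back in the driver, one option is to submit the same jobs through concurrent.futures instead. A minimal sketch, assuming the same sc, rdd and process function from the snippet above:

from concurrent.futures import ThreadPoolExecutor

# Each submit() runs process() in a driver-side thread, which in turn
# triggers a Spark job; the futures resolve to the computed sums.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(process, rdd, lambda x: x * 2)
    f2 = pool.submit(process, rdd, lambda x: x + 1)
    print(f1.result(), f2.result())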

You can further use in-application scheduling with the FAIR scheduler and scheduler pools for better control over the execution strategy.
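
A minimal sketch of wiring that up (the pool name pool1 and the fairscheduler.xml path are illustrative placeholders):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("parallelTransforms")
        .set("spark.scheduler.mode", "FAIR")  # enable FAIR job scheduling
        .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml"))  # optional pool definitions
sc = SparkContext(conf=conf)

# Jobs submitted from the current thread are assigned to this pool.
sc.setLocalProperty("spark.scheduler.pool", "pool1")

Each thread can set its own pool before kicking off its jobs, so the two transformations can get, for example, equal shares of the cluster.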

You can also try pyspark-asyncactions (disclaimer: the author of this answer is also the author of the package), which provides a set of wrappers around the Spark API and concurrent.futures:

import asyncactions  # patches *Async methods onto RDD / DataFrame
import concurrent.futures

# Both counts are submitted immediately and run concurrently.
f1 = rdd.filter(lambda x: x % 3 == 0).countAsync()
f2 = rdd.filter(lambda x: x % 11 == 0).countAsync()

[x.result() for x in concurrent.futures.as_completed([f1, f2])]
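
Each *Async method submits the action from a background thread and returns a standard concurrent.futures.Future, so both counts run as separate, concurrently scheduled Spark jobs while the driver thread stays free until result() is called.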
answered Sep 22 '22 by zero323