How to apply a large Python model to a PySpark DataFrame?

I have:

  • A large DataFrame (Parquet format, 100,000,000 rows, ~4.5 TB) containing the feature data
  • Several huge ML models (each one takes 5-15 GB of RAM)
  • A Spark cluster (AWS EMR); a typical node has 8 CPUs and 32 GB of RAM, and the configuration can be changed if needed.

I want to apply the models using PySpark, but I always get weird errors like:

  • OOM
  • Random timeouts (a node doesn't return any result) -> the node is killed by the YARN manager

I have typically used code like

def apply_model(partition):
    model = load(...)  # load the model inside this function to avoid serialization issues
    for row in partition:
        yield model.infer(row)

or

def apply_model(partition):
    model = load(...)  # load the model inside this function to avoid serialization issues
    yield from model.infer(partition)

and apply that using

df.select(...).rdd.mapPartitions(apply_model)

I can't broadcast the model because of serialization issues.

The question: how do I apply a big Python (or any other non-JVM-based) model to a Spark DataFrame while avoiding Spark exceptions?

asked May 15 '19 by Ivan Menshikh

1 Answer

Here are some additional suggestions that could help improve the performance of your job:

  • The first change I would make is to reduce the partition size. If I understood correctly, you currently have 4.5 TB of input data. That means that with 1,000 partitions you would end up sending 4.5 GB per partition to each executor! This is quite large; instead, I would try to keep the partition size between 250-500 MB. Roughly, in your case that means ~10,000 partitions (4.5 TB / 500 MB), as shown in the sketch after this list.

  • Increase parallelism by adding more executors. That would increase data locality and consequently reduce execution time. Ideally you should have 5 cores per executor and, if possible, two executors per cluster node. The number of cores per executor should not be higher than 5, since that can cause I/O bottlenecks (when/if disk storage is used).

  • As for memory, I think the suggestions from @rluta are more than sufficient. In general, too large a value for executor memory has a negative effect on Java GC time, so an upper limit of around 10 GB is a good value for spark.executor.memory.
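
For illustration, here is a minimal sketch of how these suggestions could fit around your existing apply_model function. The input path, the load(...) call and model.infer are your own placeholders; the executor settings and the partition count are assumptions derived from the numbers above, not exact values for your cluster:

# Submit with an executor layout along the lines described above
# (illustrative values, to be tuned for your EMR node types):
#
#   spark-submit \
#     --conf spark.executor.cores=5 \
#     --conf spark.executor.memory=10g \
#     my_job.py

def apply_model(partition):
    model = load(...)  # load the model once per partition, on the executor
    yield from model.infer(partition)

df = spark.read.parquet("s3://.../features/")  # hypothetical input path

# ~10,000 partitions keeps each partition around 250-500 MB for ~4.5 TB of input
predictions = df.select(...).repartition(10000).rdd.mapPartitions(apply_model)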

answered Sep 28 '22 by abiratsis