 

Is there a way to set a minimum batch size for a pandas_udf in PySpark?

I am using a pandas_udf to apply a machine learning model on my Spark cluster, and I am interested in predefining the minimum number of records sent via Arrow to the UDF.

I followed the Databricks tutorial for the bulk of the UDF: https://docs.databricks.com/applications/deep-learning/inference/resnet-model-inference-tensorflow.html

From the tutorial, I set the Spark conf to have a maximum batch size and enabled Arrow. I can easily set the maximum batch size, but is there a similar method for setting a minimum batch size that the UDF will handle?

spark = SparkSession.builder.appName('App').getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', PyArrowBatchSize)

I am running Spark version 2.4.3 and Python 3.6.0.
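For context, the scalar pandas_udf pattern from that tutorial looks roughly like the sketch below; the function name and the stand-in computation are placeholders rather than anything from my actual job, where a real ML model would be called on each batch.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Sketch of the scalar pandas_udf inference pattern (Spark 2.4 syntax).
@pandas_udf("double", PandasUDFType.SCALAR)
def predict_batch(features):
    # Each invocation receives one Arrow batch (up to maxRecordsPerBatch
    # rows) as a pandas Series and must return a Series of the same length.
    return features.astype("float64") * 2.0  # stand-in for model.predict(...)

# usage: df = df.withColumn("prediction", predict_batch(df["feature_col"]))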

Asked by Jlanday
1 Answer

There is no documented way to set a minimum batch size in Spark, but in this case "max" is a bit misleading. It behaves more like a "batch size before the remainder".

Ex: If you have 100132 rows in your dataset, and your maxRecordsPerBatch is 10000, then you will get 10 batches of size 10000 and one batch of size 132 as the remainder. (If your data is split across multiple partitions, each partition produces its own remainder batch, depending on how the rows are distributed.)

So the smallest batch size you will see in practice is whatever the remainder happens to be; every other batch will contain exactly maxRecordsPerBatch rows.
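To see this for yourself, here is a small experiment (the setup below is an assumed illustration, not from the question) that tags every row with the size of the Arrow batch it arrived in, making the remainder batch visible:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("BatchSizeDemo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

# 100132 rows in a single partition so the remainder is easy to see.
df = spark.range(100132).repartition(1)

@pandas_udf("long", PandasUDFType.SCALAR)
def batch_size(col):
    # Every row in a batch gets tagged with the size of that batch.
    return pd.Series([len(col)] * len(col))

# Expect 10 batches of 10000 rows and 1 batch of 132 rows.
df.withColumn("batch_size", batch_size(df["id"])).groupBy("batch_size").count().show()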

Answered by K.S.