StackOverflowError while processing several columns with a UDF

I have a DataFrame with many columns of str type, and I want to apply a function to all of those columns without renaming them or adding new columns. I tried a for-in loop that executes withColumn (see the example below), but when I run the code it usually throws a StackOverflowError (it rarely works). This DataFrame is not big at all; it has only ~15,000 records.

# df is a DataFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def lowerCase(string):
    return string.strip().lower()

lowerCaseUDF = udf(lowerCase, StringType())

# Overwrite each string column in place with its trimmed, lower-cased value
for (columnName, kind) in df.dtypes:
    if kind == "string":
        df = df.withColumn(columnName, lowerCaseUDF(df[columnName]))

df.select("Tipo_unidad").distinct().show()

The complete error is very long, so I pasted only a few lines; you can find the full trace here: Complete Trace

Py4JJavaError: An error occurred while calling o516.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 38, worker2.mcbo.mood.com.ve): java.lang.StackOverflowError
        at java.io.ObjectInputStream$BlockDataInputStream.readByte(ObjectInputStream.java:2774)

I think this problem occurs because the code launches many jobs (one per string column). Could you show me an alternative, or tell me what I am doing wrong?

asked Jan 28 '16 by Alberto Bonsanto


People also ask

Can UDF return multiple columns?

A UDF can return only a single column at a time.
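That single column can still carry multiple values as a struct, which you then unpack. A minimal sketch, assuming a SparkSession named spark; the split_name helper and the first/last field names are made up for illustration:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType

# The UDF returns one struct column with two fields
schema = StructType([
    StructField("first", StringType()),
    StructField("last", StringType()),
])

def split_name(full):
    first, _, last = full.partition(" ")
    return (first, last)

split_udf = udf(split_name, schema)

df2 = spark.createDataFrame([("Jane Doe",)], ["name"])
df2 = df2.withColumn("parts", split_udf(col("name")))
# Unpack the struct fields into separate columns
df2.select("name", col("parts.first"), col("parts.last")).show()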

How do I apply a function to multiple columns in PySpark?

You can use reduce, for loops, or list comprehensions to apply PySpark functions to multiple columns in a DataFrame. Using iterators to apply the same operation to multiple columns is vital for maintaining a DRY codebase.
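As a minimal sketch of those three idioms, here is a single trim applied to every string column of a DataFrame df (built-in functions only, no UDF):

from functools import reduce
from pyspark.sql.functions import col, trim

string_cols = [c for c, t in df.dtypes if t == "string"]

# reduce: fold withColumn over the list of columns
df_r = reduce(lambda acc, c: acc.withColumn(c, trim(col(c))), string_cols, df)

# for loop: the same thing written imperatively
df_f = df
for c in string_cols:
    df_f = df_f.withColumn(c, trim(col(c)))

# list comprehension: build all expressions first, then one select
df_l = df.select(*[trim(col(c)).alias(c) if t == "string" else col(c)
                   for c, t in df.dtypes])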

Can PySpark UDF return DataFrame?

A PySpark UDF is a User Defined Function used to create a reusable function in Spark. Once created, a UDF can be reused across multiple DataFrames and in SQL (after registering). The default return type of udf() is StringType.
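A minimal sketch of that reuse (Spark 2.x API), assuming a SparkSession named spark and a DataFrame df with a name column; the view and function names are made up:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def normalize(s):
    return s.strip().lower()

# For the DataFrame API: wrap once, reuse on any DataFrame
normalize_udf = udf(normalize, StringType())
df.select(normalize_udf(df["name"]).alias("name"))

# For SQL: register under a name, then call it in queries
spark.udf.register("normalize", normalize, StringType())
df.createOrReplaceTempView("people")
spark.sql("SELECT normalize(name) AS name FROM people").show()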

How do you add multiple columns in PySpark?

You can add multiple columns to a Spark DataFrame in several ways; to add a known set of columns, you can simply chain withColumn() or use a single select().
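A minimal sketch of both styles, assuming df has a name column; the new column names are made up:

from pyspark.sql.functions import col, length, lit

# Chaining withColumn(): one call per new column
df_a = (df
        .withColumn("source", lit("survey"))
        .withColumn("name_len", length(col("name"))))

# Equivalent single select(): all new columns in one projection
df_b = df.select("*",
                 lit("survey").alias("source"),
                 length(col("name")).alias("name_len"))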


1 Answer

Try something like this:

from pyspark.sql.functions import col, lower, trim

# One expression per column: lower-case and trim the string columns,
# pass every other column through unchanged
exprs = [
    lower(trim(col(c))).alias(c) if t == "string" else col(c)
    for (c, t) in df.dtypes
]

df.select(*exprs)
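Applied to the original example, assign the result back and the final show() from the question works unchanged:

df = df.select(*exprs)
df.select("Tipo_unidad").distinct().show()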

This approach has two main advantages over your current solution:

  • it requires only a single projection (no growing lineage, which is most likely responsible for the StackOverflowError) instead of one projection per string column.
  • it operates directly on the internal representation without passing data to Python (BatchPythonProcessing). You can verify the difference in the query plans with the sketch below.
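To compare the plans, a sketch assuming df_loop is the DataFrame built with the withColumn loop from the question (exact output varies by Spark version):

# Loop version: one Project node stacked per string column
df_loop.explain()

# Single-select version: one Project node over all columns
df.select(*exprs).explain()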
answered Sep 28 '22 by zero323