I have a Spark DataFrame with the following structure. The bodyText_token column holds the tokens (a processed set of words), and I have a nested list of defined keywords:
root
|-- id: string (nullable = true)
|-- body: string (nullable = true)
|-- bodyText_token: array (nullable = true)
keyword_list = [
    ['union', 'workers', 'strike', 'pay', 'rally', 'free', 'immigration'],
    ['farmer', 'plants', 'fruits', 'workers'],
    ['outside', 'field', 'party', 'clothes', 'fashions'],
]
I need to count how many tokens fall under each keyword list and add the result as a new column to the existing DataFrame.
E.g., if tokens = ["become", "farmer", "rally", "workers", "student"],
the result will be -> [2, 2, 0] ("rally" and "workers" match the first list, "farmer" and "workers" the second, and nothing matches the third).
The following function worked as expected.
def label_maker_topic(tokens, topic_words):
    twt_list = []
    for i in range(len(topic_words)):
        count = 0
        for tkn in tokens:
            if tkn in topic_words[i]:
                count += 1
        twt_list.append(count)
    return twt_list
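A quick plain-Python check against the example above:

tokens = ["become", "farmer", "rally", "workers", "student"]
print(label_maker_topic(tokens, keyword_list))  # [2, 2, 0]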
I used a udf under withColumn to call the function, and I get an error. I think it's about passing an external list to a UDF. Is there a way to pass both the external list and the DataFrame column to a UDF and add a new column to my DataFrame?
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

topicWord = udf(label_maker_topic, StringType())
# fails: keyword_list is a plain Python list, not a Column
myDF = myDF.withColumn("topic_word_count", topicWord(myDF.bodyText_token, keyword_list))
A UDF operates on individual records (or, in the broadest case, on grouped data if it is a user-defined aggregate function, a UDAF), and it can return only a single column at a time. If a UDF needs data from more than one DataFrame, you have to join the DataFrames first so that all the columns you want to use are available in one row.
PySpark's withColumn() can also change the value of an existing column: pass an existing column name as the first argument and the value to assign as the second. In Spark SQL it is the workhorse for deriving a column from other columns, changing a column's current value, converting a column's datatype, or creating a new column.
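For instance, a minimal sketch of both uses, with the column names from the schema above:

from pyspark.sql.functions import lit, size

# overwrite an existing column: same name, new value
myDF = myDF.withColumn("body", lit(""))

# create a new column derived from an existing one
myDF = myDF.withColumn("token_count", size(myDF.bodyText_token))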
The cleanest solution is to pass the additional arguments using a closure:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, IntegerType

def make_topic_word(topic_words):
    # topic_words is captured by the closure, so the UDF itself only
    # takes the column argument; declaring the return type gives back
    # a real array of ints instead of a string
    return udf(lambda c: label_maker_topic(c, topic_words),
               ArrayType(IntegerType()))

df = sc.parallelize([(["union"],)]).toDF(["tokens"])

df.withColumn("topics", make_topic_word(keyword_list)(col("tokens"))).show()
This doesn't require any changes in keyword_list or in the function you wrap with the UDF. You can also use this method to pass an arbitrary object, for example a list of sets for efficient lookups, as sketched below.
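A small sketch of that idea, reusing make_topic_word from above; sets make the per-token membership test O(1) on average instead of O(n):

# label_maker_topic's `tkn in topic_words[i]` check works unchanged on sets
keyword_sets = [set(ks) for ks in keyword_list]

df.withColumn("topics", make_topic_word(keyword_sets)(col("tokens"))).show()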
If you want to use your current UDF and pass topic_words directly, you'll have to convert it to a column literal first:
from pyspark.sql.functions import array, lit

# build a nested array<array<string>> literal from the Python list
ks_lit = array(*[array(*[lit(k) for k in ks]) for ks in keyword_list])

df.withColumn("ad", topicWord(col("tokens"), ks_lit)).show()
Depending on your data and requirements, there are alternative, more efficient solutions that don't require UDFs (explode + aggregate + collapse) or lookups (hashing + vector operations). A rough sketch of the explode-based route follows.
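This is only a sketch, assuming a SparkSession named spark and that the DataFrame has an id column to group back on; the pivot produces one count column per keyword list rather than a single array column, and the inner join drops documents with no matches at all:

from pyspark.sql.functions import col, explode

# one row per (topic_id, keyword)
keywords_df = spark.createDataFrame(
    [(i, kw) for i, ks in enumerate(keyword_list) for kw in ks],
    ["topic_id", "keyword"],
)

topic_counts = (df
    .withColumn("token", explode(col("bodyText_token")))  # one row per token
    .join(keywords_df, col("token") == col("keyword"))    # keep matching tokens
    .groupBy("id", "topic_id")
    .count()                                              # aggregate per topic
    .groupBy("id")
    .pivot("topic_id", list(range(len(keyword_list))))
    .sum("count")                                         # collapse: one column per topic
    .na.fill(0))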