I've seen questions posted here that are similar to mine, but I'm still getting errors in my code when trying some accepted answers. I have a dataframe with three columns: created_at, text, and words (which is just a tokenized version of text).
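Something with this shape (the example values here are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# made-up rows with the same schema: created_at, text, words (tokenized text)
data = [
    ('2017-01-01', 'i love Starbucks', ['i', 'love', 'starbucks']),
    ('2017-01-02', 'dell laptops rock', ['dell', 'laptops', 'rock']),
    ('2017-01-03', 'help me I am stuck!', ['help', 'me', 'i', 'am', 'stuck!']),
]
small_DF = spark.createDataFrame(data, ['created_at', 'text', 'words'])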
Now, I have a list of companies, test_list = ['Starbucks', 'Nvidia', 'IBM', 'Dell'], and I only want to keep the rows where the text includes any of those words.
I've tried a few things, but with no success:
small_DF.filter(lambda x: any(word in x.text for word in test_list))
Returns : TypeError: condition should be string or Column
I tried creating a function and using foreach():
def filters(line):
    return any(word in line for word in test_list)

df = df.foreach(filters)
That turns df into a NoneType.
And the last one I tried:
df = df.filter(col("text").isin(test_list))
This returns an empty dataframe, which is nice in that I get no error, but it's obviously not what I want.
Your .filter returns an error because it is the SQL filter function on DataFrames (which expects a BooleanType() column), not the filter function on RDDs. If you want to use the RDD one, just add .rdd:
small_DF.rdd.filter(lambda x: any(word in x.text for word in test_list))
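Note that this gives you back an RDD of Row objects rather than a DataFrame; if you need a DataFrame again, you can convert it afterwards. A sketch, assuming the small_DF and test_list from the question:

# filter on the RDD, then rebuild a DataFrame reusing the original schema
filtered_rdd = small_DF.rdd.filter(lambda x: any(word in x.text for word in test_list))
filtered_df = filtered_rdd.toDF(small_DF.schema)

Keep in mind that the substring match here is case-sensitive, so 'Dell' would not match a lowercase 'dell' in the text.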
You don't have to use a UDF; you can use regular expressions in PySpark with .rlike on your column "text":
from pyspark.sql import HiveContext
import pyspark.sql.functions as psf

hc = HiveContext(sc)

# lowercase the search terms once so the match is case-insensitive
words = [x.lower() for x in ['starbucks', 'Nvidia', 'IBM', 'Dell']]
data = [['i love Starbucks'], ['dell laptops rocks'], ['help me I am stuck!']]
df = hc.createDataFrame(data).toDF('text')

# lowercase the column and match any term as a regex alternation
df.filter(psf.lower(df.text).rlike('|'.join(words)))
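One thing to watch (my addition, not part of the original answer): joining the terms with | also matches substrings, so 'dell' would hit a word like 'modelling'. If that matters, anchoring each term on word boundaries should help, something like:

import re

# \b anchors restrict each term to whole words; re.escape guards against
# regex metacharacters in the company names
pattern = '|'.join(r'\b{}\b'.format(re.escape(w)) for w in words)
df.filter(psf.lower(df.text).rlike(pattern))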
I think filter isn't working because it expects a boolean output from the lambda function, and isin only checks whether the whole column value equals one of the list elements; you are trying to compare a list of words to a list of words. Here is something I tried that may give you some direction:
# prepare some test data
words = [x.lower() for x in ['starbucks', 'Nvidia', 'IBM', 'Dell']]
data = [['i love Starbucks'], ['dell laptops rocks'], ['help me I am stuck!']]
df = spark.createDataFrame(data).toDF('text')

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def intersect(row):
    # lowercase each word, then check for overlap with the company list
    row = [x.lower() for x in row.split()]
    return bool(set(row).intersection(set(words)))

filterUDF = udf(intersect, BooleanType())
df.where(filterUDF(df.text)).show()
Output:
+------------------+
| text|
+------------------+
| i love Starbucks|
|dell laptops rocks|
+------------------+
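As a side note, since the question already has a tokenized words column, Spark 2.4+ can express the same check without a UDF using arrays_overlap. A sketch, assuming the question's small_DF and that its words column holds lowercase tokens:

import pyspark.sql.functions as F

# true when the tokenized column shares at least one element with the literal array
companies = F.array(*[F.lit(w) for w in words])
small_DF.filter(F.arrays_overlap(F.col('words'), companies)).show()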