I have the following commands in Spark:
data = sqlContext.sql("select column1, column2, column3 from table_name")
words = sc.textFile("words.txt")
words.txt has a bunch of words, and data has three string columns taken from table_name.
Now I want to filter out rows in data (a Spark DataFrame) whenever any word from words.txt occurs in any of the three columns of data. For example, if words.txt has a word such as gon, and any of the three columns contains values such as bygone, gone, etc., I want to filter out that row.
I've tried the following:
data.filter(~(data['column1'].like('%gon%') | data['column2'].like('%gon%') | data['column3'].like('%gon%'))).toPandas()
This works for one word, but I want to check all the words from words.txt and remove the matching rows. Is there a way to do this?
I am new to PySpark. Any suggestions would be helpful.
You may read the words from words.txt and build a regex pattern like this:

(?s)^(?=.*word1)(?=.*word2)(?=.*word3)

where (?s) allows . to match any symbol including line breaks, ^ matches the string start position, and each (?=...) lookahead requires the presence of one of the words in the string.
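For reference, here is a minimal sketch of building such a pattern in PySpark, reusing the sc from the question and assuming words.txt holds one word per line (re.escape keeps any regex metacharacters in the words literal). One caveat: the lookahead pattern matches only when all the words are present in a value; if the goal is to drop a row when any single word occurs, as the question asks, a plain alternation such as word1|word2|word3 is the simpler choice:

import re

# Collect the words to the driver (assumes words.txt is small)
word_list = [w.strip() for w in sc.textFile("words.txt").collect() if w.strip()]

# All-words pattern, as described above: every lookahead must succeed
rx_all = "(?s)^" + "".join("(?=.*{0})".format(re.escape(w)) for w in word_list)

# Any-word pattern: matches when the value contains at least one of the words
rx_any = "|".join(re.escape(w) for w in word_list)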
So, if you place the regex into an rx variable, the filter will look like:

data.filter(~(data['column1'].rlike(rx) | data['column2'].rlike(rx) | data['column3'].rlike(rx))).toPandas()

where the regex pattern is passed to the rlike method, which is similar to like but performs the match based on a regular expression.
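Putting it together, a minimal end-to-end sketch, assuming the sqlContext and sc from the question and using the any-word alternation since a row should be removed when any word appears in any column:

import re

data = sqlContext.sql("select column1, column2, column3 from table_name")

# Build an alternation of escaped words: word1|word2|word3
word_list = [w.strip() for w in sc.textFile("words.txt").collect() if w.strip()]
rx = "|".join(re.escape(w) for w in word_list)

# Keep only the rows where none of the three columns matches any word
clean = data.filter(~(data['column1'].rlike(rx) |
                      data['column2'].rlike(rx) |
                      data['column3'].rlike(rx)))
clean.show()

Because rlike searches anywhere in the value, a word like gon also matches bygone and gone, which is the behavior the question asks for.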