Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter rows in Spark dataframe from the words in RDD

I have the following commands in spark,

data = sqlContext.sql("select column1, column2, column3 from table_name")

words = sc.textFile("words.txt")

words.txt has a bunch of words and data has three string columns taken from table_name.

Now I want to filter out rows in data (spark dataframe) whenever the word pattern of each word from words.txt occurs in any of the three columns of data.

For example if words.txt has word such as gon and if any of the three columns of data contains values as bygone, gone etc, I want to filter out that row.

I've tried the following:

data.filter(~data['column1'].like('%gon%') | data['column2'].like('%gon%') | data['column3'].like('%gon%')).toPandas()

This works for one word. But I want to check all the words from the words.txt and remove it. Is there a way to do this?

I am new to PySpark. Any suggestions would be helpful.

like image 856
haimen Avatar asked Aug 21 '16 19:08

haimen


People also ask

How do you filter rows in a data frame?

You can use df[df["Courses"] == 'Spark'] to filter rows by a condition in pandas DataFrame. Not that this expression returns a new DataFrame with selected rows. You can also write the above statement with a variable.

How do you filter strings in PySpark?

In Spark & PySpark, contains() function is used to match a column value contains in a literal string (matches on part of the string), this is mostly used to filter rows on DataFrame.


1 Answers

You may read the words from the words.txt, and build a regex pattern like this:

(?s)^(?=.*word1)(?=.*word2)(?=.*word3)

etc. where (?s) allows . to match any symbol, ^ matches the string start position and then each (?=...) lookahead requires the presence of each word in the string.

So, if you place the regex into a rx var, it will look like:

data.filter(~data['column1'].rlike(rx) | data['column2'].rlike(rx) | data['column3'].rlike(rx)).toPandas()

where the regex pattern is passed to rlike method that is similar to like but performs a search based on a regex expression.

like image 167
Wiktor Stribiżew Avatar answered Oct 19 '22 06:10

Wiktor Stribiżew