Count occurrences of a list of substrings in a pyspark df column

Tags: python, pyspark, pyspark-sql, hive

I have a PySpark DataFrame with a column that contains a long string. I want to count how many times the substrings from a given list occur in that column and store the result in a new column.

Input:          
       ID    History

       1     USA|UK|IND|DEN|MAL|SWE|AUS
       2     USA|UK|PAK|NOR
       3     NOR|NZE
       4     IND|PAK|NOR

 lst=['USA','IND','DEN']


Output :
       ID    History                      Count

       1     USA|UK|IND|DEN|MAL|SWE|AUS    3
       2     USA|UK|PAK|NOR                1
       3     NOR|NZE                       0
       4     IND|PAK|NOR                   1

asked Jul 16 '19 05:07

Faliha Zikra

2 Answers

# Importing requisite packages and creating a DataFrame
from pyspark.sql.functions import split, col, size, regexp_replace
values = [(1,'USA|UK|IND|DEN|MAL|SWE|AUS'),(2,'USA|UK|PAK|NOR'),(3,'NOR|NZE'),(4,'IND|PAK|NOR')]
df = sqlContext.createDataFrame(values,['ID','History'])
df.show(truncate=False)
+---+--------------------------+
|ID |History                   |
+---+--------------------------+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|
|2  |USA|UK|PAK|NOR            |
|3  |NOR|NZE                   |
|4  |IND|PAK|NOR               |
+---+--------------------------+

The idea is to split the string based on these three delimiters: lst=['USA','IND','DEN'] and then count the number of substrings produced.

For example, the string USA|UK|IND|DEN|MAL|SWE|AUS is split into four pieces: an empty string, |UK|, |, and |MAL|SWE|AUS. Three delimiter matches produce four substrings, so 4 - 1 = 3 is the number of times the listed strings appear in the column value.
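The split-and-subtract arithmetic can be checked outside Spark. The following is a minimal plain-Python sketch, not part of the original answer; it uses the standard re module instead of Spark functions, and the helper name count_by_split is hypothetical.

```python
import re

lst = ['USA', 'IND', 'DEN']

# Join the escaped substrings into one alternation pattern: USA|IND|DEN
pattern = '|'.join(re.escape(s) for s in lst)

def count_by_split(history):
    # N delimiter matches always yield N + 1 pieces, so subtract 1
    return len(re.split(pattern, history)) - 1

rows = [
    'USA|UK|IND|DEN|MAL|SWE|AUS',
    'USA|UK|PAK|NOR',
    'NOR|NZE',
    'IND|PAK|NOR',
]

pieces = re.split(pattern, rows[0])
print(pieces)  # ['', '|UK|', '|', '|MAL|SWE|AUS']

counts = [count_by_split(h) for h in rows]
print(counts)  # [3, 1, 0, 1]
```

Note that this counts substring matches rather than whole pipe-separated fields, so a hypothetical value such as USAX would also count as a match for USA.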

I am not sure whether multi-character delimiters are supported in Spark, so as a first step we replace each of the three substrings in the list ['USA','IND','DEN'] with a flag/dummy value, %. Any other character that does not occur in the data would work as well. The following code does the replacement:

df = df.withColumn('History_X',col('History'))
lst=['USA','IND','DEN']
for i in lst:
    df = df.withColumn('History_X', regexp_replace(col('History_X'), i, '%'))
df.show(truncate=False)
+---+--------------------------+--------------------+
|ID |History                   |History_X           |
+---+--------------------------+--------------------+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|%|UK|%|%|MAL|SWE|AUS|
|2  |USA|UK|PAK|NOR            |%|UK|PAK|NOR        |
|3  |NOR|NZE                   |NOR|NZE             |
|4  |IND|PAK|NOR               |%|PAK|NOR           |
+---+--------------------------+--------------------+

Finally, we split History_X on the % delimiter, count the resulting substrings with the size function, and subtract 1.

df = df.withColumn('Count', size(split(col('History_X'), "%")) - 1).drop('History_X')
df.show(truncate=False)
+---+--------------------------+-----+
|ID |History                   |Count|
+---+--------------------------+-----+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|3    |
|2  |USA|UK|PAK|NOR            |1    |
|3  |NOR|NZE                   |0    |
|4  |IND|PAK|NOR               |1    |
+---+--------------------------+-----+

answered Sep 20 '22 16:09

cph_sto

If you are using Spark 2.4+, you can try the Spark SQL higher-order function filter():

from pyspark.sql import functions as F

>>> df.show(5,0)
+---+--------------------------+
|ID |History                   |
+---+--------------------------+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|
|2  |USA|UK|PAK|NOR            |
|3  |NOR|NZE                   |
|4  |IND|PAK|NOR               |
+---+--------------------------+

# Raw string so that \| reaches Spark as a regex escape, not a Python escape
df_new = df.withColumn('data', F.split('History', r'\|')) \
           .withColumn('cnt', F.expr('size(filter(data, x -> x in ("USA", "IND", "DEN")))'))

>>> df_new.show(5,0)
+---+--------------------------+----------------------------------+---+
|ID |History                   |data                              |cnt|
+---+--------------------------+----------------------------------+---+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|[USA, UK, IND, DEN, MAL, SWE, AUS]|3  |
|2  |USA|UK|PAK|NOR            |[USA, UK, PAK, NOR]               |1  |
|3  |NOR|NZE                   |[NOR, NZE]                        |0  |
|4  |IND|PAK|NOR               |[IND, PAK, NOR]                   |1  |
+---+--------------------------+----------------------------------+---+

Here we first split the field History into an array column called data and then use the filter function:

filter(data, x -> x in ("USA", "IND", "DEN"))

to retrieve only the array elements that satisfy the condition IN ("USA", "IND", "DEN"). After that, we count the elements of the resulting array with the size() function.
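If the substrings live in a Python list, as in the question, the SQL expression does not have to be typed by hand. The following is a minimal sketch of building it from the list; the name count_expr is hypothetical, and it assumes the list items contain no double quotes or other characters that need escaping in a SQL string literal.

```python
lst = ["USA", "IND", "DEN"]

# Quote each item as a SQL string literal: "USA", "IND", "DEN"
# Assumption: no item contains a double quote or a backslash.
in_list = ", ".join('"{}"'.format(s) for s in lst)

# Same expression as in the answer, with the IN list filled in from lst
count_expr = "size(filter(data, x -> x in ({})))".format(in_list)

print(count_expr)
# size(filter(data, x -> x in ("USA", "IND", "DEN")))
```

The resulting string can then be passed to F.expr(count_expr) in place of the hard-coded expression.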

UPDATE: Added another way, using array_contains(), which should work on older Spark versions:

lst = ["USA", "IND", "DEN"]

# Python's built-in sum() adds the 0/1 Columns together, one per list item
df_new = df.withColumn('data', F.split('History', r'\|')) \
           .withColumn('Count', sum([F.when(F.array_contains('data',e),1).otherwise(0) for e in lst]))

Note: duplicate entries in the array are not counted separately; this method only counts how many distinct country codes from the list are present.
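The difference between the two methods only shows up when a code is repeated. The following plain-Python sketch mirrors the logic of each method on a hypothetical row (the sample data is made up for illustration and does not come from the question):

```python
lst = ["USA", "IND", "DEN"]

# Hypothetical row in which USA appears twice
data = ["USA", "UK", "USA", "IND"]

# filter() + size(): keeps every matching element, so duplicates count
occurrences = len([x for x in data if x in lst])

# array_contains() per list item: each item contributes at most 1
unique_matches = sum(1 if e in data else 0 for e in lst)

print(occurrences)     # 3
print(unique_matches)  # 2
```

On the data in the question no code is repeated within a row, so both methods give the same counts.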

answered Sep 21 '22 16:09

jxc
