 

Spark Dataframe - Python - count substring in string

I have a Spark dataframe with a column (assigned_products) of type string that contains values such as the following:

"POWER BI PRO+Power BI (free)+AUDIO CONFERENCING+OFFICE 365 ENTERPRISE E5 WITHOUT AUDIO CONFERENCING"

I would like to count the occurrences of + in the string and return that value in a new column.

I tried the following, but it keeps returning errors.

from pyspark.sql.functions import col

DF.withColumn('Number_Products_Assigned', col("assigned_products").count("+"))

I'm running my code in Azure Databricks on a cluster running Apache Spark 2.3.1.

Joshua Hernandez asked Jul 20 '18 20:07

People also ask

How do you check if a substring is present in a string in Pyspark?

The contains() method checks whether a DataFrame column string contains a string specified as an argument (matches on part of the string). Returns true if the string exists and false if not.

How do you count strings in Pyspark?

In PySpark, you can use distinct().count() on a DataFrame, or the countDistinct() SQL function, to get the distinct count. distinct() eliminates duplicate records (matching all columns of a Row) from the DataFrame, and count() returns the number of records in the DataFrame.

How do I count the number of records in a Spark data frame?

To get the number of rows from the PySpark DataFrame use the count() function. This function returns the total number of rows from the DataFrame. By calling this function it triggers all transformations on this DataFrame to execute.


1 Answer

replace() replaces each occurrence of the substring with an empty string, so we can count the occurrences by comparing the string's length before and after the replacement, as follows:

Using SparkSQL:

SELECT length(x) - length(replace(x, '+')) AS substring_count
FROM (SELECT 'abc+def+ghi++aaa' AS x) -- Sample data; replace() with no third argument removes '+'

Output:

substring_count
---------------
4

Using PySpark functions:

import pyspark.sql.functions as F

df1 = spark.sql("select 'abc+def+ghi++aaa' as x")  # Sample data
df1.withColumn('substring_count',
               F.length(F.col('x'))
               - F.length(F.regexp_replace(F.col('x'), r'\+', ''))
               ).show()

Output:

+----------------+---------------+
|               x|substring_count|
+----------------+---------------+
|abc+def+ghi++aaa|              4|
+----------------+---------------+
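As a plain-Python sanity check, the same length-difference identity holds for ordinary strings, and it agrees with str.count() (the method the question originally reached for, which exists on Python strings but not on Spark Columns):

```python
# Plain-Python sanity check of the length-difference trick used above.
# str.count() works on ordinary Python strings, not on Spark Columns,
# which is why the original withColumn attempt raised errors.
s = "abc+def+ghi++aaa"

by_length_diff = len(s) - len(s.replace("+", ""))
by_str_count = s.count("+")

print(by_length_diff, by_str_count)  # both are 4
```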
Kent Pawar answered Sep 19 '22 17:09