 

Spark Dataframe - Python - count substring in string

I have a Spark dataframe with a column (assigned_products) of type string that contains values such as the following:

"POWER BI PRO+Power BI (free)+AUDIO CONFERENCING+OFFICE 365 ENTERPRISE E5 WITHOUT AUDIO CONFERENCING"

I would like to count the occurrences of + in the string and return that value in a new column.

I tried the following, but it keeps returning errors.

from pyspark.sql.functions import col

DF.withColumn('Number_Products_Assigned', col("assigned_products").count("+"))

I'm running my code in Azure Databricks on a cluster running Apache Spark 2.3.1.

Joshua Hernandez asked Jul 20 '18 20:07

People also ask

How do you check if a substring is present in a string in Pyspark?

The contains() method checks whether a DataFrame column string contains a string specified as an argument (matches on part of the string). Returns true if the string exists and false if not.

How do you count strings in Pyspark?

In PySpark, you can use distinct().count() on a DataFrame, or the countDistinct() SQL function, to get the distinct count. distinct() eliminates duplicate records (matching all columns of a Row) from the DataFrame, and count() returns the number of records in the DataFrame.

How do I count the number of records in a Spark data frame?

To get the number of rows from the PySpark DataFrame use the count() function. This function returns the total number of rows from the DataFrame. By calling this function it triggers all transformations on this DataFrame to execute.


1 Answer

replace() replaces each occurrence of the substring with an empty string, so we can count the occurrences by comparing the string's length before and after the replacement, as follows:

Using SparkSQL:

SELECT length(x) - length(replace(x, '+')) AS substring_count
FROM (SELECT 'abc+def+ghi++aaa' AS x) -- Sample data; replace() with no third argument removes '+'

Output:

substring_count
---------------
4

Using PySpark functions:

import pyspark.sql.functions as F

df1 = spark.sql("select 'abc+def+ghi++aaa' as x")  # Sample data
df1.withColumn('substring_count',
               F.length(F.col('x'))
               - F.length(F.regexp_replace(F.col('x'), r'\+', ''))
               ).show()

Output:

+----------------+---------------+
|               x|substring_count|
+----------------+---------------+
|abc+def+ghi++aaa|              4|
+----------------+---------------+
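As a plain-Python sanity check, the same length-difference identity holds for ordinary strings, and it agrees with str.count() (the method the question originally reached for, which exists on Python strings but not on Spark Columns):

```python
# Plain-Python sanity check of the length-difference trick used above.
# str.count() works on ordinary Python strings, not on Spark Columns,
# which is why the original withColumn attempt raised errors.
s = "abc+def+ghi++aaa"

by_length_diff = len(s) - len(s.replace("+", ""))
by_str_count = s.count("+")

print(by_length_diff, by_str_count)  # both are 4
```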
Kent Pawar answered Sep 19 '22 17:09