Count number of words in a spark dataframe

How can I count the number of words in a column of a Spark DataFrame without using the SQL REPLACE() function? Below are the code and input I am working with; the REPLACE() function does not work.

from pyspark.sql import SparkSession
my_spark = SparkSession \
    .builder \
    .appName("Python Spark SQL example") \
    .enableHiveSupport() \
    .getOrCreate()

parqFileName = 'gs://caserta-pyspark-eval/train.pqt'
tuesdayDF = my_spark.read.parquet(parqFileName)

tuesdayDF.createOrReplaceTempView("parquetFile")
tuesdaycrimes = my_spark.sql("SELECT LENGTH(Address) - LENGTH(REPLACE(Address, ' ', '')) + 1 FROM parquetFile")

# show() prints the table itself and returns None, so no print() is needed
tuesdaycrimes.show()


+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|              Dates|      Category|            Descript|DayOfWeek|PdDistrict|    Resolution|             Address|          X|        Y|
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|2015-05-14 03:53:00|      WARRANTS|      WARRANT ARREST|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:53:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:33:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|VANNESS AV / GREE...| -122.42436|37.800415|


asked Feb 22 '18 12:02

Hrishikesh Sarma

2 Answers

There are a number of ways to count words using PySpark DataFrame functions, depending on what you are looking for.

Create Example Data

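from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Reuse the active session or start one (the original snippet relied on a predefined sqlCtx)
spark = SparkSession.builder.getOrCreate()

data = [
    ("2015-05-14 03:53:00", "WARRANT ARREST"),
    ("2015-05-14 03:53:00", "TRAFFIC VIOLATION"),
    ("2015-05-14 03:33:00", "TRAFFIC VIOLATION")
]

df = spark.createDataFrame(data, ["Dates", "Description"])
df.show()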

In this example, we will count the words in the Description column.

Count in each row

If you want the number of words in the specified column for each row, create a new column using withColumn() and do the following:

Use pyspark.sql.functions.split() to break the string into a list
Use pyspark.sql.functions.size() to count the length of the list

For example:

df = df.withColumn('wordCount', f.size(f.split(f.col('Description'), ' ')))
df.show()
#+-------------------+-----------------+---------+
#|              Dates|      Description|wordCount|
#+-------------------+-----------------+---------+
#|2015-05-14 03:53:00|   WARRANT ARREST|        2|
#|2015-05-14 03:53:00|TRAFFIC VIOLATION|        2|
#|2015-05-14 03:33:00|TRAFFIC VIOLATION|        2|
#+-------------------+-----------------+---------+
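One caveat worth checking on real data: splitting on a single space produces an empty token for every extra space, so doubled, leading, or trailing spaces inflate the count. The plain-Python sketch below (no Spark needed; the sample strings are made up for illustration) mimics that behaviour and compares it with trimming first and then splitting on runs of whitespace. In PySpark the second variant would correspond to something like f.size(f.split(f.trim(f.col('Description')), r'\s+')), which is a suggestion and not part of the original answer.

```python
import re

def count_single_space(text):
    # Mirrors size(split(col, ' ')): every single space is a separator,
    # so consecutive spaces yield empty tokens that are still counted.
    return len(text.split(" "))

def count_whitespace_runs(text):
    # Trim first, then split on runs of whitespace; blank input counts as 0 words.
    stripped = text.strip()
    return len(re.split(r"\s+", stripped)) if stripped else 0

# Hypothetical inputs: clean, doubled space, leading/trailing spaces
samples = ["WARRANT ARREST", "WARRANT  ARREST", " TRAFFIC VIOLATION "]
for s in samples:
    print(repr(s), count_single_space(s), count_whitespace_runs(s))
```

If the column is known to contain exactly one space between words and no padding, both approaches agree and the simpler single-space split is enough.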

Sum word count over all rows

If you want the total number of words in the column across the entire DataFrame, use pyspark.sql.functions.sum() on the wordCount column created above:

df.select(f.sum('wordCount')).collect()
#[Row(sum(wordCount)=6)]

Count occurrence of each word

If you want the count of each word across the entire DataFrame, use split() and pyspark.sql.functions.explode(), followed by groupBy() and count().

df.withColumn('word', f.explode(f.split(f.col('Description'), ' ')))\
    .groupBy('word')\
    .count()\
    .sort('count', ascending=False)\
    .show()
#+---------+-----+
#|     word|count|
#+---------+-----+
#|  TRAFFIC|    2|
#|VIOLATION|    2|
#|  WARRANT|    1|
#|   ARREST|    1|
#+---------+-----+
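To see what the explode/groupBy/count pipeline computes, here is a plain-Python equivalent on the same three descriptions, using collections.Counter. It is only an illustration of the logic on a small in-memory list, not a replacement for the distributed Spark version.

```python
from collections import Counter

descriptions = ["WARRANT ARREST", "TRAFFIC VIOLATION", "TRAFFIC VIOLATION"]

# "explode": flatten each row's split() result into one stream of words
words = [word for text in descriptions for word in text.split(" ")]

# "groupBy('word').count()": tally occurrences of each distinct word
word_counts = Counter(words)

# "sort('count', ascending=False)": most frequent words first
for word, count in word_counts.most_common():
    print(word, count)
```

The tallies add up to the same total as the summed per-row word counts, which is a quick sanity check when comparing the two approaches.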

answered Sep 19 '22 11:09

pault

You can do this using just the split() and size() functions from the PySpark API. For example:

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([['this is a sample address'], ['another address']])\
    .select(F.size(F.split(F.col("_1"), " "))).show()

Output:
+------------------+
|size(split(_1,  ))|
+------------------+
|                 5|
|                 2|
+------------------+
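As a rough check that split-and-size can stand in for the LENGTH(...) - LENGTH(REPLACE(...)) + 1 expression from the question, the plain-Python sketch below evaluates both formulas on a few strings. The helper names are hypothetical, and the code only mirrors the arithmetic; it does not run Spark. In Spark SQL the equivalent expression would be SIZE(SPLIT(Address, ' ')).

```python
def count_by_space_formula(text):
    # LENGTH(Address) - LENGTH(REPLACE(Address, ' ', '')) + 1
    # i.e. the number of spaces plus one
    return len(text) - len(text.replace(" ", "")) + 1

def count_by_split(text):
    # SIZE(SPLIT(Address, ' ')): the number of pieces between single spaces
    return len(text.split(" "))

for address in ["this is a sample address", "another address", "OAK ST / LAGUNA ST"]:
    # Both formulas count "spaces + 1", so they always agree
    assert count_by_space_formula(address) == count_by_split(address)
    print(address, count_by_split(address))
```

Note that both formulas treat the "/" in an address such as "OAK ST / LAGUNA ST" as a word, since they only look at spaces.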

answered Sep 19 '22 11:09

Rakesh Kumar

Tags: python, apache-spark, apache-spark-sql, pyspark
