 

Count the number of words in a Spark DataFrame

How can we find the number of words in a column of a Spark DataFrame without using the SQL REPLACE() function? Below is the code and input I am working with, but the replace() function does not work.

from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("Python Spark SQL example") \
    .enableHiveSupport() \
    .getOrCreate()

parqFileName = 'gs://caserta-pyspark-eval/train.pqt'
tuesdayDF = my_spark.read.parquet(parqFileName)

tuesdayDF.createOrReplaceTempView("parquetFile")
# the session variable is my_spark, not spark
tuesdaycrimes = my_spark.sql("SELECT LENGTH(Address) - LENGTH(REPLACE(Address, ' ', '')) + 1 FROM parquetFile")

tuesdaycrimes.show()  # show() prints the DataFrame itself; wrapping it in print() just prints None


+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|              Dates|      Category|            Descript|DayOfWeek|PdDistrict|    Resolution|             Address|          X|        Y|
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|2015-05-14 03:53:00|      WARRANTS|      WARRANT ARREST|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:53:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:33:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|VANNESS AV / GREE...| -122.42436|37.800415|
asked Feb 22 '18 by Hrishikesh Sarma

People also ask

How do you count strings in PySpark?

In PySpark, you can use distinct().count() on a DataFrame, or the countDistinct() SQL function, to get a distinct count. distinct() eliminates duplicate records (matching all columns of a Row) from the DataFrame, and count() returns the number of records in the DataFrame.
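
For instance, a minimal sketch (assuming an active SparkSession named spark; the column name and values here are made up for illustration):

import pyspark.sql.functions as f
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["letter"])
df.distinct().count()                        # 2 -- distinct rows, then count
df.select(f.countDistinct("letter")).show()  # count(DISTINCT letter) = 2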

How do I count the number of records in a Spark data frame?

To get the number of rows from a PySpark DataFrame, use the count() function. It returns the total number of rows in the DataFrame.
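
A one-line sketch, reusing the hypothetical df from the snippet above:

df.count()  # 3 -- total number of rows, returned as a plain int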

Is count a Spark action?

Yes. count() is an action in PySpark: it returns the number of rows in the DataFrame, and calling it actually triggers execution of the query, rather than just adding another step to the lazily evaluated plan.
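
A short illustration of the laziness, again with the hypothetical df from above:

filtered = df.filter(df.letter == "a")  # transformation: lazy, no job runs yet
filtered.count()                        # action: triggers execution, returns 2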


2 Answers

There are a number of ways to count the words using pyspark DataFrame functions, depending on what you are looking for.

Create Example Data

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ("2015-05-14 03:53:00", "WARRANT ARREST"),
    ("2015-05-14 03:53:00", "TRAFFIC VIOLATION"),
    ("2015-05-14 03:33:00", "TRAFFIC VIOLATION")
]

df = spark.createDataFrame(data, ["Dates", "Description"])
df.show()

In this example, we will count the words in the Description column.

Count in each row

If you want the count of words in the specified column for each row, you can create a new column using withColumn() and do the following:

  • Use pyspark.sql.functions.split() to break the string into a list
  • Use pyspark.sql.functions.size() to count the length of the list

For example:

df = df.withColumn('wordCount', f.size(f.split(f.col('Description'), ' ')))
df.show()
#+-------------------+-----------------+---------+
#|              Dates|      Description|wordCount|
#+-------------------+-----------------+---------+
#|2015-05-14 03:53:00|   WARRANT ARREST|        2|
#|2015-05-14 03:53:00|TRAFFIC VIOLATION|        2|
#|2015-05-14 03:33:00|TRAFFIC VIOLATION|        2|
#+-------------------+-----------------+---------+
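
Note that splitting on a single space treats consecutive spaces as extra words. Since split() accepts a regular expression, a variant that splits on runs of whitespace is sketched below (it assumes values have no leading or trailing whitespace, which would otherwise produce an empty token):

# split on runs of whitespace so repeated spaces are not counted as extra words
df = df.withColumn('wordCount', f.size(f.split(f.col('Description'), '\\s+')))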

Sum word count over all rows

If you wanted to count the total number of words in the column across the entire DataFrame, you can use pyspark.sql.functions.sum():

df.select(f.sum('wordCount')).collect() 
#[Row(sum(wordCount)=6)]

Count occurrence of each word

If you want the count of each word across the entire DataFrame, you can use split() and pyspark.sql.functions.explode(), followed by a groupBy and count().

df.withColumn('word', f.explode(f.split(f.col('Description'), ' ')))\
    .groupBy('word')\
    .count()\
    .sort('count', ascending=False)\
    .show()
#+---------+-----+
#|     word|count|
#+---------+-----+
#|  TRAFFIC|    2|
#|VIOLATION|    2|
#|  WARRANT|    1|
#|   ARREST|    1|
#+---------+-----+
answered Sep 19 '22 by pault

You can do it using just the split and size functions of the pyspark API (example below, assuming an active SparkSession named spark):

import pyspark.sql.functions as F

spark.createDataFrame([['this is a sample address'], ['another address']])\
    .select(F.size(F.split(F.col("_1"), " "))).show()

Output:
+------------------+
|size(split(_1,  ))|
+------------------+
|                 5|
|                 2|
+------------------+
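
The auto-generated column name in the output is a bit unwieldy; a small variant of the same snippet (same import and session assumed) gives it a readable name with alias():

spark.createDataFrame([['this is a sample address'], ['another address']])\
    .select(F.size(F.split(F.col("_1"), " ")).alias("wordCount")).show()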
answered Sep 19 '22 by Rakesh Kumar