Count particular characters within a column using Spark Dataframe API

I have a column with bit strings in a Spark DataFrame df. The column contains strings of the format:

10001010000000100000000000000000
10001010000000100000000100000000

Is there a simple and effective way to create a new column "no_of_ones" that counts the frequency of ones using the DataFrame API? With RDDs I can use map(lambda x: x.count('1')) (pyspark). Additionally, how can I retrieve a list with the positions of the ones?

asked Nov 07 '17 by Andi Anderle

2 Answers

One way I can think of is to remove all zeroes and then take the length of the remaining string.

df.show
+--------------------+
|          bytestring|
+--------------------+
|10001010000000100...|
|10001010000000100...|
+--------------------+


df.withColumn("no_of_ones" , length(regexp_replace($"bytestring", "0", "")) ).show
+--------------------+----------+
|          bytestring|no_of_ones|
+--------------------+----------+
|10001010000000100...|         4|
|10001010000000100...|         5|
+--------------------+----------+
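
For the positions of the ones, one possible sketch is to pair each character with its index using posexplode and keep only the '1's (the output column name one_positions is made up):

import org.apache.spark.sql.functions._

// Split the bit string into characters, pair each character with its
// index via posexplode, keep only the '1's, and collect the indices.
df.select($"bytestring", posexplode(split($"bytestring", "")))
  .where($"col" === "1")
  .groupBy($"bytestring")
  .agg(sort_array(collect_list($"pos")).as("one_positions"))
  .show(false)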
answered Sep 23 '22 by philantrovert

In general, when you cannot find what you need among the predefined functions of (py)spark SQL, you can write a user-defined function (UDF) that does whatever you want (see UDF).

Note that in your case, a well-coded UDF in Scala or Java would probably be faster than the regex solution, because you would not need to build a new string or compile a regex (a simple loop over the characters would do). However, it would probably be much slower in pyspark, because executing Python code on an executor always hurts performance significantly.
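
A minimal sketch of such a UDF in Scala, assuming the bytestring column from the answer above (the name countOnes is made up):

import org.apache.spark.sql.functions.udf

// Count the '1' characters with a plain character traversal:
// no new string is built and no regex is compiled.
val countOnes = udf { (s: String) => s.count(_ == '1') }

df.withColumn("no_of_ones", countOnes($"bytestring")).show

The counting here runs entirely on the JVM, which is what avoids the Python serialization overhead mentioned above.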

answered Sep 21 '22 by Oli