Count particular characters within a column using Spark Dataframe API

I have a column with bit strings in a Spark DataFrame df. The column contains strings of the format:

10001010000000100000000000000000
10001010000000100000000100000000

Is there a simple and effective way to create a new column "no_of_ones" that counts the frequency of ones using the DataFrame API? With RDDs I can use map(lambda x: x.count('1')) (pyspark). Additionally, how can I retrieve a list with the positions of the ones?

asked Nov 07 '17 by Andi Anderle

2 Answers

One way I can think of is to remove all zeroes and then take the length of the remaining string.

df.show
+--------------------+
|          bytestring|
+--------------------+
|10001010000000100...|
|10001010000000100...|
+--------------------+


df.withColumn("no_of_ones" , length(regexp_replace($"bytestring", "0", "")) ).show
+--------------------+----------+
|          bytestring|no_of_ones|
+--------------------+----------+
|10001010000000100...|         4|
|10001010000000100...|         5|
+--------------------+----------+
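
For the positions of the ones, one possible sketch is to pair each character with its index using posexplode and keep only the '1's (the output column name one_positions is made up):

import org.apache.spark.sql.functions._

// Split the bit string into characters, pair each character with its
// index via posexplode, keep only the '1's, and collect the indices.
df.select($"bytestring", posexplode(split($"bytestring", "")))
  .where($"col" === "1")
  .groupBy($"bytestring")
  .agg(sort_array(collect_list($"pos")).as("one_positions"))
  .show(false)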
answered Sep 23 '22 by philantrovert

In general, when you cannot find what you need among the predefined functions of (py)spark SQL, you can write a user-defined function (UDF) that does whatever you want (see UDF).

Note that in your case, a well-coded UDF in Scala or Java would probably be faster than the regex solution, because you would not need to build a new string or compile a regex (a simple loop over the characters would do). However, it would probably be much slower in pyspark, because executing Python code on an executor always hurts performance significantly.
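
A minimal sketch of such a UDF in Scala, assuming the bytestring column from the answer above (the name countOnes is made up):

import org.apache.spark.sql.functions.udf

// Count the '1' characters with a plain character traversal:
// no new string is built and no regex is compiled.
val countOnes = udf { (s: String) => s.count(_ == '1') }

df.withColumn("no_of_ones", countOnes($"bytestring")).show

The counting here runs entirely on the JVM, which is what avoids the Python serialization overhead mentioned above.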

answered Sep 21 '22 by Oli