I have a column with bits in a Spark Dataframe df. The columns are strings of format:
10001010000000100000000000000000
10001010000000100000000100000000
Is there a simple and effective way to create a new column "no_of_ones" and count the frequency of ones using a Dataframe? Using RDDs I can map(lambda x:x.count('1')) (pyspark).
Additionally, how can I retrieve a list with the position of the ones?
One way I can think of is to remove all zeroes and then count the length of the field.
df.show
+--------------------+
|          bytestring|
+--------------------+
|10001010000000100...|
|10001010000000100...|
+--------------------+
df.withColumn("no_of_ones" , length(regexp_replace($"bytestring", "0", "")) ).show
+--------------------+----------+
|          bytestring|no_of_ones|
+--------------------+----------+
|10001010000000100...|         4|
|10001010000000100...|         5|
+--------------------+----------+
In general, when you cannot find what you need in the predefined function of (py)spark SQL, you can write a user defined function (UDF) that does whatever you want (see UDF).
Note that in your case, a well coded udf would probably be faster than the regex solution in scala or java because you would not need to instantiate a new string and compile a regex (a for loop would do). However it would probably be much slower in pyspark because executing python code on an executor always severely damages the performance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With