I am trying to add a column to a dataframe that indicates when two different values are both found in a nested array:
 expr1 = array_contains(df.child_list, "value1")
 expr2 = array_contains(df.child_list, "value2")
I got it to work with the ampersand operator:
 df.select(...).withColumn("boolTest", expr1 & expr2)
Then I tried to replace & with bitwiseAND, because I eventually want to AND together a list of these expressions dynamically.
This fails with an error:
 df.select(...).withColumn("boolTest", expr1.bitwiseAND(expr2))
 cannot resolve ..... due to data type mismatch: '(array_contains(c1.`child_list`, 'value1') & 
array_contains(c1.`child_list`, 'value2'))' requires integral type, 
not boolean;;
What's the distinction and what am I doing wrong?
In PySpark, the & and | operators on BooleanType columns are logical AND and OR. In other words, they take True/False as input and output True/False.
The bitwiseAND function does a bit-by-bit AND of two numeric values, so it takes two integral columns (for example integers) and returns their bitwise AND. Your array_contains expressions are boolean, which is why bitwiseAND rejects them with "requires integral type, not boolean".
Here is an example of each:
from pyspark.sql.types import StructType, StructField, BooleanType, IntegerType
from pyspark.sql.functions import col
schema = StructType([   
  StructField("b1", BooleanType()), 
  StructField("b2", BooleanType()),
  StructField("int1", IntegerType()), 
  StructField("int2", IntegerType())
])
data = [
  (True, True, 0x01, 0x01), 
  (True, False, 0xFF, 0xA), 
  (False, False, 0x01, 0x00)
]
df = sqlContext.createDataFrame(sc.parallelize(data), schema)
df2 = df.withColumn("logical", df.b1 & df.b2) \
        .withColumn("bitwise", df.int1.bitwiseAND(df.int2))
df2.show()
+-----+-----+----+----+-------+-------+
|   b1|   b2|int1|int2|logical|bitwise|
+-----+-----+----+----+-------+-------+
| true| true|   1|   1|   true|      1|
| true|false| 255|  10|  false|     10|
|false|false|   1|   0|  false|      0|
+-----+-----+----+----+-------+-------+
>>> df2.printSchema()
root
 |-- b1: boolean (nullable = true)
 |-- b2: boolean (nullable = true)
 |-- int1: integer (nullable = true)
 |-- int2: integer (nullable = true)
 |-- logical: boolean (nullable = true)
 |-- bitwise: integer (nullable = true)
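The bitwise column above can be sanity-checked without Spark, because Python's own & is also a bit-by-bit AND on integers. A minimal sketch in plain Python, where the rows list simply repeats the int1/int2 values from the example data:

```python
# (int1, int2) pairs copied from the example data above
rows = [(0x01, 0x01), (0xFF, 0x0A), (0x01, 0x00)]

# Bit-by-bit AND, which is what bitwiseAND computes for integer columns
bitwise = [a & b for a, b in rows]
print(bitwise)  # [1, 10, 0]

# 0xFF is 11111111 and 0x0A is 00001010, so only the bits set in both survive
print(format(0xFF & 0x0A, "08b"))  # 00001010
```

The second row is the interesting one: 255 and 10 are both "truthy", yet the result is 10 rather than a boolean, which is the difference from the logical column.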
If you want to dynamically AND together a list of columns, you can fold them with reduce (in Python 3 it has to be imported from functools):
from functools import reduce
columns = [col("b1"), col("b2")]
df.withColumn("result", reduce(lambda a, b: a & b, columns))
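For the original question, the same fold can be wrapped in a small helper. The sketch below is only an illustration: all_of is a hypothetical name, not a PySpark function, and it is demonstrated on plain Python booleans so that it runs without a Spark session. It relies only on the & operator, which PySpark Column objects overload as logical AND for boolean expressions.

```python
from functools import reduce
import operator


def all_of(conditions):
    """AND together a non-empty list of conditions using the & operator.

    Hypothetical helper: works on anything that supports &, such as
    Python bools or boolean PySpark Column expressions.
    """
    conditions = list(conditions)
    if not conditions:
        # reduce() without an initial value raises TypeError on an empty list
        raise ValueError("need at least one condition")
    return reduce(operator.and_, conditions)


print(all_of([True, True, True]))   # True
print(all_of([True, False, True]))  # False
```

With PySpark, the list could be built from the values to look for, for example all_of([array_contains(df.child_list, v) for v in ["value1", "value2"]]), and the result passed to withColumn in place of expr1 & expr2.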