One can access PySpark Row elements using dot notation: given r = Row(name="Alice", age=11), one can get the name or the age using r.name or r.age respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is r.asDict()[element].
However, consider a situation where we have a large DataFrame and we wish to map a function over each of its rows. We can certainly do something like
def f(row, element1, element2):
    row = row.asDict()
    return ", ".join([str(row[element1]), str(row[element2])])

result = dataframe.map(lambda row: f(row, 'age', 'name'))
However, it seems that calling asDict() on every row will be very inefficient. Is there a better way?
As always in Python, there is no magic here: when something like the dot syntax works, it means a predictable chain of events. In particular, you can expect that the __getattr__ method will be called:
from pyspark.sql import Row
a_row = Row(foo=1, bar=True)
a_row.__getattr__("foo")
## 1
a_row.__getattr__("bar")
## True
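Since dot access goes through __getattr__, a field whose name is stored in a variable can also be read with Python's built-in getattr; a minimal sketch reusing the a_row defined above:
element = "foo"
getattr(a_row, element)
## 1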
Row also overrides __getitem__ to provide the same behavior:
a_row.__getitem__("foo")
## 1
This means you can use bracket notation:
a_row["bar"]
## True
The problem is that neither access path is efficient. Each call is O(N) in the number of fields, so a single conversion to a dict can be more efficient if you have wide rows and perform multiple lookups.
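As a minimal sketch of that convert-once pattern, again reusing a_row from above: asDict() pays the O(N) cost a single time, after which every lookup is an O(1) dict access.
d = a_row.asDict()
d["foo"]
## 1
d["bar"]
## True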
In general, you should avoid calling map directly on a DataFrame like this; it is going to be deprecated soon.
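For the concatenation in the question, one alternative is to stay in the DataFrame API entirely. The following is a sketch, assuming columns named age and name as in the question, using concat_ws from pyspark.sql.functions:
from pyspark.sql import functions as F

# Build the "age, name" string per row without a Python-level map;
# the numeric column is cast to string explicitly.
result = dataframe.select(
    F.concat_ws(", ", dataframe["age"].cast("string"), dataframe["name"])
)
If you really do need to run an arbitrary Python function per row, dataframe.rdd.map(...) is the supported route.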