Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PySpark Row objects: accessing row elements by variable names

One can access PySpark Row elements using the dot notation: given r= Row(name="Alice", age=11), one can get the name or the age using r.name or r.age respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is to do r.toDict()[element]. However, consider a situation where we have a large DataFrame and we wish to map a function on each row of that data frame. We can certainly do something like

def f(row, element1, element2):
    row = row.asDict()
    return ", ".join(str(row[element1]), str(row[element2]))

result = dataframe.map(lambda row: f(row, 'age', 'name'))

However, it seems that calling toDict() on every row will be very inefficient. Is there a better way?

like image 563
David D Avatar asked Mar 13 '23 20:03

David D


1 Answers

As always in Python if something works there is no magic there. When something works, like dot syntax here, it means a predictable chain of events. In particular you can expect that __getattr__ method will be called:

from pyspark.sql import Row

a_row = Row(foo=1, bar=True)

a_row.__getattr__("foo")
## 1
a_row.__getattr__("bar")
True

Row also overrides __getitem__ to have the same behavior:

a_row.__getitem__("foo")
## 1

It means you can use bracket notation:

a_row["bar"]
## True

Problem is that it is not efficient. Each call is O(N) so a single conversion to dict can be more efficient if you have wide rows and multiple calls.

In general you should avoid calls like this:

  • using UDF is as inefficient but much cleaner in general
  • using built-in SQL expressions should be preferred over map
  • you shouldn't map directly over DataFrame. Its gonna be deprecated soon.
like image 79
zero323 Avatar answered Mar 15 '23 13:03

zero323