One can access PySpark Row elements using dot notation: given r = Row(name="Alice", age=11), one can get the name or the age using r.name or r.age respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is r.asDict()[element].
However, consider a situation where we have a large DataFrame and we wish to map a function over each of its rows. We can certainly do something like
def f(row, element1, element2):
    row = row.asDict()
    return ", ".join([str(row[element1]), str(row[element2])])

result = dataframe.map(lambda row: f(row, 'age', 'name'))
However, it seems that calling asDict() on every row will be very inefficient. Is there a better way?
As always in Python, there is no magic here: when something like the dot syntax works, it means a predictable chain of events. In particular, you can expect that the __getattr__ method will be called:
from pyspark.sql import Row
a_row = Row(foo=1, bar=True)
a_row.__getattr__("foo")
## 1
a_row.__getattr__("bar")
## True
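Since dot access goes through __getattr__, a field whose name is stored in a variable can also be read with Python's built-in getattr; a minimal sketch reusing the a_row defined above:
element = "foo"
getattr(a_row, element)
## 1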
Row also overrides __getitem__ to provide the same behavior:
a_row.__getitem__("foo")
## 1
This means you can use bracket notation:
a_row["bar"]
## True
The problem is that neither access path is efficient. Each call is O(N) in the number of fields, so a single conversion to a dict can be more efficient if you have wide rows and perform multiple lookups.
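As a minimal sketch of that convert-once pattern, again reusing a_row from above: asDict() pays the O(N) cost a single time, after which every lookup is an O(1) dict access.
d = a_row.asDict()
d["foo"]
## 1
d["bar"]
## True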
In general, you should avoid calling map directly on a DataFrame like this; it is going to be deprecated soon.
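For the concatenation in the question, one alternative is to stay in the DataFrame API entirely. The following is a sketch, assuming columns named age and name as in the question, using concat_ws from pyspark.sql.functions:
from pyspark.sql import functions as F

# Build the "age, name" string per row without a Python-level map;
# the numeric column is cast to string explicitly.
result = dataframe.select(
    F.concat_ws(", ", dataframe["age"].cast("string"), dataframe["name"])
)
If you really do need to run an arbitrary Python function per row, dataframe.rdd.map(...) is the supported route.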