I am developing a workflow process to run on Google Cloud Dataflow using the Python SDK from Apache Beam.
When running locally, the workflow completes successfully with no errors and the data output is exactly as expected.
When I try to run it on the Dataflow service, it throws the following error:
AttributeError: '_UnwindowedValues' object has no attribute 'sort'
This comes from the following piece of code:
class OrderByDate(beam.DoFn):
    def process(self, context):
        (k, v) = context.element
        v.sort(key=operator.itemgetter('date'))
        return [(k, v)]
And this is called using the standard beam.ParDo, like so:
'order_by_dates' >> beam.ParDo(OrderByDate())
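For context, here is a minimal sketch of how such a step is typically wired up. The upstream transforms are not shown in the question, so the source, the parse step, and the GroupByKey below are assumptions; a GroupByKey (or similar grouping) is what would produce the (k, v) pairs consumed by OrderByDate:

import apache_beam as beam

# Hypothetical pipeline wiring -- the question only shows the ParDo step.
# The input path, the parse_to_key_value helper, and the GroupByKey are
# assumptions added for illustration.
with beam.Pipeline() as p:
    (p
     | 'read' >> beam.io.ReadFromText('gs://some-bucket/input.csv')  # placeholder source
     | 'to_kv' >> beam.Map(parse_to_key_value)  # hypothetical parser returning (code, record)
     | 'group_by_code' >> beam.GroupByKey()
     | 'order_by_dates' >> beam.ParDo(OrderByDate()))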
The data in the (k, v) tuple looks like this example:

('SOME CODE', [{'date': '2017-01-01', 'value': 1}, {'date': '2016-12-14', 'value': 4}])

with v being the grouped collection of date/value records.
I have also tried switching to a standard lambda function, which throws the same error.
Any ideas why this runs differently locally vs. on Dataflow? Or a suggested workaround?
Found a solution: I needed to explicitly convert v to a list with list(v) before doing the sort, which worked.
It is strange how the behaviour differs between running locally and remotely.
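For reference, a minimal sketch of the corrected DoFn, keeping the context.element style from the snippet above. The likely explanation for the local/remote difference is that on the Dataflow runner the grouped values arrive as a lazy iterable (the _UnwindowedValues object from the error), which supports iteration but has no sort() method, whereas the local runner happens to hand the values over already materialised as a list:

import operator

import apache_beam as beam

class OrderByDate(beam.DoFn):
    def process(self, context):
        (k, v) = context.element
        # Materialise the grouped values first; on Dataflow, v is a lazy
        # iterable, so it must be copied into a list before sorting.
        records = list(v)
        records.sort(key=operator.itemgetter('date'))
        return [(k, records)]

Using sorted(v, key=operator.itemgetter('date')) would work equally well, since sorted() accepts any iterable and always returns a new list.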