I'm reading the Apache Beam programming guide, which starts off excellent but becomes harder to get through starting with the Schemas section.
My main question here is: Are schemas relevant if you are using Beam in Python? It seems like they might only be relevant if you are using a strongly typed language like Java, but I'm not sure. And while the programming guide is good about using different wording for Java vs Python early on in the guide, once you get to the Schemas section it is focused entirely on Java. So it's hard for me to tell if this is a topic I should know anything about if I am using Python.
Here is the section of the guide I am asking about: https://beam.apache.org/documentation/programming-guide/#schemas
You're right, this section is missing details for Python. Schemas are definitely useful in Beam Python, however. You can do things like:
# Copyright 2022 Google LLC.
# SPDX-License-Identifier: Apache-2.0
import apache_beam as beam

# A single schema'd element with two named fields.
input_row = beam.Row(my_first_row="x", my_second_row=1)

with beam.Pipeline() as pipeline:
    (pipeline
     | "Create input" >> beam.Create([input_row])
     # Select builds a new Row from expressions over the input fields.
     | "Select output data" >> beam.Select(
         new_first_row=lambda x: "Row: " + x.my_first_row,
         new_second_row=lambda x: x.my_second_row + 1))
If you're using regular Beam, I find that this alone makes schemas worth using. Plus, when you add keys, you can keep the values as beam.Rows or lists of beam.Rows, which is really convenient.
That being said, if you're new to Beam, I would definitely recommend checking out Beam DataFrames in Python [link]. This allows you to operate on PCollections as if they were Pandas DataFrames.