
Are Beam Schemas relevant in Python?

Tags:

apache-beam

I'm reading the Apache Beam programming guide, which starts off very well but becomes harder to get through starting with the Schemas section.

My main question here is: Are schemas relevant if you are using Beam in Python? It seems like they might only be relevant if you are using a strongly typed language like Java, but I'm not sure. And while the programming guide is good about using different wording for Java vs Python early on in the guide, once you get to the Schemas section it is focused entirely on Java. So it's hard for me to tell if this is a topic I should know anything about if I am using Python.

Here is the section of the guide I am asking about: https://beam.apache.org/documentation/programming-guide/#schemas

Stephen asked Feb 23 '26 21:02


1 Answer

You're right, this section is missing details for Python. Schemas are definitely useful in Beam Python, however. You can do things like:

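# Copyright 2022 Google LLC.
# SPDX-License-Identifier: Apache-2.0

import apache_beam as beam

# A schema'd element: beam.Row carries named fields.
input_row = beam.Row(my_first_row="x", my_second_row=1)

with beam.Pipeline() as pipeline:
  # beam.Create expects an iterable of elements, so wrap the row in a list.
  (pipeline | "Create input" >> beam.Create([input_row])
            | "Select output data" >> beam.Select(
               # Each keyword becomes a field of the output row.
               new_first_row=lambda x: "Row: " + x.my_first_row,
               new_second_row=lambda x: x.my_second_row + 1))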

If you're using regular Beam, I find that this alone makes schemas worth using. Plus, when you add keys, you can keep the values as beam.Rows or lists of beam.Rows, which is really convenient.

That being said, if you're new to Beam, I would definitely recommend checking out Beam DataFrames in Python [link]. This allows you to operate on PCollections as if they were Pandas DataFrames.

alift-advantage answered Feb 27 '26 03:02