I have read through the Beam documentation and also looked through Python documentation but haven't found a good explanation of the syntax being used in most of the example Apache Beam code.
Can anyone explain what the _
, |
, and >>
are doing in the below code? Also is the text in quotes ie 'ReadTrainingData' meaningful or could it be exchanged with any other label? In other words how is that label being used?
train_data = pipeline | 'ReadTrainingData' >> _ReadData(training_data) evaluate_data = pipeline | 'ReadEvalData' >> _ReadData(eval_data) input_metadata = dataset_metadata.DatasetMetadata(schema=input_schema) _ = (input_metadata | 'WriteInputMetadata' >> tft_beam_io.WriteMetadata( os.path.join(output_dir, path_constants.RAW_METADATA_DIR), pipeline=pipeline)) preprocessing_fn = reddit.make_preprocessing_fn(frequency_threshold) (train_dataset, train_metadata), transform_fn = ( (train_data, input_metadata) | 'AnalyzeAndTransform' >> tft.AnalyzeAndTransformDataset( preprocessing_fn))
In Beam, | is a synonym for apply , which applies a PTransform to a PCollection to produce a new PCollection . >> allows you to name a step for easier display in various UIs -- the string between the | and the >> is only used for these display purposes and identifying that particular application.
Apache Beam is an open-source SDK which allows you to build multiple data pipelines from batch or stream based integrations and run it in a direct or distributed way. You can add various transformations in each pipeline.
Apache Beam transforms use PCollection objects as inputs and outputs for each step in your pipeline. A PCollection can hold a dataset of a fixed size or an unbounded dataset from a continuously updating data source. A transform represents a processing operation that transforms data.
Operators in Python can be overloaded. In Beam, |
is a synonym for apply
, which applies a PTransform
to a PCollection
to produce a new PCollection
. >>
allows you to name a step for easier display in various UIs -- the string between the |
and the >>
is only used for these display purposes and identifying that particular application.
See https://beam.apache.org/documentation/programming-guide/#transforms
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With