Explain Apache Beam Python syntax

I have read through the Beam documentation and also looked through Python documentation but haven't found a good explanation of the syntax being used in most of the example Apache Beam code.

Can anyone explain what the _, |, and >> are doing in the code below? Also, is the text in quotes, i.e. 'ReadTrainingData', meaningful, or could it be exchanged with any other label? In other words, how is that label being used?

    train_data = pipeline | 'ReadTrainingData' >> _ReadData(training_data)
    evaluate_data = pipeline | 'ReadEvalData' >> _ReadData(eval_data)

    input_metadata = dataset_metadata.DatasetMetadata(schema=input_schema)

    _ = (input_metadata
         | 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(
             os.path.join(output_dir, path_constants.RAW_METADATA_DIR),
             pipeline=pipeline))

    preprocessing_fn = reddit.make_preprocessing_fn(frequency_threshold)
    (train_dataset, train_metadata), transform_fn = (
        (train_data, input_metadata)
        | 'AnalyzeAndTransform' >> tft.AnalyzeAndTransformDataset(
            preprocessing_fn))
asked May 05 '17 03:05 by dobbysock1002



1 Answer

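Operators in Python can be overloaded. In Beam, | is a synonym for apply, which applies a PTransform to a PCollection to produce a new PCollection. >> lets you name a step for easier display in various UIs; the string between the | and the >> is used only for display and for identifying that particular application of the transform, so you can replace it with any other label. The _ is not Beam syntax at all: it is an ordinary Python variable name, used by convention for a result you do not intend to use.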
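To make the overloading concrete, here is a minimal sketch of how | and >> could be wired up with Python's special methods. ToyTransform, ToyCollection and double are hypothetical names invented for this illustration; they are not Beam's real classes and the sketch does not claim to mirror Beam's implementation. Note that >> binds more tightly than | in Python, so collection | 'Label' >> transform is parsed as collection | ('Label' >> transform).

```python
class ToyTransform:
    """Hypothetical stand-in for a Beam PTransform (not Beam's real class)."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label or fn.__name__

    def __rrshift__(self, label):
        # Runs for `'SomeLabel' >> transform`: str does not implement >>,
        # so Python falls back to the right operand's __rrshift__.
        return ToyTransform(self.fn, label)


class ToyCollection:
    """Hypothetical stand-in for a Beam PCollection (not Beam's real class)."""

    def __init__(self, items, applied=()):
        self.items = list(items)
        self.applied = list(applied)  # labels of the steps applied so far

    def apply(self, transform):
        # Apply the transform to every element and record the step label.
        return ToyCollection(
            [transform.fn(x) for x in self.items],
            self.applied + [transform.label],
        )

    def __or__(self, transform):
        # `collection | transform` is just another spelling of apply().
        return self.apply(transform)


def double(x):
    return x * 2


# Parsed as: ToyCollection([1, 2, 3]) | ('DoubleEverything' >> ToyTransform(double))
result = ToyCollection([1, 2, 3]) | 'DoubleEverything' >> ToyTransform(double)

# `_` is a plain variable name; by convention it signals an unused result.
_ = ToyCollection([1]) | 'Discarded' >> ToyTransform(double)

print(result.items)    # [2, 4, 6]
print(result.applied)  # ['DoubleEverything']
```

In this sketch the label only ends up in the applied list and never affects the computed values, which matches the point above that the string is there for naming the step rather than for changing what it does.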

See https://beam.apache.org/documentation/programming-guide/#transforms

answered Sep 20 '22 21:09 by rf-