 

Collecting output from Apache Beam pipeline and displaying it to console

Tags:

apache-beam

I have been working with Apache Beam for a couple of days. I want to iterate quickly on the application I am working on and make sure the pipeline I am building is error free. In Spark we can use sc.parallelize, and when we apply an action we get back a value that we can inspect.

Similarly, while reading about Apache Beam, I found that we can create a PCollection and work with it using the following syntax:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    lines = pipeline | beam.Create(["this is test", "this is another test"])
    word_count = (lines
                  | "Word" >> beam.ParDo(lambda line: line.split(" "))
                  | "Pair of One" >> beam.Map(lambda w: (w, 1))
                  | "Group" >> beam.GroupByKey()
                  | "Count" >> beam.Map(lambda kv: (kv[0], sum(kv[1]))))
    result = pipeline.run()

I actually want to print the result to the console, but I couldn't find any documentation on how to do this.

Is there a way to print the result to the console instead of saving it to a file each time?

Asked Sep 25 '17 by Shamshad Alam

People also ask

What are PCollection and PTransform in Dataflow?

A PCollection is the input and output of each PTransform. A PTransform is an operation performed on a PCollection as a whole, not on a single element: it takes an input PCollection and transforms it into zero or more output PCollections.

What is a PTransform in Apache Beam?

A PTransform is an object describing (not executing) a computation. The actual execution semantics for a transform are captured by a runner object. A transform object always belongs to a pipeline object.
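As a minimal sketch (the step names and data here are illustrative, not from the question), applying PTransforms only builds the pipeline; nothing executes until a runner is asked to run it:

import apache_beam as beam

pipeline = beam.Pipeline()

# Applying PTransforms only describes the computation; no data is processed yet.
numbers = pipeline | "create" >> beam.Create([1, 2, 3])
doubled = numbers | "double" >> beam.Map(lambda x: x * 2)

# Execution happens only when the runner is invoked.
result = pipeline.run()
result.wait_until_finish()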

What is CoGroupByKey used for?

CoGroupByKey aggregates all input elements by their key and allows downstream processing to consume all values associated with the key. While GroupByKey performs this operation over a single input collection, and thus a single type of input value, CoGroupByKey operates over multiple input collections.
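For example, a rough sketch (the sample data is made up) joining two keyed PCollections with CoGroupByKey could look like this:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    emails = pipeline | "emails" >> beam.Create([("amy", "amy@example.com")])
    phones = pipeline | "phones" >> beam.Create([("amy", "555-1234"),
                                                 ("bob", "555-9876")])

    # CoGroupByKey joins both collections on the key; each output element is
    # (key, {"emails": [...], "phones": [...]}).
    joined = ({"emails": emails, "phones": phones}
              | beam.CoGroupByKey()
              | beam.Map(print))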


2 Answers

You don't need the temp list. In Python 2.7 the following should be sufficient:

def print_row(row):
    print row

(pipeline 
    | ...
    | "print" >> beam.Map(print_row)
)

result = pipeline.run()
result.wait_until_finish()

In Python 3.x, print is a function, so the following is sufficient:

(pipeline 
    | ...
    | "print" >> beam.Map(print)
)

result = pipeline.run()
result.wait_until_finish()
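Putting the pieces together, a self-contained sketch of this approach (the sample data and step names are illustrative) could look like:

import apache_beam as beam

# Using the pipeline as a context manager runs it and waits on exit,
# so no explicit run()/wait_until_finish() is needed.
with beam.Pipeline() as pipeline:
    (pipeline
     | "create" >> beam.Create(["this is test", "this is another test"])
     | "split" >> beam.FlatMap(lambda line: line.split(" "))
     | "pair" >> beam.Map(lambda w: (w, 1))
     | "count" >> beam.CombinePerKey(sum)
     | "print" >> beam.Map(print))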
Answered Oct 10 '22 by Oliver

After exploring further and understanding how to write test cases for my application, I figured out a way to print the result to the console. Please note that I am currently running everything on a single-node machine while trying to understand the functionality provided by Apache Beam and how I can adopt it without compromising industry best practices.

So, here is my solution. At the very last stage of the pipeline we can introduce a map function that will either print the result to the console or accumulate the result in a variable; later we can print the variable to inspect the values.

import apache_beam as beam

# let's have some sample strings
data = ["this is sample data", "this is yet another sample data"]

# create a pipeline
pipeline = beam.Pipeline()
counts = (pipeline | "create" >> beam.Create(data)
    | "split" >> beam.ParDo(lambda row: row.split(" "))
    | "pair" >> beam.Map(lambda w: (w, 1))
    | "group" >> beam.CombinePerKey(sum))

# let's collect our result into an output list with a map transformation
output = []
def collect(row):
    output.append(row)
    return True

counts | "print" >> beam.Map(collect)

# Run the pipeline
result = pipeline.run()

# let's wait until the result is available
result.wait_until_finish()

# print the output
print(output)
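Note that collecting results into a local Python list like this only works when the pipeline runs in-process (the local DirectRunner); on a distributed runner the list would live on the workers. For checking results in tests, Beam also ships assert_that and equal_to in apache_beam.testing.util; a rough sketch (the sample data is mine) could look like:

import apache_beam as beam
from apache_beam.testing.util import assert_that, equal_to

with beam.Pipeline() as pipeline:
    counts = (pipeline
              | "create" >> beam.Create(["a b", "a"])
              | "split" >> beam.FlatMap(lambda row: row.split(" "))
              | "pair" >> beam.Map(lambda w: (w, 1))
              | "group" >> beam.CombinePerKey(sum))

    # Fails the pipeline (and the test) if the PCollection does not match.
    assert_that(counts, equal_to([("a", 2), ("b", 1)]))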
Answered Oct 10 '22 by Shamshad Alam