How to get the end of window timestamp in Apache Beam Python

Tags: python, apache-beam, google-cloud-dataflow

I'm creating sliding time windows 20 seconds long every 5 seconds from batched log data:

    rows = p | 'read events' >> beam.io.Read(beam.io.BigQuerySource(query=query))

    # set timestamp field used for windowing and set 20 second long window every 5 seconds
    ts_rows = (rows | 'set timestamp' >> beam.ParDo(AddTimestampDoFn())
                    | 'set window' >> beam.WindowInto(window.SlidingWindows(20,5)))

    # extract only user id and relevant data, group and process
    rows_with_data = (ts_rows | 'extract data' >> beam.FlatMap(lambda row: 
                                [(str(row['user_id']),[row['data1'], row['data2'],row['data3']])])
                              | 'group by user id' >> beam.GroupByKey()
                              | 'Process window' >> beam.ParDo(WindowDataProcessingDoFn()))

How can I access the timestamp information for each window in Python? (An answer for Java is here but I don't know how to translate it into Python: How to get the max timestamp of the current sliding window) Ideally I'd want the end time of each window rather than the max or min timestamp of the data within the window.

asked Sep 15 '17 13:09

Mike Keyes

1 Answer

I went to the link you provided.

Note: window=beam.DoFn.WindowParam is the parameter which is mentioned on the page you linked.

Beam passes the element's window in for that parameter, and the window end time is its end attribute (window.end). In Python, you can access it like this:

Define your DoFn:

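    import apache_beam as beam

    class BuildRecordFn(beam.DoFn):
        def __init__(self):
            super(BuildRecordFn, self).__init__()

        def process(self, element, window=beam.DoFn.WindowParam):
            # Beam injects the window of the current element for this parameter.
            # window_start = window.start.to_utc_datetime()
            window_end = window.end.to_utc_datetime()
            # element is a tuple here, so append the window end as one more field.
            return [element + (window_end,)]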

Then use it like this:

    # p, known_args, and ParseEventFn come from the surrounding pipeline script (not shown).
    def format_result(xs):
        # Join the key, the count, and the window end time into one CSV line.
        ys = [str(x) for x in xs]
        return ','.join(ys)

    lines = p | ReadFromText(known_args.input)
    counts = (
        lines
        | 'ParseEventFn' >> beam.ParDo(ParseEventFn())
        | 'AddEventTimestamp' >> beam.Map(
            lambda elem: beam.window.TimestampedValue(elem, elem['timestamp']))
        | 'ExtractObjectID' >> beam.Map(lambda elem: elem['objectID'])
        | 'FixedWindow' >> beam.WindowInto(
            beam.window.FixedWindows(60 * 1))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum)
        | 'AddWindowEndTimestamp' >> beam.ParDo(BuildRecordFn())
        | 'Format' >> beam.Map(format_result)
        | WriteToText(known_args.output))
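For the sliding windows in the question (20 seconds long, every 5 seconds), keep in mind that each element belongs to four overlapping windows, so a DoFn placed after the GroupByKey will typically run once per user per window and see a different window.end each time. Here is a minimal plain-Python sketch of which window end times a single timestamp maps to. It is not Beam code: sliding_window_bounds and window_end_utc are hypothetical helper names, timestamps are assumed to be whole seconds since the epoch, and the window offset is assumed to be zero.

```python
from datetime import datetime, timezone


def sliding_window_bounds(timestamp, size, period, offset=0):
    """Return (start, end) pairs, in seconds, of every sliding window
    that contains the given timestamp (all arguments are integers)."""
    # Start of the most recent window that begins at or before the timestamp.
    last_start = timestamp - (timestamp - offset) % period
    # Step back one period at a time while the window still covers the timestamp.
    return [(start, start + size)
            for start in range(last_start, timestamp - size, -period)]


def window_end_utc(end_seconds):
    """Convert a window end in epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(end_seconds, tz=timezone.utc)


# An event at t = 12 s with 20 s windows every 5 s falls into four windows.
windows = sliding_window_bounds(12, size=20, period=5)
ends = [end for _, end in windows]
print(windows)
print([window_end_utc(end).isoformat() for end in ends])
```

So if you want one output row per window end, that is already what you get; if you only want the latest window for an event, you would have to filter on the window end yourself.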

answered Oct 06 '22 00:10

x97Core
