Read files from a PCollection of GCS filenames in Pipeline?

Tags:

google-cloud-dataflow

I have a streaming pipeline hooked up to a Pub/Sub topic that publishes the filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process).

Can I use TextIO? Can it be used in a streaming pipeline when the filename is only known during execution (as opposed to using TextIO as a source, where the filename(s) are known at construction)? If not, I'm thinking of doing something like the following:

1. Get the topic from pub/sub
2. ParDo to read each file and get the lines
3. Process the lines of the file...

Could I use the FileBasedReader or something similar in this case to read the files? The files aren't too big so I wouldn't need to parallelize the reading of a single file, but I would need to read a lot of files.

886

asked Aug 28 '15 18:08

Isabelle Woodrow

1 Answer

You can use the TextIO.readAll() transform, which was recently added to Beam in #3443. For example:

PCollection<String> filenames = p.apply(PubsubIO.readStrings()...);
PCollection<String> lines = filenames.apply(TextIO.readAll());

This will read all lines in each file whose name arrives over Pub/Sub.
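For intuition, the per-element behaviour (one filename in, many lines out) can be modelled in plain Java. This is only a rough sketch and not Beam code: it uses local temp files in place of GCS objects, and the class and method names (ReadAllModel, linesOf, readAll, demo) are made up for illustration. It reads each file whole, which assumes the files are small, as stated in the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical, stdlib-only model of "filename in, lines out".
// Local paths stand in for GCS objects; this is not the Beam API.
public class ReadAllModel {

    // One input element (a filename) becomes zero or more output elements (its lines).
    static Stream<String> linesOf(String filename) {
        try {
            // Reads the whole file into memory; acceptable only for small files.
            return Files.readAllLines(Paths.get(filename), StandardCharsets.UTF_8).stream();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Flattens the lines of every named file into one list, in input order.
    static List<String> readAll(List<String> filenames) {
        return filenames.stream()
                .flatMap(ReadAllModel::linesOf)
                .collect(Collectors.toList());
    }

    // Writes two small temp files, reads them back by name, then cleans up.
    static List<String> demo() {
        try {
            Path a = Files.createTempFile("events-a", ".txt");
            Path b = Files.createTempFile("events-b", ".txt");
            try {
                Files.write(a, Arrays.asList("e1", "e2"));
                Files.write(b, Arrays.asList("e3"));
                return readAll(Arrays.asList(a.toString(), b.toString()));
            } finally {
                Files.deleteIfExists(a);
                Files.deleteIfExists(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real pipeline the runner distributes the filenames across workers, so many files are read in parallel even though each individual file is read sequentially.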

144

answered Oct 20 '22 22:10

jkff
