
Read files from a PCollection of GCS filenames in Pipeline?

I have a streaming pipeline hooked up to pub/sub that publishes filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process).

Can I use TextIO for this? Specifically, can it be used in a streaming pipeline when the filename is only known during execution (as opposed to using TextIO as a source, where the filename(s) are known at construction)? If not, I'm thinking of doing something like the following:

1. Get the topic from pub/sub
2. ParDo to read each file and get the lines
3. Process the lines of the file...

Could I use the FileBasedReader or something similar in this case to read the files? The files aren't too big so I wouldn't need to parallelize the reading of a single file, but I would need to read a lot of files.
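To make the "read each file and get the lines" step concrete, here is a minimal sketch of the logic such a ParDo would wrap. It is plain Java with no Beam dependency, and it assumes local file paths as a stand-in for GCS objects. The class and method names (FileLineReader, emitLines, readLines) are hypothetical, and the Consumer callback stands in for whatever output mechanism the real DoFn would use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: the per-element work a "read each file" ParDo would do.
class FileLineReader {

    // Reads one file and passes each line to the callback,
    // the way a DoFn would emit one output element per line.
    static void emitLines(String filename, Consumer<String> out) {
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(filename), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.accept(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to read " + filename, e);
        }
    }

    // Reads many small files one after another and collects all of their lines.
    static List<String> readLines(List<String> filenames) {
        List<String> lines = new ArrayList<>();
        for (String filename : filenames) {
            emitLines(filename, lines::add);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary local file plays the role of a file named in a message.
        Path tmp = Files.createTempFile("events", ".txt");
        Files.write(tmp, Arrays.asList("event-1", "event-2"), StandardCharsets.UTF_8);
        for (String line : readLines(Arrays.asList(tmp.toString()))) {
            System.out.println(line);
        }
        Files.delete(tmp);
    }
}
```

Since each file is small, reading it sequentially inside a single call is enough. Any parallelism would come from many filenames being processed at once, not from splitting one file.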

Isabelle Woodrow asked Aug 28 '15 18:08




1 Answer

You can use the TextIO.readAll() transform, which was recently added to Beam in #3443. For example:

PCollection<String> filenames = p.apply(PubsubIO.readStrings()...);
PCollection<String> lines = filenames.apply(TextIO.readAll());

This will read all lines in each file arriving over pubsub.

jkff answered Oct 20 '22 22:10
