Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Beam BigQueryIO write slow

My Beam pipeline is writing to an unpartitioned BigQuery target table. The PCollection consists of millions of TableRows. BigQueryIO apparently creates a temp file for every single record in the BigQueryWriteTemp temp folder first if I run it with DirectRunner. This is obviously not performing very well. Am I doing something wrong here? This is a normal batch job and not streaming. (The same job running with DataflowRunner doesn't seem to do this)

myrows.apply("WriteToBigQuery",
                BigQueryIO.writeTableRows().to(BQ_TARGET_TABLE)
                        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

This is the log we're seeing. Each of these files contains exactly one TableRow. The same on DataflowRunner seems to create only about 3 files.

2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/59668b03-a1e8-4288-a049-3472e7cb6333.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/feeb454b-799e-4d77-bd12-dec313cdadc2.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3c63db33-787f-4215-a425-3446d92157ed.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/87d55556-e012-4bef-8856-69efd4c5ab26.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5e6bfe94-b1c9-49cb-b0c7-a768d78d85f3.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/b236948b-bdf0-4bfe-9d26-4e67c8904320.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/451abb93-e02a-4210-aa46-5afa0c82547d.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/60fd5ecc-8dbe-46e4-884d-3767694b009f.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/f3a5b4e0-e956-4a41-a78d-c7694950b6f1.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/a4e4c74f-d12c-495d-bf28-eb20ee25f086.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/eb3b29e1-cc0c-4a6d-82f4-8527d0c5a51e.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/916ac41b-4ece-42bb-bf24-c5ca17060d1d.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5b76128f-3c66-4701-92ce-2d3ba2e91f65.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3a0ae709-756e-452c-9b0f-6efa9c0864ca.
like image 294
jimmy Avatar asked Aug 10 '17 14:08

jimmy


1 Answers

Direct runner is meant for testing and development and includes additional checks to insure that the pipeline will run correctly in other runners. This comes with the side effect of decreasing performance.

Here are the additional checks:

  • enforcing immutability of elements
  • enforcing encodability of elements
  • elements are processed in an arbitrary order at all points
  • serialization of user functions (DoFn, CombineFn, etc.)
like image 57
Nathan Nasser Avatar answered Oct 26 '22 18:10

Nathan Nasser