Dataflow mixing Integer & Long types

In my Dataflow pipeline, I'm setting the field impressions_raw as a Long in a com.google.api.services.bigquery.model.TableRow object.


Further on in my pipeline, I read the TableRow back out. But instead of a Long, I get back an Integer.


However, if I explicitly set the field to a Long value greater than Integer.MAX_VALUE, for example 3 billion, then I get back a Long!

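In code, the round trip looks roughly like this (a sketch standing in for the original screenshots; the field name impressions_raw is from the question, the imports assume the Dataflow 1.x SDK, and TableRowJsonCoder/CoderUtils are used here only to simulate the encode/decode that happens between pipeline steps):

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.coders.TableRowJsonCoder;
    import com.google.cloud.dataflow.sdk.util.CoderUtils;

    public class TableRowTypeDemo {
        public static void main(String[] args) throws Exception {
            TableRowJsonCoder coder = TableRowJsonCoder.of();

            // Set the field as a Long...
            TableRow row = new TableRow().set("impressions_raw", 30L);

            // ...but after the row has been serialized and deserialized (as happens
            // between pipeline steps), the value comes back as an Integer.
            TableRow decoded = CoderUtils.decodeFromByteArray(
                coder, CoderUtils.encodeToByteArray(coder, row));
            System.out.println(decoded.get("impressions_raw").getClass()); // java.lang.Integer

            // With a value larger than Integer.MAX_VALUE, a Long comes back instead.
            TableRow bigRow = new TableRow().set("impressions_raw", 3000000000L);
            TableRow decodedBig = CoderUtils.decodeFromByteArray(
                coder, CoderUtils.encodeToByteArray(coder, bigRow));
            System.out.println(decodedBig.get("impressions_raw").getClass()); // java.lang.Long
        }
    }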

It seems that the Dataflow SDK is doing some sort of type-check optimization under the hood.

So, without doing ugly type checking, how should one programmatically deal with this? (maybe I missed something obvious)

asked Oct 19 '22 by Graham Polley

1 Answer

Thanks for the report. Unfortunately, this problem is fundamental to the use of TableRow. We strongly recommend solution 1 below: convert away from TableRow as soon as practical in your pipeline.

The TableRow object in which you are storing these values is serialized and deserialized by Jackson, inside of TableRowJsonCoder. Jackson has exactly the behavior you're describing -- that is, for this class:

class MyClass {
    Object v;
}

it will serialize an instance with v = Long.valueOf(<number>) as {v: 30} or {v: 3000000000}. On deserializing, however, it will determine the type of the value using the number of bits needed to represent it. See this SO post.
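For instance, a self-contained Jackson round trip (outside Dataflow entirely) shows the same effect; the class here mirrors the MyClass above, with a public field so Jackson can serialize it without extra configuration:

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JacksonNumberDemo {
        static class MyClass {
            public Object v;
        }

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();

            MyClass small = new MyClass();
            small.v = Long.valueOf(30L);
            MyClass large = new MyClass();
            large.v = Long.valueOf(3000000000L);

            // The JSON carries no type information, only the digits.
            String smallJson = mapper.writeValueAsString(small); // {"v":30}
            String largeJson = mapper.writeValueAsString(large); // {"v":3000000000}

            // Reading back, Jackson picks the narrowest Java type that fits.
            System.out.println(mapper.readValue(smallJson, MyClass.class).v.getClass()); // Integer
            System.out.println(mapper.readValue(largeJson, MyClass.class).v.getClass()); // Long
        }
    }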

Two possible solutions come to mind, with solution 1 strongly recommended:

  1. Do not use TableRow as an intermediate value. In other words, convert to a POJO as soon as possible. The key reason this type mix-up happens is that TableRow is essentially a Map<String, Object>, and Jackson (or other coders) cannot know that you want a Long back. With a POJO, the types would be clear.

    The other advantage of switching away from TableRow is that you can use an efficient coder, say, AvroCoder. Because TableRows are encoded and decoded to/from JSON, the encoding is both verbose and slow -- shuffling TableRows will be both CPU- and I/O-intensive. I expect you'll see much better performance with Avro-coded POJOs than if you're passing TableRow objects around.

    For an example, see LaneInfo in TrafficMaxLaneFlow; a minimal POJO sketch also follows after this list.

  2. Write code that can handle both:

    // Variant 1: return a primitive long.
    long numberToLong(@Nonnull Number n) {
        return n.longValue();
    }
    long x = numberToLong((Number) row.get("field"));

    // Variant 2: return a boxed Long, reusing the existing object when n is
    // already a Long.
    Long numberToLong(@Nonnull Number n) {
        if (n instanceof Long) {
            // avoid a copy
            return (Long) n;
        }
        return Long.valueOf(n.longValue());
    }
    Long x = numberToLong((Number) row.get("field"));
    

    You may need additional checks in the second variant if n may be null.
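To make solution 1 concrete, here is a minimal sketch of such a POJO, assuming the Dataflow 1.x SDK packages (the class and field names are illustrative, not from the question's pipeline):

    import com.google.cloud.dataflow.sdk.coders.AvroCoder;
    import com.google.cloud.dataflow.sdk.coders.DefaultCoder;

    // A typed intermediate value instead of TableRow: the field is declared as a
    // primitive long, so no coder can silently hand back an Integer.
    @DefaultCoder(AvroCoder.class)
    public class Impressions {
        private long impressionsRaw;

        // AvroCoder needs a no-argument constructor.
        public Impressions() {}

        public Impressions(long impressionsRaw) {
            this.impressionsRaw = impressionsRaw;
        }

        public long getImpressionsRaw() {
            return impressionsRaw;
        }
    }

The conversion itself can happen in a small ParDo right after reading from BigQuery, pulling the value out once with ((Number) row.get("impressions_raw")).longValue() and never touching TableRow again downstream.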

answered Oct 21 '22 by Dan Halperin