I have a question regarding the Binary data type. I am trying to write a Parquet schema for my MR job so that the job creates the Parquet file itself, rather than having Hive or Impala create one. I see some references to a Binary type, which I do not see in Parquet.
Is binary an alias for BYTE_ARRAY?
Also, is UTF-8 the default encoding for Binary data types?
Raw bytes are stored in Parquet either as a fixed-length byte array (FIXED_LEN_BYTE_ARRAY) or as a variable-length byte array (BYTE_ARRAY, also called binary). Fixed is used when you have values with a constant size, like a SHA1 hash value. Most of the time, the variable-length version is used.
Strings are encoded as variable-length binary with the UTF8 type annotation to indicate how to interpret the raw bytes back into a String. UTF8 is the only encoding supported in the format, but not every binary uses UTF8 because not all binary fields are storing string data.
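To make the distinction concrete, here is a minimal sketch using only the Java standard library. The schema text, the field names (`name`, `payload`, `sha1`) and the class name `BinaryFieldSketch` are made up for illustration. The schema string follows Parquet's message-type text syntax, which parquet-mr can parse with `MessageTypeParser.parseMessageType`; that call is not made here, so the example runs without Parquet on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class BinaryFieldSketch {
    // Hypothetical schema: a UTF8-annotated string, an un-annotated
    // variable-length binary, and a fixed-length 20-byte field.
    static final String SCHEMA =
        "message example {\n"
      + "  required binary name (UTF8);\n"
      + "  optional binary payload;\n"
      + "  required fixed_len_byte_array(20) sha1;\n"
      + "}\n";

    // A string becomes variable-length bytes; the UTF8 annotation tells
    // readers how to turn those bytes back into a String.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // A SHA-1 digest is always 20 bytes, so it suits a fixed-length field.
    static byte[] sha1(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] name = utf8Bytes("h\u00e9llo");
        System.out.println(name.length);        // 6, since the accented e takes two bytes
        System.out.println(sha1(name).length);  // 20
        System.out.print(SCHEMA);
    }
}
```

The `payload` field carries no annotation, so its bytes are not interpreted as text at all. As far as I know, parquet-mr's `PrimitiveTypeName` enum names the variable-length type `BINARY` rather than `BYTE_ARRAY`, which matches the lowercase `binary` keyword in the schema text.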
There is no data type called BYTE_ARRAY in parquet-column. I looked at PrimitiveType in the latest package but could not find it. I also could not write a byte[] as binary.