
Hive - Varchar vs String: is there any advantage if the storage format is the Parquet file format?

I have a Hive table which will hold billions of records. It's time-series data, so the table is partitioned per minute, and each minute we will have around 1 million records.

I have a few fields in my table: VIN number (17 chars), Status (2 chars), etc.

So my question is: when creating the table, if I choose Varchar(X) over String, is there any storage or performance problem?

A few limitations of varchar (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-string) are:

  1. If we provide more than "x" characters, the value is silently truncated, so keeping it as string will be future-proof.

  2. Non-generic UDFs cannot directly use the varchar type as input arguments or return values. String UDFs can be created instead, and the varchar values will be converted to strings and passed to the UDF. To use varchar arguments directly or to return varchar values, create a GenericUDF (a minimal sketch follows this list).

  3. There may be other contexts which do not support varchar, if they rely on reflection-based methods for retrieving type information. This includes some SerDe implementations.
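
For the GenericUDF point above, here is a minimal sketch of a UDF that accepts either a string or a varchar column and returns its length. The class name VinLengthUDF and the length logic are made up for illustration; only the GenericUDF API calls come from Hive:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;

// Hypothetical UDF: returns the length of a string/varchar column such as the VIN.
public class VinLengthUDF extends GenericUDF {
  private PrimitiveObjectInspector inputOI;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    // The same inspector handles STRING, VARCHAR and CHAR inputs.
    inputOI = (PrimitiveObjectInspector) arguments[0];
    return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    Object value = arguments[0].get();
    if (value == null) {
      return null;
    }
    // getString converts whichever primitive arrives (string or varchar) to a Java String.
    String s = PrimitiveObjectInspectorUtils.getString(value, inputOI);
    return s.length();
  }

  @Override
  public String getDisplayString(String[] children) {
    return "vin_length(" + children[0] + ")";
  }
}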

What is the cost I would pay for using string instead of varchar, in terms of storage and performance?

asked Jul 19 '17 by Dave

2 Answers

I will restrict and focus this discussion around the ORC format, given that it has become a de facto standard for Hive storage. I don't believe performance is really a question of VARCHAR vs. STRING in Hive itself. The encoding of the data (see the link below) is the same in both cases for the ORC format. This applies even when you are using a custom SerDe; everything is treated as STRING and the encoding is then applied.

The real issue for me is how a STRING column is consumed by third-party tools and programming languages. If the end use has no documented issue with STRING, it is easy to move forward with STRING over VARCHAR(n). This is especially useful when working with ETL pipelines that map elements end to end and you don't want to risk size errors being silently ignored. Coming back to third-party tools: SAS, for example, has a number of documented issues reading the STRING type when connected to Hive. It will be a pain point for some and simply a point of awareness for others in their respective architectures. For example, a database connecting to Hive via JDBC or ODBC might read the data as VARCHAR(max), which brings a number of challenges that need to be considered.
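
To make the JDBC point concrete, here is a rough sketch of checking what type and precision the Hive JDBC driver reports for a column. The connection URL, credentials, table and column names are placeholders, and the exact precision reported depends on the driver version:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class HiveColumnTypeCheck {
  public static void main(String[] args) throws Exception {
    // Older hive-jdbc versions may need the driver registered explicitly.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-server:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT vin, status FROM vehicle_events LIMIT 1")) {
      ResultSetMetaData md = rs.getMetaData();
      for (int i = 1; i <= md.getColumnCount(); i++) {
        // A STRING column typically comes back with a very large precision, which some
        // consumers treat as VARCHAR(max); a VARCHAR(17) column reports its declared length.
        System.out.printf("%s -> %s(%d)%n",
            md.getColumnName(i), md.getColumnTypeName(i), md.getPrecision(i));
      }
    }
  }
}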

I would suggest considering this compatibility aspect as the major deciding factor, rather than performance within Hive itself. I have not come across anything so far that suggests VARCHAR performs better than STRING when deciding which type to use.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-StringColumnSerialization

Another point is that VARCHAR now supports vectorization. In any case, a UDF that receives a VARCHAR will treat it as a STRING, so that point is negated.

Thanks for correcting me if your understanding differs; a reference link would also help.

answered Oct 02 '22 by Sunil K-Standard Chartered


Let's try to understand this by looking at how it is implemented in the API:

org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter 

Here is where the magic begins:

private DataWriter createWriter(ObjectInspector inspector, Type type) {
    // ... other cases elided ...
    case STRING:
        return new StringDataWriter((StringObjectInspector) inspector);
    case VARCHAR:
        return new VarcharDataWriter((HiveVarcharObjectInspector) inspector);
    // ...
}

The createWriter method of the DataWritableWriter class checks the column's data type, i.e. varchar or string, and accordingly creates the writer class for that type.

Now let's move on to the VarcharDataWriter class.

private class VarcharDataWriter implements DataWriter {
    private HiveVarcharObjectInspector inspector;

    public VarcharDataWriter(HiveVarcharObjectInspector inspector) {
      this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
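      // unwrap the HiveVarchar to its underlying String before writing it as UTF-8 bytes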
      String v = inspector.getPrimitiveJavaObject(value).getValue();
      recordConsumer.addBinary(Binary.fromString(v));
    }
  }

OR

to the StringDataWriter class:

private class StringDataWriter implements DataWriter {
    private StringObjectInspector inspector;

    public StringDataWriter(StringObjectInspector inspector) {
      this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
      String v = inspector.getPrimitiveJavaObject(value);
      recordConsumer.addBinary(Binary.fromString(v));
    }
  }

The addBinary method in both classes writes the value as UTF-8 encoded bytes (via Binary.fromString); the only difference is that the varchar writer first unwraps the HiveVarchar to its underlying String.

Short answer to the question: the bytes written for string and varchar are effectively the same; storage may vary slightly only because varchar enforces its declared length before the write. Performance-wise, as per my understanding, Hive is a schema-on-read tool and the ParquetRecordReader just reads bytes, so there won't be any performance difference between the varchar and string data types.
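
As a small illustration of that point, a value written through either writer path boils down to the same Binary.fromString call, so the bytes are identical. The package name below assumes a recent parquet-mr; older Hive builds bundle the class as parquet.io.api.Binary:

import java.util.Arrays;
import org.apache.parquet.io.api.Binary;

public class EncodingCheck {
  public static void main(String[] args) {
    String vin = "1HGCM82633A004352"; // made-up 17-character VIN
    // STRING path: the writer passes the Java String straight to Binary.fromString.
    byte[] viaString = Binary.fromString(vin).getBytes();
    // VARCHAR path: HiveVarchar.getValue() yields the same String before the identical call.
    byte[] viaVarchar = Binary.fromString(vin).getBytes();
    System.out.println(Arrays.equals(viaString, viaVarchar)); // prints true
  }
}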

answered Oct 02 '22 by sumitya