
What do the fields 'totalSize' and 'rawDataSize' mean in DESCRIBE EXTENDED output in Hive?

Running the DESCRIBE EXTENDED command on any Hive table produces output that includes totalSize and rawDataSize values near the end.

What do these fields mean?

Ex:

hive> DESCRIBE EXTENDED <TableName>;

Output Results:

Table(tableName:TablenameXXXXX, dbName:XXxXXX,
..........       .......................
numRows=116429472, totalSize=3835205544, rawDataSize=35040221600})

Henin RK asked Jan 06 '16 06:01


3 Answers

rawDataSize is the size of the original data set; totalSize is the amount of storage it takes up. The difference shows with the ORC file format: because ORC compresses the data, totalSize will be smaller than rawDataSize.
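
To put numbers on that, here is a small sketch that applies plain arithmetic to the three values from the question's output. The variable names are illustrative only; nothing here calls Hive.

```python
# Values copied from the DESCRIBE EXTENDED output in the question.
num_rows = 116_429_472
total_size = 3_835_205_544       # bytes occupied on disk
raw_data_size = 35_040_221_600   # bytes, size of the original data

# How many times larger the original data is than its on-disk form.
raw_to_disk_ratio = raw_data_size / total_size

# Average bytes per row in each representation.
raw_bytes_per_row = raw_data_size / num_rows
disk_bytes_per_row = total_size / num_rows

print(f"raw/disk ratio:     {raw_to_disk_ratio:.2f}")
print(f"raw bytes per row:  {raw_bytes_per_row:.1f}")
print(f"disk bytes per row: {disk_bytes_per_row:.1f}")
```

For this table the original data comes out roughly nine times larger than what is stored on disk.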


Durga Viswanath Gadiraju answered Oct 23 '22 14:10


The meaning of the fields is:

  • totalSize - the total size, in bytes, of the physical files on disk that store the table data.
  • rawDataSize - the sum of the datatype sizes of the columns, multiplied by the number of rows in the table. The query optimizer also uses it as an estimate (e.g. to determine whether a table is small enough for a map join instead of a simple join).
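
The rawDataSize description above can be sketched as follows. This is only an illustration: the per-type byte sizes, the three-column schema, and the 25 MB threshold are all assumptions made for the example, and Hive's actual estimates depend on the SerDe, the column types, and the configuration.

```python
# Hypothetical per-value sizes in bytes (assumed for illustration only).
ASSUMED_TYPE_SIZES = {"int": 4, "bigint": 8, "double": 8, "boolean": 1}


def estimate_raw_data_size(column_types, num_rows):
    """Sum of the column datatype sizes, multiplied by the row count."""
    row_size = sum(ASSUMED_TYPE_SIZES[t] for t in column_types)
    return row_size * num_rows


def is_small_enough_for_map_join(size_bytes, threshold_bytes):
    """True if the estimated size does not exceed the small-table threshold."""
    return size_bytes <= threshold_bytes


# Hypothetical table: three columns, one million rows.
schema = ["int", "bigint", "double"]       # 4 + 8 + 8 = 20 bytes per row
estimate = estimate_raw_data_size(schema, 1_000_000)

threshold = 25_000_000                     # assumed threshold, in bytes
use_map_join = is_small_enough_for_map_join(estimate, threshold)

print(estimate, use_map_join)
```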

Eugen answered Oct 23 '22 16:10


The size of data is described by two statistics:

  • totalSize — Approximate size of data on disk
  • rawDataSize — Approximate size of data in memory

Hive on MapReduce uses totalSize. When both are available, Hive on Spark uses rawDataSize. Because of compression and serialization, a large difference between totalSize and rawDataSize can occur for the same dataset.
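
The selection rule in the paragraph above can be written as a small sketch. The function name and the engine strings are hypothetical, and the fallback when only one statistic is present is an assumption; the answer only states what happens when both are available.

```python
def size_statistic_used(engine, total_size=None, raw_data_size=None):
    """Return the size statistic the given engine would consult.

    engine: "mr" or "spark" (labels chosen for this sketch).
    """
    both_available = total_size is not None and raw_data_size is not None
    if engine == "spark" and both_available:
        return raw_data_size
    if total_size is not None:
        return total_size
    # Assumption: fall back to whichever statistic exists.
    return raw_data_size


# Values from the question's DESCRIBE EXTENDED output.
on_mr = size_statistic_used("mr", 3_835_205_544, 35_040_221_600)
on_spark = size_statistic_used("spark", 3_835_205_544, 35_040_221_600)
print(on_mr, on_spark)
```

With the question's table, the two engines would see sizes that differ by roughly a factor of nine, which is why the same join can be planned differently on each.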

Leonel Atencio answered Oct 23 '22 16:10