1) If the partition column has no data, what error do you get when you query on it?
2) If some rows don't have a value for the partition column, how are those rows handled? Is there any data loss?
3) Why does bucketing need to be done on a numeric column? Can a string column be used as well? What is the process, and on what basis do you choose the bucketing column?
4) Are internal (managed) table details also stored in the metastore, or only external table details?
5) What types of queries run only on the mapper side and not in the reducer, and vice versa?
Short answers:
1. If the partition column has no data, what error do you get when you query on it?
A partition in Hive is a folder named key=value with data files inside. If the partition column has no data, no partition folders exist and the table is empty: no error is displayed and no rows are returned.
When you insert NULLs into the partition column using dynamic partitioning, all NULL values in the partition column (and all values that do not conform to the column type) are loaded into the __HIVE_DEFAULT_PARTITION__ partition. If the column type is numeric, a type cast error is then thrown during SELECT, something like "cannot cast Text to IntWritable".
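The folder naming described above can be sketched in a few lines of Python. This is only an illustration, not Hive code: the function name partition_dir, the table location /warehouse/sales and the column country are hypothetical, and the sketch ignores the escaping Hive applies to special characters in partition values.

```python
from typing import Optional

# Name Hive uses for rows whose dynamic partition value is NULL.
DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_dir(table_location: str, column: str, value: Optional[str]) -> str:
    """Return the key=value folder a row would land in (hypothetical helper)."""
    # None stands in for a SQL NULL in the partition column.
    name = DEFAULT_PARTITION if value is None else value
    return f"{table_location}/{column}={name}"


print(partition_dir("/warehouse/sales", "country", "US"))
print(partition_dir("/warehouse/sales", "country", None))
```

Rows with a NULL partition value are therefore still written to a real folder, which is why they remain readable.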
2. If some rows don't have a value for the partition column, how are those rows handled? Is there any data loss?
If "does not have" means NULLs, those rows are loaded into the __HIVE_DEFAULT_PARTITION__ partition. The data can still be queried, so no loss happens.
3. Why does bucketing need to be done on a numeric column? It does not need to be numeric. Can a string column be used as well? Yes. What is the process, and on what basis do you choose the bucketing column?
Choose bucketing columns based on the columns used in joins and filters. Values are hashed, distributed and sorted (clustered), and rows with the same hash are written (during INSERT OVERWRITE) to the same bucket (file). The number of buckets and the bucketing columns are specified in the table DDL.
Bucketed tables and bucket map joins are a somewhat outdated concept; you can achieve the same using DISTRIBUTE BY + sort + ORC. This approach is more flexible.
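To make the hashing step concrete, here is a minimal Python sketch of how a value could be mapped to a bucket. It assumes the legacy scheme as I understand it: a 32-bit INT hashes to itself, an ASCII string gets a Java-style 31-based hash, and the bucket is the non-negative hash modulo the number of buckets. Newer Hive versions may use a different hash function, so treat the exact bucket numbers as illustrative. The names java_string_hash and bucket_for are hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Java-style 31-based string hash, wrapped to a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits
    # Reinterpret the unsigned 32-bit result as a signed int.
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value, num_buckets: int) -> int:
    """Pick a bucket for a 32-bit int or an ASCII string (assumed scheme)."""
    h = value if isinstance(value, int) else java_string_hash(str(value))
    # Mask off the sign bit so the modulo result is never negative.
    return (h & 0x7FFFFFFF) % num_buckets


for key in ["abc", "abd", 10, 11]:
    print(key, "->", bucket_for(key, 4))
```

The point is that strings work as well as numbers: equal values always hash to the same bucket, which is what lets a join on the bucketing column match bucket to bucket.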
4. Are internal (managed) table details also stored in the metastore, or only external table details?
It does not matter whether the table is external or managed: the table schema, grants and statistics are stored in the metastore in both cases.
5. What types of queries run only on the mapper side and not in the reducer, and vice versa?
Queries without aggregations, map joins (when the small table fits in memory), simple column transformations (simple column UDFs like regexp_replace, split, substr, trim, concat, etc.) and filters in WHERE can be executed on the mapper alone.
Aggregations and analytic functions, common joins, ORDER BY, SORT BY, DISTRIBUTE BY and UDAFs are executed on mapper + reducer.
runs only at mapper side not in reducer and vice versa
Vice versa is not possible. The mapper reads the data files; the reducer is an optional next step that cannot exist without a mapper, though map -> reduce -> reduce ... is possible when running on the Tez execution engine. Tez can represent a complex query as a single DAG and run it as a single job, removing unnecessary steps of the MR engine such as writing intermediate results to HDFS and reading them again with a mapper. Map-only jobs are possible even on MR.
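The difference between a map-only query and a map + reduce query can be sketched in plain Python. This is a toy model, not how Hive executes anything: the sample rows and the function names map_only and map_then_reduce are made up for illustration.

```python
from collections import defaultdict

# Made-up sample rows standing in for a table.
rows = [
    {"city": "Oslo", "amount": 10},
    {"city": "Rome", "amount": 5},
    {"city": "Oslo", "amount": 7},
]


def map_only(rows):
    """Filter + column transform: every row is handled independently."""
    return [
        {"city": r["city"].upper(), "amount": r["amount"]}
        for r in rows
        if r["amount"] > 6
    ]


def map_then_reduce(rows):
    """SUM(amount) grouped by city: equal keys must be brought together."""
    # Map step: emit (key, value) pairs.
    pairs = [(r["city"], r["amount"]) for r in rows]
    # Shuffle step: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce step: aggregate each group.
    return {key: sum(values) for key, values in groups.items()}


print(map_only(rows))
print(map_then_reduce(rows))
```

The filter and transform never need to see more than one row at a time, so they can finish on the mapper. The aggregation needs all rows for a key in one place, which is the job of the shuffle and the reducer.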