Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive 'limit' in subquery executes after full query

Tags:

hadoop

hive

I’m testing a rather taxing rlike function in a hive query. I figured I’d test against a subset first before applying it to my TB+ of data. The full query is:

create table proxy_parsed_clean as
select
  a.*,
  case 
    when domainname rlike '.*:443$' then 1
    else 0
  end as used_https
from proxy_parsed a;

Since there's so much data, I wrote a query that (ostensibly) would operate against a subset:

select
  case 
    when a.domainname rlike '.*:443$' then 1
    else 0
  end as used_https
from (select domainname from proxy_parsed limit 10) a;

However, this seems to take just as long as the first query. Rather than applying the outer query to the subset, it seems to apply the case statement to the entire dataset and then limit. Running an explain confirmed my suspicions (notice the limit clause is moved to the end of the query):

> explain select case when a.domainname rlike '.*:443$' then 1 else 0 end from (select domainname from proxy_parsed limit 10) a;

+---------------------------------------------------------------------------------------------------------------------+--+
|                                                       Explain                                                       |
+---------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES:                                                                                                 |
|   Stage-1 is a root stage                                                                                           |
|   Stage-0 depends on stages: Stage-1                                                                                |
|                                                                                                                     |
| STAGE PLANS:                                                                                                        |
|   Stage: Stage-1                                                                                                    |
|     Map Reduce                                                                                                      |
|       Map Operator Tree:                                                                                            |
|           TableScan                                                                                                 |
|             alias: proxy_parsed                                                                                     |
|             Statistics: Num rows: 157462377267 Data size: 6298495090688 Basic stats: COMPLETE Column stats: NONE    |
|             Select Operator                                                                                         |
|               expressions: domainname (type: varchar(40))                                                           |
|               outputColumnNames: _col0                                                                              |
|               Statistics: Num rows: 157462377267 Data size: 6298495090688 Basic stats: COMPLETE Column stats: NONE  |
|               Limit                                                                                                 |
|                 Number of rows: 10                                                                                  |
|                 Statistics: Num rows: 10 Data size: 400 Basic stats: COMPLETE Column stats: NONE                    |
|                 Reduce Output Operator                                                                              |
|                   sort order:                                                                                       |
|                   Statistics: Num rows: 10 Data size: 400 Basic stats: COMPLETE Column stats: NONE                  |
|                   TopN Hash Memory Usage: 0.1                                                                       |
|                   value expressions: _col0 (type: varchar(40))                                                      |
|       Reduce Operator Tree:                                                                                         |
|         Select Operator                                                                                             |
|           expressions: VALUE._col0 (type: varchar(40))                                                              |
|           outputColumnNames: _col0                                                                                  |
|           Statistics: Num rows: 10 Data size: 400 Basic stats: COMPLETE Column stats: NONE                          |
|           Limit                                                                                                     |
|             Number of rows: 10                                                                                      |
|             Statistics: Num rows: 10 Data size: 400 Basic stats: COMPLETE Column stats: NONE                        |
|             Select Operator                                                                                         |
|               expressions: CASE WHEN ((_col0 rlike '.*:443$')) THEN (1) ELSE (0) END (type: int)                    |
|               outputColumnNames: _col0                                                                              |
|               Statistics: Num rows: 10 Data size: 400 Basic stats: COMPLETE Column stats: NONE                      |
|               File Output Operator                                                                                  |
|                 compressed: false                                                                                   |
|                 Statistics: Num rows: 10 Data size: 400 Basic stats: COMPLETE Column stats: NONE                    |
|                 table:                                                                                              |
|                     input format: org.apache.hadoop.mapred.TextInputFormat                                          |
|                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                       |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                       |
|                                                                                                                     |
|   Stage: Stage-0                                                                                                    |
|     Fetch Operator                                                                                                  |
|       limit: -1                                                                                                     |
|       Processor Tree:                                                                                               |
|         ListSink                                                                                                    |
|                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------+--+

If I am simply to run select * from proxy_parsed limit 10;, the query executes blazingly quickly. Can someone explain A), why the query is not executing on the subset, and B) how to make it do so?

I could just create a temp table, select 10 records into it and then execute the query, but that seems sloppy. Plus, I'd have a temp table to clean up after that. This behavior seems like a Hive bug, i.e., the limit behavior is clearly not behaving as it seems it should be in this case.

like image 955
TayTay Avatar asked Sep 05 '25 03:09

TayTay


1 Answers

The limit is not applied after the case, but before and during processing the case - it actually gets applied twice. Although it is a coincidence, in this case the two applications of limit correspond to the inner and the outer query, respectively.

In the query plan you can see that the Map phase just selects a single column ("expressions: domainname") and also reduces the number of results to 10 (from 157462377267). This corresponds to the inner query. Then the Reduce phase applies the case ("expressions: CASE WHEN ((_col0 rlike '.*:443$')) THEN (1) ELSE (0) END") and also reduces the number of rows to 10, but you can see that the expected number of input rows is already 10 in this phase. The Reduce phase corresponds to the outer query.

The reason why the limit is applied twice is the distributed execution. Since at the end of the Map phase you want to minimize the amount of data sent to the Reducers, it makes sense to apply the limit here. After the limit is reached, the Mapper won't process any more of the input. This is not enough however, since potentially each Mapper may produce up to 10 results, adding up to ten times the number of Mappers, thereby the Reduce phase has to apply the limit again. Because of this mechanism, in general you should apply the limit directly instead of creating a subquery for this sole purpose.

To summarize, in my interpretation the query plan looks good - the limit is processed in the places where it should. This answers your question about why the limit gets applied before the case. Sadly, it does not explain why it takes so much time though.

Update: Please see ozw1z5rd's answer about why this query is slow in spite of using limit. It explains that using a subquery causes a MapReduce job to be launched, while a direct query avoids that.

like image 164
Zoltan Avatar answered Sep 07 '25 22:09

Zoltan