WRT a 3 node cluster composed of c3.2xlarge instances.
I have two tables. Table U has about 65 Million records and contains among other fields latitude and longitude. Table L has about 1 million records and also contains latitude and longitude.
U is stored as an ORC table.
Task is to determine how many U records fall within a 10 mile radius of locations in L.
select l.id, count(u.id) from U u, L l where 3960 *
acos(cos(radians(l.lat)) * cos(radians(u.lat)) * cos(radians(l.long) -
radians(u.long)) + sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0
group by l.id;
The bit 3960 * acos(cos(radians(l.lat)) * cos(radians(u.lat)) *
cos(radians(l.long) - radians(u.long)) + sin(radians(l.lat)) *
sin(radians(u.lat))) < 10.0
is just the distance between the lat/long
pairs should be less than 10 miles constraint.
Problem: The query seems to take forever to complete. While the map phase completes relatively quickly the reduce phase gets stuck at some fixed percentage (80% ish)
I noticed this in the output messages that Hive emits. Number of reduce tasks determined at compile time: 1
I tried increasing the number of reducers by setting mapred.reduce.tasks to 7 but it always ends up as 1. I was not successful in increasing the number of reducers.
This answer seems to suggest that perhaps if I write my query in a different way then I can force more than 1 reducer. But I haven't been able to do that yet.
Execution Time estimates :For a single location in L it takes about 60 seconds to get an answer. By that account it should take 60 million seconds which is around 700 days! Should it take so much time? Even for Hadoop.
I have also tried putting addional constraints like bounding the lat, long in a 10 mile by 10 mile square box with the location in L at the centre of the box but the time taken is now 40 seconds for 1 location which is not a vast improvement.
Questions:
1) How can I force more number of reducers? 2) Is there a better(in terms of execution time) query? 3) Any other advice that will help me solve this problem.
Version: Hadoop - 2.7.0 Java 1.7.0_80 Hive 1.2.1
As told earlier,
mapred.reduce.tasks replaced with mapreduce.job.reduces
Have a look at all deprecated attributes and new values for those deprecated values from this article.
Apart from deprecated attributes, you have to migrate some of the functions as well as explained in this article
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With