I am running a sum function of one hive table in Hue, and get a return value of NaN.
Here is my code:
select sum(v1) from hivedb.tb1;
I don't know why it is giving me a NaN result. I checked if any of my v1 values are null:
select * from hivedb.tb1 where v1 is null;
, and it turns out no record has null value. The table has 100 million rows, so I can not do a manual check for each record.
Any help is appreciated. Thank you in advance!
UPDATE 1 I manually screened the first 1000 rows, and luckily spotted some abnormal values of NaN in tb1. It is resulted from some rounding error from the previous steps. So my question 1 is probably answered. Please feel free to comment on it, if you think there could be other reasons.
I still don't know how to use an efficient way to spot the rows with NaN values. So I am still looking forward to any answers to my question #2. Please feel free to share. I appreciate your help.
UPDATE 2 The problem is solved with help in the accepted answer below, in the discussion section. There are multiple ways to deal with it.
Hive CAST String to Integer Data Types When you casting String value within the ranges of each data type, you will get numeric as an output. when a value is out of range you will get NULL . In the last example, 128 is out of range for TINYINT hence it returns NULL value when you cast.
2.1 isnull( a ) This returns a true when the value of a (column) is NULL otherwise it returns false. Above example column _C1 is derived based on salary column, if you notice isnull() function return true for value NULL and false for non NULL values.
NaN is short for not a number. This is how null values are displayed in Databricks SQL.
Hive relies on Java (plus SQL-specific semantics for Null and friends), and Java honors the IEEE standard for number semantics. Which means that... NaN is tricky.
Quoting that post...
(Float.NaN == Float.NaN)
always returns false.
In fact, if you look at the JDK implementation ofFloat.isNaN()
, a number is not-a-number if it is not equal to itself (which makes sense because a number should be equal to itself).
The same holds for Double.NaN
So, there is no point in showing you how to use the (undocumented) Hive function called reflect2
, which allows you to invoke raw Java methods on Hive columns, i.e.
where v1 is not null and not reflect2(v1, "isNaN")
...because -- in theory -- you can simply state:
where v1 is not null and v1=v1
Disclaimer -- I have seen cases where the Hive optimizer makes aggressive "optimizations" and produces wrong results.
In other words, if the simple v1=v1
clause does not filter out the NaN values as expected, then look into reflect2
...
Edit -- indeed, the optimizer appears to ignore the v1=v1
clause in some versions of Hive (see comments) so a more devious formula is necessary:
v1 +1.0 > v1
should work... except when rounding errors make either abs(v1)
<< 1 or abs(v1)
>> 1v1 =0.0
In the end, the most robust approach appears to try cast(v1 as String) <>'NaN'
(because all possible NaN values are displayed as "NaN" even if they are not strictly "equal" in the arithmetical sense).
reflect2
-- you can see that it is indeed not mentioned in the official Hive doc, while reflect
is mentioned (and even has a specific Wiki entry). But it has been implemented as early as Hive V0.11 cf. Hive-4025
Edit -- Java "reflection" is now disabled by default for ODBC / JDBC / Hue connections (see comments), and cannot be re-enabled when using security plug-ins such as ranger or Sentry. So its usage is restricted to the (deprecated) hive
CLI.
You can handle NaN as
SELECT SUM(CAST(IF(v1 ='NaN', 0, v1)) as Double) FROM hivedb.tb1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With