I was going through bigquery's chicago taxi data. I expected to get two of the same answers from these two codes. Note that trip_start_timestamp is originally in datetime format (e.g 2015-12-23 00:00:00 UTC)
SELECT
COUNT(DISTINCT unique_key) AS num_trips
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_start_timestamp >= '2015-12-23'
and
trip_start_timestamp <= '2015-12-27'
SELECT
COUNT(DISTINCT unique_key) AS num_trips
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
cast(trip_start_timestamp as date) >= '2015-12-23'
and
cast(trip_start_timestamp as date) <= '2015-12-27'
I tried to see if they were the same by looking at this code:
SELECT
COUNT(DISTINCT unique_key) AS num_trips, trip_start_timestamp
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_start_timestamp >= '2015-12-23'
and
trip_start_timestamp <= '2015-12-27'
GROUP BY trip_start_timestamp
ORDER BY trip_start_timestamp
This produces 
Then simply changing the WHERE condition with the CAST(trip_start_timestamp as date) variant. The tables look exactly the same.
Yet the end results between the first 2 codes is 150,000 results vs 200,000. Can anyone note why there might be a difference?
All records with timestamp for '2015-12-27' date - like for example '2015-12-27 06:15:00 UTC' - are considered > then '2015-12-27' which I would expect you agree make sense
At the same time when you CAST such timestamp to DATE it "becomes" a 2015-12-27 which is obviously = '2015-12-27'
So, there are 45,413 such records, which is exactly difference between the two results. Because for such records below WHERE clause is false
WHERE trip_start_timestamp >= '2015-12-23'
and trip_start_timestamp <= '2015-12-27'
Below are simplified examples of above effect
#standardSQL
WITH `project.dataset.table` AS (
SELECT TIMESTAMP('2015-12-27 06:15:00 UTC') trip_start_timestamp UNION ALL
SELECT TIMESTAMP('2015-12-26 06:15:00 UTC')
)
SELECT *
FROM `project.dataset.table`
WHERE trip_start_timestamp >= '2015-12-23'
AND trip_start_timestamp <= '2015-12-27'
returns ONLY ONE record
Row trip_start_timestamp
1 2015-12-26 06:15:00 UTC
while
#standardSQL
WITH `project.dataset.table` AS (
SELECT TIMESTAMP('2015-12-27 06:15:00 UTC') trip_start_timestamp UNION ALL
SELECT TIMESTAMP('2015-12-26 06:15:00 UTC')
)
SELECT *
FROM `project.dataset.table`
WHERE CAST(trip_start_timestamp AS DATE) >= '2015-12-23'
AND CAST(trip_start_timestamp AS DATE) <= '2015-12-27'
returns ALL Two records
Row trip_start_timestamp
1 2015-12-27 06:15:00 UTC
2 2015-12-26 06:15:00 UTC
This is because timestamp '2015-12-27 06:15:00 UTC' is considered greater than '2015-12-27' and as such is excluded from result of first query
Hope this simplified example will help :o)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With