Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Figuring out difference between these two queries involving datetime

I was going through bigquery's chicago taxi data. I expected to get two of the same answers from these two codes. Note that trip_start_timestamp is originally in datetime format (e.g 2015-12-23 00:00:00 UTC)

SELECT
   COUNT(DISTINCT unique_key) AS num_trips
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE  
trip_start_timestamp >= '2015-12-23'
and 
trip_start_timestamp <= '2015-12-27'

SELECT
  COUNT(DISTINCT unique_key) AS num_trips
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE  
cast(trip_start_timestamp as date) >= '2015-12-23'
and 
cast(trip_start_timestamp as date) <= '2015-12-27'

I tried to see if they were the same by looking at this code:

SELECT 
    COUNT(DISTINCT unique_key) AS num_trips, trip_start_timestamp
 FROM
      `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    WHERE  
     trip_start_timestamp >= '2015-12-23'
     and 
     trip_start_timestamp <= '2015-12-27'
GROUP BY trip_start_timestamp 
ORDER BY trip_start_timestamp 

This produces enter image description here

Then simply changing the WHERE condition with the CAST(trip_start_timestamp as date) variant. The tables look exactly the same.

Yet the end results between the first 2 codes is 150,000 results vs 200,000. Can anyone note why there might be a difference?

like image 204
NimbleTortoise Avatar asked Mar 02 '26 05:03

NimbleTortoise


1 Answers

All records with timestamp for '2015-12-27' date - like for example '2015-12-27 06:15:00 UTC' - are considered > then '2015-12-27' which I would expect you agree make sense
At the same time when you CAST such timestamp to DATE it "becomes" a 2015-12-27 which is obviously = '2015-12-27'

So, there are 45,413 such records, which is exactly difference between the two results. Because for such records below WHERE clause is false

WHERE trip_start_timestamp >= '2015-12-23'
and trip_start_timestamp <= '2015-12-27'  

Below are simplified examples of above effect

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP('2015-12-27 06:15:00 UTC') trip_start_timestamp UNION ALL
  SELECT TIMESTAMP('2015-12-26 06:15:00 UTC')
)
SELECT *
FROM `project.dataset.table`
WHERE trip_start_timestamp >= '2015-12-23'
AND trip_start_timestamp <= '2015-12-27' 

returns ONLY ONE record

Row trip_start_timestamp     
1   2015-12-26 06:15:00 UTC  

while

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP('2015-12-27 06:15:00 UTC') trip_start_timestamp UNION ALL
  SELECT TIMESTAMP('2015-12-26 06:15:00 UTC')
)
SELECT *
FROM `project.dataset.table`
WHERE CAST(trip_start_timestamp AS DATE) >= '2015-12-23'
AND CAST(trip_start_timestamp AS DATE) <= '2015-12-27'

returns ALL Two records

Row trip_start_timestamp     
1   2015-12-27 06:15:00 UTC  
2   2015-12-26 06:15:00 UTC  

This is because timestamp '2015-12-27 06:15:00 UTC' is considered greater than '2015-12-27' and as such is excluded from result of first query

Hope this simplified example will help :o)

like image 162
Mikhail Berlyant Avatar answered Mar 03 '26 21:03

Mikhail Berlyant