The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguously dated sections. The data resembles:
"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14 -- Start of March
"2008-03-02";17
"2008-03-05";17
Although possible to extrapolate missing data (e.g., by averaging from other years) to provide contiguous ranges, to simplify the system, I want to flag the non-contiguous segments based on whether there is a contiguous range of dates to fill the month:
D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14 -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17
Some measurements were taken in the year 1843.
For all weather stations, how would you mark all the days in months that are missing one or more days?
The code to select the data resembles:
select
m.id,
m.taken,
m.station_id,
m.amount
from
climate.measurement
Generate a table filled with contiguous dates and compare them to the measured data dates.
The problem can be recreated using the SQL in this section.
The table is created as follows:
CREATE TABLE climate.calendar
(
id serial NOT NULL,
n character varying(2) NOT NULL,
d date NOT NULL,
"valid" boolean NOT NULL DEFAULT true,
CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
The following SQL inserts data into a table (id
[int], n
ame [varchar], d
ate [date], valid
[boolean]):
insert into climate.calendar (n, d)
select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'B', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'C', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'D', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'E', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'F', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
The values 'A'
through 'F'
represent the names of weather stations that made a measurement on a particular day.
Delete some rows as follows:
delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);
The following does not toggle the valid
flag to false
for all the days in a month where the month is missing one or more days:
UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
SELECT DISTINCT date_trunc('month', d)
FROM climate.calendar A
WHERE NOT EXISTS (
SELECT 1
FROM climate.calendar B
WHERE A.d - 1 = B.d
)
);
The following SQL produces an empty result set:
with gen_calendar as (
select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;
The following SQL generates all the possible combinations of station names and dates:
select
distinct( cc.n ), t.d
from
climate.calendar cc,
(
select (date('1982-01-1') + (n || ' days')::interval)::date d
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
) t
order by
cc.n
However, in the real data there are several hundred stations and the dates go back to the mid 1800s, so the Cartesian of all dates for all stations is too large. Such an approach might work, given enough time... There must be a faster way.
PostgreSQL has windowing functions.
How to select specific changes using windowing functions in postgres
Thank you!
PostgreSQL's generate_series()
function can create a view that contains a consecutive list of dates:
with calendar as (
select ((select min(date) from test)::date + (n || ' days')::interval)::date cal_date
from generate_series(0, (select max(date) - min(date) from test)) n
)
select cal_date
from calendar c
left join test t on t.date = c.cal_date
where t.date is null;
The expression select max(date) - min(date) from test
might be off by one.
One way to identify invalid months is to create two views. The first counts the number of daily readings each station should produce in each month. (Note that climate.calendar
is translated to climate_calendar
.) The second returns the actual daily readings each station produced per month.
This view will return the actual number of days in a month, per station. (For example, February will always have either 28 or 29 days.)
create view count_max_station_calendar_days as
with calendar as (
select ((select min(d) from climate_calendar)::date + (n || ' days')::interval)::date cal_date
from generate_series(0, (select max(d) - min(d) from climate_calendar)) n
)
select n, extract(year from cal_date) yr, extract(month from cal_date) mo, count(*) num_days
from stations cross join calendar
group by n, yr, mo
order by n, yr, mo
The total number of days returned will be fewer than the tallies. (For example, January will always have 31 days or fewer.)
create view count_actual_station_calendar_days as
select n, extract(year from d) yr, extract(month from d) mo, count(*) num_days
from climate_calendar
group by n, yr, mo
order by n, yr, mo;
Drop the ORDER BY
clauses in production (they're helpful in development).
Join the two views to identify the stations and months that need to be flagged, into a new view:
create view invalid_station_months as
select m.n, m.yr, m.mo, m.num_days - a.num_days num_days_missing
from count_max_station_calendar_days m
inner join count_actual_station_calendar_days a
on (m.n = a.n and m.yr = a.yr and m.mo = a.mo and m.num_days <> a.num_days)
n yr mo num_days_missing
--
A 1982 1 1
E 2007 3 1
The column num_days_missing
is not necessary, but it is useful.
These are the rows that need to be updated:
select cc.*
from climate_calendar cc
inner join invalid_station_months im
on (cc.n = im.n and
extract(year from cc.d) = im.yr and
extract(month from cc.d) = im.mo)
where valid = true
To update them, the id
key is convenient.
update climate_calendar
set valid = false
where id in (
select id
from climate_calendar cc
inner join invalid_station_months im
on (cc.n = im.n and
extract(year from cc.d) = im.yr and
extract(month from cc.d) = im.mo)
where valid = true
);
Here is one way you could do it, assuming you have a BOOLEAN field called is_contiguous. Modify as necessary:
UPDATE measurement
SET is_contiguous = FALSE
WHERE NOT EXISTS (
SELECT 1
FROM measurement B
WHERE measurement.taken - 1 = B.taken
);
Edit:
I believe I misunderstood your requirement. I thought you wanted to flag individual dates that were not contiguous. But apparently, you want to flag an entire month's worth of dates as discontiguous if it is missing any number of days.
Edit 2:
Here is a modified version of my original (incorrect) query which selects the distinct months that are missing any days:
UPDATE measurement
SET is_contiguous = FALSE
WHERE date_trunc('month', taken) IN (
SELECT DISTINCT date_trunc('month', taken)
FROM measurement A
WHERE NOT EXISTS (
SELECT 1
FROM measurement B
WHERE A.taken - 1 = B.taken
)
);
Assuming that there can't be more than one row per day, this should return all the months for which the number of rows is not equal to the number of days in the month.
SELECT station_id, DATE_TRUNC('month', d)
FROM climate.calendar
GROUP BY station_id, DATE_TRUNC('month', d)
HAVING COUNT(*) <>
DATE_PART('month',
DATE_TRUNC('month', d) + INTERVAL '1 month' - INTERVAL '1 day')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With