Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimal way to create a histogram/frequency distribution in Oracle?

I have an events table with two columns eventkey (unique, primary-key) and createtime, which stores the creation time of the event as the number of milliseconds since Jan 1 1970 in a NUMBER column.

I would like to create a "histogram" or frequency distribution that shows me how many events were created in each hour of the past week.

Is this the best way to write such a query in Oracle, using the width_bucket() function? Is it possible to derive the number of rows that fall into each bucket using one of the other Oracle analytic functions rather than using width_bucket to determine what bucket number each row belongs to and doing a count(*) over that?

-- 1305504000000 = 5/16/2011 12:00am GMT
-- 1306108800000 = 5/23/2011 12:00am GMT
select 
timestamp '1970-01-01 00:00:00' + numtodsinterval((1305504000000/1000 + (bucket * 60 * 60)), 'second') period_start,
numevents
from (
  select bucket, count(*) as events from (
    select eventkey, createtime, 
    width_bucket(createtime, 1305504000000, 1306108800000, 24 * 7) bucket
    from events 
    where createtime between 1305504000000 and 1306108800000
  ) group by bucket
) 
order by period_start
like image 411
matt b Avatar asked Jun 01 '11 13:06

matt b


People also ask

How are frequency histograms created in Oracle Database?

Starting in Oracle Database 12 c, if the sampling size is the default of AUTO_SAMPLE_SIZE, then the database creates frequency histograms from a full table scan. For all other sampling percentage specifications, the database derives frequency histograms from a sample.

What is a histogram in DBMS?

A histogram is designed to store data distribution information in the dictionary and helps the query optimizer to generate an efficient execution plan. In Oracle Database 12c, there are four types of histograms: frequency histogram (FH), top frequency histogram (TFH), hybrid histogram (HH), and the old-style height-balanced histogram (HBH).

How is the distribution of a histogram determined?

This is determined based on the number of histogram buckets (NHB) and the number of endpoints that fall in the range of (max-min)/NHB. Oracle Database checks whether a column is distributed nonuniformly before creating a histogram.

What is a top frequency histogram?

Oracle Database Reference to learn about the USER_HISTOGRAMS view A top frequency histogram is a variation on a frequency histogram that ignores nonpopular values that are statistically insignificant. For example, if a pile of 1000 coins contains only a single penny, then you can ignore the penny when sorting the coins into buckets.


2 Answers

If your createtime were a date column, this would be trivial:

SELECT TO_CHAR(CREATE_TIME, 'DAY:HH24'), COUNT(*) 
  FROM EVENTS
 GROUP BY TO_CHAR(CREATE_TIME, 'DAY:HH24');

As it is, casting the createtime column isn't too hard:

select TO_CHAR( 
         TO_DATE('19700101', 'YYYYMMDD') + createtime / 86400000), 
         'DAY:HH24') AS BUCKET, COUNT(*)
   FROM EVENTS
  WHERE createtime between 1305504000000 and 1306108800000
 group by TO_CHAR( 
         TO_DATE('19700101', 'YYYYMMDD') + createtime / 86400000), 
         'DAY:HH24') 
 order by 1

If, alternatively, you're looking for the fencepost values (for example, where do I go from the first decile (0-10%) to the next (11-20%), you'd do something like:

select min(createtime) over (partition by decile) as decile_start,
       max(createtime) over (partition by decile) as decile_end,
       decile
  from (select createtime, 
               ntile (10) over (order by createtime asc) as decile
          from events
         where createtime between 1305504000000 and 1306108800000
       )
like image 52
Adam Musch Avatar answered Nov 05 '22 09:11

Adam Musch


I'm unfamiliar with Oracle's date functions, but I'm pretty certain there's an equivalent way of writing this Postgres statement:

select date_trunc('hour', stamp), count(*)
from your_data
group by date_trunc('hour', stamp)
order by date_trunc('hour', stamp)
like image 38
Denis de Bernardy Avatar answered Nov 05 '22 08:11

Denis de Bernardy