Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to model process and status history in a data warehouse?

Let's say that we have D_PROCESS, D_WORKER and D_STATUS as dimensions, and the fact F_EVENT that links a process (what) with a worker (who's in charge) and the "current" status.

The process status changes over time. Shoud we store in F_EVENT one line per process/status/worker, or one line per process/worker, and "somewhere else" one line per status change for a given process/worker?

I'm new to Datawarehouse and it's hard to find best practices/tutorial related to data modelization.

like image 902
Francois Avatar asked Jun 20 '12 07:06

Francois


1 Answers

Read The Data Warehouse Toolkit by Ralph Kimball for a good introduction to dimensional modeling.

It sounds like you are storing a process change event in F_EVENT. If this process has a defined beginning and end, I would build a snapshot fact table which would let you track the process over time (simply updating the row each time the process moves from one step to another).

EDIT:

I'll try to make a general case using your dimensions as examples.

For D_PROCESS, modeling a "process" isn't usually modeled as a dimension, and you called it a "what", so I'm going to rename this to "D_ACCOUNT".

The basic data model will be for a "tax processing system" in which WORKERS are processing ACCOUNTS, and each ACCOUNT/WORKER combination has several possible "STATUSES" of where this process currently stands.

D_ACCOUNT
    ACCOUNT_NUMBER
    ACCOUNT_TYPE

D_WORKER
    WORKER_ID
    FIRST_NAME
    LAST_NAME
    BADGE_NUMBER
    SHIFT

D_STATUS
    STATUS_ID
    STATUS_NAME

Now if I want to report on all "events" that have happened to an Account, performed by a worker, I can build a Transaction-level fact table F_EVENT:

F_EVENT
    ACCOUNT_ID
    WORKER_ID
    STATUS_ID
    EVENT_TIME_ID
    Metrics taken at time of the measurement (Cost, Worker time spent, etc)

We call the unique combination of dimensions that identifies a row the Granularity or Grain of the fact table.

The grain of this table is Account, Worker, Status, and Time. It answer questions like "How much time did my workers on shift 3 spend processing accounts on Wednesday?" or "How many events occured that changed the processing status to "CLOSED"?

I'm not sure how much this type of table would help.

Instead, say you are interested in tracking the process itself as it moves through various statuses. I'm going to assume that the status always moves forward in time, from "NOT STARTED" to "IN PROCESS" to "CLOSED".

I'll build what Kimball calls an "Accumulating Snapshot Fact table.

F_TAXPROCESSING
    ACCOUNT_ID
    WORKER_ID 
    CURRENT_STATUS_ID
    NOT_STARTED_DTTM
    NOT_STARTED_FLAG
    IN_PROCESS_DTTM
    IN_PROCESS_FLAG
    CLOSED_DTTM
    CLOSED_FLAG

This table's grain is Account, Worker. This table keeps track of the "process" by updating the date/time of the change to the status, and a flag when that status has been reached.

This allows you to track the process over time, allowing you to see how many accounts have reacted the "IN PROCESS" status, how long it took to get there, et cetera.

like image 165
N West Avatar answered Sep 23 '22 18:09

N West