Suppose I am storing events associated with users in a table as follows (with dt standing in for the timestamp of the event):
| dt | user | event |
| 1 | 1 | A |
| 2 | 1 | D |
| 3 | 1 | B |
| 4 | 1 | C |
| 5 | 1 | B |
| 6 | 2 | B |
| 7 | 2 | B |
| 8 | 2 | A |
| 9 | 2 | A |
| 10 | 2 | C |
Such that we could say: user 1 has the event-sequence ADBCB and user 2 has the event-sequence BBAAC.
The types of questions I would want to answer about these users are very easy to express as regular expressions on the event-sequences, e.g. "which users have an event-sequence matching A.*B?" or "which users have an event-sequence matching A[^C]*B[^C]*D?" etc.
What would be a good SQL technique or operator I could use to answer similar queries over this table structure?
Is there a way to efficiently/dynamically generate a table of user-to-event-sequence which could then be queried with regex?
I am currently looking at using Postgres, but I am curious to know if any of the bigger DBMSs like SQL Server or Oracle have specialized operators for this as well.
With Postgres 9.x this is actually quite easy:
select userid,
string_agg(event, '' order by dt) as event_sequence
from events
group by userid;
Using that result you can now apply a regular expression on the event_sequence:
select *
from (
select userid,
string_agg(event, '' order by dt) as event_sequence
from events
group by userid
) t
where event_sequence ~ 'A.*B'
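If you want to sanity-check a pattern against the sample data without a database at hand, the same aggregate-then-match idea can be mirrored client-side. The following is only a minimal Python sketch, not part of the Postgres solution: the `events` list is copied from the question's table, and `event_sequences` and `users_matching` are hypothetical helper names made up for illustration. It uses `re.search` because the Postgres `~` operator also matches anywhere in the string.

```python
import re
from collections import defaultdict

# Sample rows copied from the question's table: (dt, user, event)
events = [
    (1, 1, "A"), (2, 1, "D"), (3, 1, "B"), (4, 1, "C"), (5, 1, "B"),
    (6, 2, "B"), (7, 2, "B"), (8, 2, "A"), (9, 2, "A"), (10, 2, "C"),
]

def event_sequences(rows):
    """Concatenate each user's events in dt order (like string_agg ... order by dt)."""
    seqs = defaultdict(str)
    for _dt, user, event in sorted(rows):  # tuples sort by dt first
        seqs[user] += event
    return dict(seqs)

def users_matching(rows, pattern):
    """Return the users whose sequence contains a match (unanchored, like ~)."""
    regex = re.compile(pattern)
    return sorted(u for u, s in event_sequences(rows).items() if regex.search(s))

print(event_sequences(events))                   # {1: 'ADBCB', 2: 'BBAAC'}
print(users_matching(events, "A.*B"))            # [1]
print(users_matching(events, "A[^C]*B[^C]*D"))   # []
```

On the sample data only user 1 matches A.*B, and no user matches the second pattern, since no D follows a B in either sequence.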
With Postgres 8.x you need to find a replacement for the string_agg() function (just google for it, there are a lot of examples out there), and you need a sub-select to ensure the ordering of the aggregate, as 8.x does not support an order by in an aggregate function.
I'm not at a computer to write code for this answer, but here's how I would go about a RegEx-based solution in SQL Server:
This should ultimately provide you with the functionality in SQL Server that your original question requests. However, if you're analyzing a very large dataset, this could be quite slow, and there may be better ways to accomplish what you're looking for.