Querying sequences of rows in SQL

Tags: regex, sql

Suppose I am storing events associated with users in a table as follows (with dt standing in for the timestamp of the event):

| dt | user | event |
|  1 |  1   |   A   |
|  2 |  1   |   D   |
|  3 |  1   |   B   |
|  4 |  1   |   C   |
|  5 |  1   |   B   |
|  6 |  2   |   B   |
|  7 |  2   |   B   |
|  8 |  2   |   A   |
|  9 |  2   |   A   |
| 10 |  2   |   C   |

Such that we could say:

  • user 1 has an event-sequence of ADBCB
  • user 2 has event-sequence BBAAC

The types of questions I would want to answer about these users are very easy to express as regular expressions on the event-sequences, e.g. "which users have an event-sequence matching A.*B?" or "which users have an event-sequence matching A[^C]*B[^C]*D?" etc.

What would be a good SQL technique or operator I could use to answer similar queries over this table structure?

Is there a way to efficiently/dynamically generate a table of user-to-event-sequence which could then be queried with regex?

I am currently looking at using Postgres, but I am curious to know if any of the bigger DBMSs like SQL Server or Oracle have specialized operators for this as well.

nicolaskruchten asked Apr 24 '11 14:04



2 Answers

With Postgres 9.x this is actually quite easy:

select userid, 
       string_agg(event, '' order by dt) as event_sequence
from events
group by userid;

Using that result you can now apply a regular expression on the event_sequence:

select * 
from (
  select userid, 
         string_agg(event, '' order by dt) as event_sequence
  from events
  group by userid
) t
where event_sequence ~ 'A.*B'
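
To try this against the sample data from the question, here is a self-contained sketch that assumes Postgres 9.0 or later. The inlined events CTE and the use of a having clause instead of a sub-select are my own choices for illustration, not part of the query above:

```sql
-- Sample data from the question, inlined so the query runs on its own.
-- The column is named userid here (as in the query above) because
-- user is a reserved word in Postgres.
with events (dt, userid, event) as (
  values (1, 1, 'A'), (2, 1, 'D'), (3, 1, 'B'), (4, 1, 'C'), (5, 1, 'B'),
         (6, 2, 'B'), (7, 2, 'B'), (8, 2, 'A'), (9, 2, 'A'), (10, 2, 'C')
)
select userid,
       string_agg(event, '' order by dt) as event_sequence
from events
group by userid
-- ~ is an unanchored match; wrap the pattern in ^...$ to match the whole sequence.
having string_agg(event, '' order by dt) ~ 'A.*B';
```

With this data the query should return only user 1 (ADBCB), since user 2's sequence BBAAC has no B after an A. Swapping in the pattern 'A[^C]*B[^C]*D' should return no rows, because neither sequence has a D after a B.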

With Postgres 8.x you need to find a replacement for the string_agg() function (there are many examples online), and you need a sub-select to ensure the ordering of the aggregate, because 8.x does not support an order by inside an aggregate function.
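
A minimal sketch of one such replacement, assuming Postgres 8.4, where array_agg() and array_to_string() are available. The alias ordered_events is a name I made up for illustration. Feeding the aggregate from a sorted sub-select usually preserves the order, but Postgres does not strictly guarantee it, so treat this as a workaround rather than a definitive solution:

```sql
select *
from (
  -- array_agg() collects the events, array_to_string() joins them with no separator.
  select userid,
         array_to_string(array_agg(event), '') as event_sequence
  from (
    -- Sort in a sub-select, because 8.x has no order by inside the aggregate.
    select userid, event
    from events
    order by userid, dt
  ) ordered_events
  group by userid
) t
where event_sequence ~ 'A.*B';
```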

a_horse_with_no_name answered Sep 27 '22 20:09


I'm not at a computer to write code for this answer, but here's how I would go about a RegEx-based solution in SQL Server:

  1. Build a string from the result set. Something like http://blog.sqlauthority.com/2009/11/25/sql-server-comma-separated-values-csv-from-table-column/ should work if you omit the comma.
  2. Run your RegEx match against the resulting string. SQL Server does not provide this functionality natively, but you can use a CLR function for this purpose, as described at http://www.ideaexcursion.com/2009/08/18/sql-server-regular-expression-clr-udf/

This should ultimately give you the functionality in SQL Server that your original question asks for. However, if you're analyzing a very large dataset, this could be quite slow, and there may be better ways to accomplish what you're looking for.
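
As a rough, untested sketch of the first step in T-SQL, one common approach is FOR XML PATH('') concatenation. This is my assumption about how the string could be built, not necessarily the method in the linked post, and it assumes the same events(dt, userid, event) layout used in the Postgres answer. For the simplest pattern from the question, a plain LIKE can stand in for the regex; anything like A[^C]*B[^C]*D would still need the CLR function from step 2:

```sql
select s.userid, s.event_sequence
from (
  select u.userid,
         -- '' + e.event leaves the column unnamed, so FOR XML PATH('')
         -- emits the values back to back with no element tags.
         (select '' + e.event
          from events e
          where e.userid = u.userid
          order by e.dt
          for xml path('')) as event_sequence
  from (select distinct userid from events) u
) s
-- Equivalent to the unanchored regex A.*B: an A followed somewhere later by a B.
where s.event_sequence like '%A%B%';
```

Note that FOR XML entity-encodes characters such as & and <, which does not matter for single-letter event codes but would for arbitrary event names.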

Taylor Gerring answered Sep 27 '22 22:09