Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract rows based on multiple previous rows' values in SQL Server

Tags:

sql

tsql

I have a table with sample data like this:

ID  Key   User
--  ----  -----
1   a     test
2   ab    test
3   abc   test
4   abcd  test
5   e     test1
6   ef    test1
7   efg   test1
8   efgh  test1
9   t     test1
10  ty    test1
11  tyu   test1
12  tyui  test1

The data consists of sequential "snapshots" of a value as it is being constructed by a user. I want to return the last row for each instance of the user building up a distinct, final Key value. Notice how most rows' Key contain the entire previous row's Key plus an additional letter? I only want the rows that terminate such a sequence and are the longest value possible in each chain of Keys that successively contain the previous Key value.

The above sample data should return the following:

ID  Key   User
--  ----  -----
4   abcd  test
8   efgh  test1
12  tyui  test1

How do I do this?

like image 611
ali Avatar asked Feb 07 '13 07:02

ali


People also ask

How do you use values from previous or next rows in a SQL Server query?

1) You can use MAX or MIN along with OVER clause and add extra condition to it. The extra condition is "ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING" which will fetch previous row value. Check this: SELECT *,MIN(JoiningDate) OVER (ORDER BY JoiningDate ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS EndDate.

How do I select the last 5 records of a table?

METHOD 1 : Using LIMIT clause in descending orderof specified rows from specifies row. We will retrieve last 5 rows in descending order using LIMIT and ORDER BY clauses and finally make the resultant rows ascending. Since Employee table has IDs, we will perform ORDER BY ID in our query.

Can be used to retrieve data from multiple?

In SQL, to fetch data from multiple tables, the join operator is used. The join operator adds or removes rows in the virtual table that is used by SQL server to process data before the other steps of the query consume the data.

Which techniques should be used to obtain a result which is based on comparing one row of a table with another row of the same table?

Comparison of columns in the same table is possible with the help of joins. Here we are comparing all the customers that are in the same city using the self join in SQL. Self-join is a regular join where a table is joined by itself. Similarly, a table may be joined with left join, right join, inner join, and full join.


2 Answers

Without answers to my questions, I had to make these assumptions:

  • The ID column represents chronology and always increases by one with no gaps.
  • SQL Server 2005 or higher

(Update: I made a small tweak that makes this work with "interleaved" data from different users, and added some interleaved and some tricky data to my fiddle.)

So here is my stab at a solution. See it in a SqlFiddle. Of note is that it simulates the LEAD analytic from SQL Server 2012 without a JOIN.

WITH Info AS (
  SELECT
     Grp = Row_Number() OVER (PARTITION BY UserName ORDER BY ID, Which) / 2,
     *
  FROM
     dbo.UserEntry U
     CROSS JOIN (
        SELECT 1 UNION ALL SELECT 2
     ) X (Which)
)
SELECT
   ID = Max(V.ID),
   DataKey = Max(V.DataKey),
   UserName = Max(V.UserName)
FROM
   Info I
   OUTER APPLY (SELECT I.* WHERE Which = 2) V
WHERE I.Grp > 0
GROUP BY
   I.UserName,
   I.Grp
HAVING
   Max(I.DataKey) NOT LIKE Min(I.DataKey) + '_';

Input:

INSERT dbo.UserEntry (ID, DataKey, UserName)
VALUES
(1, 'a', 'test'),
(2, 'ab', 'test'),
(3, 'e', 'test1'),
(4, 'ef', 'test1'),
(5, 'abc', 'test'),
(6, 'abcd', 'test'),
(7, 'efg', 'test1'),
(8, 'efgh', 'test1'),
(9, 't', 'test1'),
(10, 'ty', 'test1'),
(11, 'tyu', 'test1'),
(12, 'tyui', 'test1'),
(13, 't', 'test1'),
(14, 'a', 'test'),
(15, 'a', 'test'),
(16, 'ab', 'test'),
(17, 'abc', 'test'),
(18, 'abcd', 'test'),
(19, 'to', 'test1'), 
(20, 'abcde', 'test'),
(21, 'top', 'test1');

Output:

ID  DataKey  UserName
--  -------  --------
6   abcd     test
8   efgh     test1
12  tyui     test1
14  a        test
20  abcde    test
21  top      test1

Notes: I used different column names because using reserved words as column names is not best practice (it forces you to put square brackets around the names everywhere).

The technique I used will work with a single scan. It has no joins. A properly-constructed join-based query that utilizes appropriate indexes may outperform it slightly in CPU and time, but this solution will have the fewest reads for sure.

Update

While my query may be nice, the particular data structure in this problem lends itself to a very elegant solution I did not consider when I first answered. Thanks to Andriy for the basic idea, here is a dynamite and ultra-simple query (same fiddle as above).

WITH Info AS (
   SELECT
      Grp = Row_Number() OVER (PARTITION BY UserName ORDER BY ID) - Len(DataKey),
         *
   FROM
      dbo.UserEntry U
)
SELECT
   ID = Max(I.ID),
   DataKey = Max(I.DataKey),
   I.UserName
FROM
   Info I
GROUP BY
   I.UserName,
   I.Grp;
like image 142
ErikE Avatar answered Oct 03 '22 06:10

ErikE


Here's another approach:

  1. Use the initial letter as a grouping criterion for a sequence.

  2. Rank the rows separately for every user, sorting them by ID, and subtract the lengths of the Key values from the rankings. Use the results as another sequence grouping criterion.

  3. Rank the rows again, partitioning them this time by user and the criteria from #1 and #2, and sorting them in the descending order of ID.

  4. Fetch the rows with the rankings of 1.

Here's an implementation:

WITH partitioned AS (
  SELECT
    *,
    SeqKey = LEFT([Key], 1),
    SeqGrp = ROW_NUMBER() OVER (
      PARTITION BY UserName
      ORDER BY ID
    ) - LEN([Key])
  FROM dbo.UserEntry
),
ranked AS (
  SELECT
    ID,
    [Key],
    UserName,
    rnk = ROW_NUMBER() OVER (
      PARTITION BY UserName, SeqKey, SeqGrp
      ORDER BY ID DESC
    )
  FROM partitioned
)
SELECT
  ID,
  [Key],
  UserName
FROM ranked
WHERE rnk = 1
;

Like with @ErikE's solution, it is assumed that the ID column defines the order of a sequence. However, the above query will still work correctly if there are gaps in ID values related to the same sequence.

You can try this query at SQL Fiddle as well. (Note: the demo uses @ErikE's DDL.)

like image 24
Andriy M Avatar answered Oct 03 '22 05:10

Andriy M