Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create a function with whole columns as input and output

I have several programs written in R that now I need to translate in T-SQL to deliver them to the client. I am new to T-SQL and I'm facing some difficulties in translating all my R functions.

An example is the numerical derivative function, which for two input columns (values and time) would return another column (of same length) with the computed derivative.

My current understanding is:

  1. I can't use SP, because I'll need to use this functions inline with select statement, like: SELECT Customer_ID, Date, Amount, derivative(Amount, Date) FROM Customer_Detail

  2. I can't use UDF, because they can take, as input parameter, only scalar. I'll need vectorised function due to speed and also because for some functions I have, like the one above, running row by row wouldn't be meaningful (for each value it needs the next and the previous)

  3. UDA take whole column but, as the name says..., they will aggregate the column like sum or avg would.

If the above is correct, which other techniques would allow me to create the type of function I need? An example of SQL built-in function similar to what I'm after is square() which (apparently) takes a column and returns itself^2. My goal is creating a library of functions which behave like square, power, etc. But internally it'll be different cause square takes and returns each scalar is read through the rows. I would like to know if is possible to have User Defied with an accumulate method (like the UDA) able to operates on all the data at the end of the import and then return a column of the same length?

NB: At the moment I'm on SQL-Server 2005 but we'll switch soon to 2012 (or possibly 2014 in few months) so answers based on any 2005+ version of SQL-Server are fine.

EDIT: added the R tag for R developers who have, hopefully, already faced such difficulties.

EDIT2: Added CLR tag: I went through CLR user defined aggregate as defined in the Pro t-sql 2005 programmers guide. I already said above that this type of function wouldn't fit my needs but it was worth looking into it. The 4 methods needed by a UDA are: Init, Accumulate, Merge and Terminate. My request would need the whole data being analysed all together by the same instance of the UDA. So options including merge methods to group together partial results from multicore processing won't be working.

like image 419
Michele Avatar asked Sep 26 '13 14:09

Michele


1 Answers

I think you may consider changing your mind a bit. SQL language is very good when working with sets of data, especially modern RDBMS implementations (like SQL Server 2012), but you have to think in sets, not in rows or columns. While I stilldon't know your exact tasks, let's see - SQL Server 2012 have very nice set of window functions + ranking functions + analytic functions + common table expressions, so you can write almost any query inline. You can use chains of common table expression to turn your data any way you want, to calculate running totals, to calculate averages or other aggregates over window and so on.

Actually, I've always liked SQL and when I've learned functional language (ML and Scala) a bit, my thought was that my approach to SQL is very similar to functional language paradigm - just slicing and dicing data without saving anything into variables, untils you have resultset your need.

Just quick example, here's a question from SO - How to get average of the 'middle' values in a group?. The goal was to get the average for each group of the middle 3 values:

TEST_ID TEST_VALUE  GROUP_ID
1       5           1       -+
2       10          1        +- these values for group_id = 1
3       15          1       -+
4       25          2       -+
5       35          2        +- these values for group_id = 2
6       5           2       -+
7       15          2       
8       25          3
9       45          3       -+
10      55          3        +- these values for group_id = 3
11      15          3       -+
12      5           3
13      25          3
14      45          4       +- this value for group_id = 4

For me, it's not an easy task to do in R, but in SQL it could be a really simple query like this:

with cte as (
    select
        *,
        row_number() over(partition by group_id order by test_value) as rn,
        count(*) over(partition by group_id) as cnt
    from test
)
select
    group_id, avg(test_value)
from cte
where
    cnt <= 3 or
    (rn >= cnt / 2 - 1 and rn <= cnt / 2 + 1)
group by group_id

You can also easily expand this query to get 5 values around the middle.

TAke closer look to analytical functions, try to rethink your calculations in terms of window functions, may be it's not so hard to rewrite your R procedures in plain SQL.

Hope it helps.

like image 59
Roman Pekar Avatar answered Oct 30 '22 16:10

Roman Pekar