I managed to write a few scalar functions with Python in Amazon Redshift, i.e. taking one or a few columns as input and returning a single value based on some logic or transformation.
But is there any way to pass all the values of a numeric column (i.e. a list) into a UDF and calculate statistics on them, for example the mean or standard deviation?
In Amazon Redshift, the Python logic is pushed across the MPP system and all the scaling is handled by AWS. The Python execution in Amazon Redshift is done in parallel just as a normal SQL query, so Amazon Redshift will take advantage of all of the CPU cores in your cluster to execute your UDFs.
(For comparison, Snowflake also supports UDFs written in multiple languages, including Python. Its Python UDFs are likewise scalar functions: for each row passed to the UDF, the UDF returns a value, and UDFs accept zero or more parameters.)
You can create a custom scalar user-defined function (UDF) using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available for any user with sufficient privileges to run. You run a custom scalar UDF in much the same way as you run existing Amazon Redshift functions.
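As a minimal sketch of a scalar Python UDF, here is what the registration and body might look like; the function name `f_double` and column names are illustrative assumptions, not from the documentation:

```python
# In Redshift, a scalar Python UDF would be registered roughly like this
# (the body between the $$ markers is plain Python):
#
#   CREATE OR REPLACE FUNCTION f_double (x INTEGER) RETURNS INTEGER
#   IMMUTABLE AS $$
#       return x * 2 if x is not None else None
#   $$ LANGUAGE plpythonu;
#
# and then called like any built-in function:
#   SELECT f_double(col) FROM my_table;

def f_double(x):
    # Same logic as the UDF body above; Redshift passes SQL NULL as Python None
    return x * 2 if x is not None else None
```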
The documentation states that only scalar UDFs are possible (see http://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html).
However, you can work around this, if the value list is not too large, by creating a scalar UDF that takes a string argument and passing it the result of the LISTAGG aggregate function.
e.g.: select udfSum(listagg(val,'|')) from table;
see: http://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html
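A sketch of what the UDF bodies for this workaround could look like: each function receives the '|'-delimited string produced by LISTAGG and returns one number. The function names `udf_mean`/`udf_stddev` and the delimiter are illustrative assumptions, not from the Redshift docs:

```python
import math

# Each body below is what would go between the $$ markers when registering
# the UDF, e.g.:
#   CREATE FUNCTION udf_mean (vals VARCHAR(65535)) RETURNS FLOAT
#   IMMUTABLE AS $$ ... $$ LANGUAGE plpythonu;

def udf_mean(vals):
    # Redshift passes SQL NULL as Python None
    if vals is None:
        return None
    nums = [float(v) for v in vals.split('|') if v]
    return sum(nums) / len(nums) if nums else None

def udf_stddev(vals):
    # Population standard deviation, computed without extra libraries
    if vals is None:
        return None
    nums = [float(v) for v in vals.split('|') if v]
    if not nums:
        return None
    mean = sum(nums) / len(nums)
    return math.sqrt(sum((x - mean) ** 2 for x in nums) / len(nums))
```

Usage mirrors the answer above, e.g. `select udf_mean(listagg(val,'|')) from table;`. Note that LISTAGG output is capped at the VARCHAR maximum (65535 bytes), which is why this only works when the value list is not too large.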