Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to extract a substring pattern in Postgresql

I have a column with a lot of inconsistent strings. Some of them contain a substring with a consistent pattern of '2015mmdd_AB_CD_EFG_(text)_(text)_HIJ' which I would like to extract. I feel this is a cross over case of regexp and a substring command.

My best approach so far has been a rather ugly

substring(col_name, '........_.._.._..._.+_.+_...')

which does not end the output as desired, rahter the output is like '(...)_HIJ_blablabla'.

How can I effectively combine pattern recognition and substring selection in this case?

like image 239
thenaturalist Avatar asked Jul 13 '15 10:07

thenaturalist


1 Answers

Assuming that 2015mmdd actually means some kind of "date", so that the real data contains e.g. 20150713 the following will do:

substring(col_name, '[0-9]{8}_[A-Z]{2}_[A-Z]{2}_[A-Z]{3}_\([a-z]+\)_\([a-z]+\)')

This returns substrings that start with 8 numbers followed by an underscore, followed by two uppercase characters followed by an underscore followed by two uppercase characters, followed by an underscore followed by three uppercase characters, followed by an underscore followed by an opening parentheses followed by at lease one lowercase letter, followed by a closing parentheses, followed by an underscore, followed by an opening parentheses, followed by at least one lowercase character followed by a closing parentheses.

If 2015mmdd indeed means 2015 followed by the string mmdd then you need this:

substring(col_name, '[0-9]{4}mmdd_[A-Z]{2}_[A-Z]{2}_[A-Z]{3}_\([a-z]+\)_\([a-z]+\)')
like image 136
a_horse_with_no_name Avatar answered Sep 20 '22 11:09

a_horse_with_no_name