Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL various clean up of string \ varchar

I have to clean up some varchar in the following manner:

  1. Remove special characters such as: !, @, #, $, %, ^, &, *, (, ), }, {, [, ], ",", ., ?, /, ', from a closed list. I've managed to do so with a mass use of replace\regexp_replace but I'm looking for something similar to the one in SQL server.

  2. Remove following numbers but not adjacent ones meaning:

    round 1 --> round
    round1 --> round1
    round 12345 --> round
    round12345 --> round12345

  3. Remove words out of a closed list of words such as: "and", "or", "Ltd.", "international" without substrings:

    more food or drinks ---> more food drinks. and not --> me food or drinks

I'm working on a big table and I'd like to do it as efficient as possible.
Should I write a function that does that or is there a more elegant way?

like image 336
gilibi Avatar asked Dec 26 '22 11:12

gilibi


2 Answers

I'll play along. Here's for question 2:

SELECT trim(regexp_replace(
   '12 foo1 run 1457 ABC 1Foo 2456 bar 34',
   '\s*\m\d+\M\s*',
   ' ',
   'g'
   ));

Returns:

foo1 run ABC 1Foo bar

I updated the answer to use constraint expressions instead of the bracket expressions, because the manual informs:

The constraint escapes described below are usually preferable; they are no more standard, but are easier to type.

\s* .. zero or more white space
\m .. beginning of a word (same as [[:<:]])
\d+ .. 1 or more digits
\M .. end of a word (same as [[:>:]])

The 4th parameter 'g' is needed to replace "globally", not just the first occurrence.

->sqlfiddle for v9.2
->sqlfiddle for v8.4 doesn't work

Why?

standard_conforming_strings. The default changed with v9.1.

This works in both worlds, "compatibility mode" so to say. But the syntax of the modern version above (in combination with standard_conforming_strings = on) is cleaner.

SELECT trim(regexp_replace(
  '12 foo1 run 1457 ABC 1Foo 2456 bar 34',
  E'\\s*\\m\\d+\\M\\s*',
  ' ',
  'g'
));

->sqlfiddle

like image 101
Erwin Brandstetter Avatar answered Dec 29 '22 11:12

Erwin Brandstetter


1. It is much better to replace character which are not in allowed list with spaces, like this:

select regexp_replace(
    E'aśx€ ąsd, dsa w|adwf\n  as dw dgaa[aw] asdd',
    '[^a-zA-Z0-9]',
    ' ',
    'g');

This returns

a x   sd  dsa w adwf   as dw dgaa aw  asdd

There are thousands of possible characters in Unicode — it isn't really possible to list all special characters.

Taking out multiple consecutive spaces left as exercise to the reader.

like image 29
Tometzky Avatar answered Dec 29 '22 12:12

Tometzky