
Oracle - need to extract text between given strings

Example - need to extract everything between "Begin begin" and "End end". I tried this way:

with phrases as (
  select 'stackoverflow is awesome. Begin beginHello, World!End end It has everything!' as phrase
    from dual
         )
select regexp_replace(phrase
     , '([[:print:]]+Begin begin)([[:print:]]+)(End end[[:print:]]+)', '\2')
  from phrases
       ;

Result: Hello, World!

However, it fails if my text contains newline characters. Any tips on how to fix this so that text containing newlines can also be extracted?

[edit] How it fails:

with phrases as (
  select 'stackoverflow is awesome. Begin beginHello, 
  World!End end It has everything!' as phrase
    from dual
         )
select regexp_replace(phrase
     , '([[:print:]]+Begin begin)([[:print:]]+)(End end[[:print:]]+)', '\2')
  from phrases
       ;

Result:

stackoverflow is awesome. Begin beginHello, World!End end It has everything!

Should be:

Hello,
World!

[edit]

Another issue. Consider this sample:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTESTESTES' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
  FROM phrases;

Result:

Hello,
World!End end It has everything!

So it matches the last occurrence of the end string, which is not what I want. The substring should be extracted up to the first occurrence of my end label, so the result should be:

Hello,
World!

Everything after the first occurrence of the end label should be ignored. Any ideas?

user1209216 asked Feb 23 '15 13:02




2 Answers

I'm not that familiar with the POSIX [[:print:]] character class, but I got your query working using the . wildcard instead. You need to specify the n match parameter in REGEXP_REPLACE() so that . can match the newline character:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
  FROM phrases;

I used the \1 backreference as I didn't see the need to capture the other groups from the regular expression. It might also be a good idea to use the * quantifier (instead of +) in case there is nothing preceding or following the delimiters. If you want to capture all of the groups then you can use the following:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '(.+Begin begin)(.+)(End end.+)', '\2', 1, 1, 'n')
  FROM phrases;
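
To illustrate the remark above about * versus +, here is a minimal sketch. The input string is a hypothetical one in which nothing precedes "Begin begin" and nothing follows "End end"; the column aliases are also just for illustration.

```sql
-- Hypothetical input: the delimiters sit at the very start and end of the string.
WITH phrases AS (
  SELECT 'Begin beginHello, World!End end' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n') AS with_plus
     , REGEXP_REPLACE(phrase, '.*Begin begin(.+)End end.*', '\1', 1, 1, 'n') AS with_star
  FROM phrases;
```

With +, the pattern requires at least one character on each side of the delimiters, so it should not match here and REGEXP_REPLACE() should return the input unchanged. With *, the surrounding text is optional and the captured group should come back on its own.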

UPDATE - FYI, I tested with [[:print:]] and it doesn't work. This is not surprising, since [[:print:]] is supposed to match printable characters only. It doesn't match anything with an ASCII value below 32 (the space character), and that includes the newline (ASCII 10). You need to use . instead.
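
A quick way to check this on your own instance is the sketch below. It only uses REGEXP_LIKE() against single characters; the column aliases are made up for illustration.

```sql
-- CHR(10) is the newline character; ' ' is the space (ASCII 32).
SELECT CASE WHEN REGEXP_LIKE(CHR(10), '[[:print:]]')
            THEN 'match' ELSE 'no match' END AS newline_is_print
     , CASE WHEN REGEXP_LIKE(' ', '[[:print:]]')
            THEN 'match' ELSE 'no match' END AS space_is_print
  FROM dual;
```

I would expect "no match" for the newline and "match" for the space, which is why the original pattern stops working as soon as the text spans more than one line.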

UPDATE #2 - per the update to the question - I don't think a regex alone will work the way you want it to. Adding a lazy quantifier to (.+) in this pattern had no effect, and Oracle regular expressions don't have lookahead. There are a couple of things you might do. One is to use INSTR() and SUBSTR():

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT SUBSTR(phrase, str_start, str_end - str_start) FROM (
    SELECT INSTR(phrase, 'Begin begin') + LENGTH('Begin begin') AS str_start
         , INSTR(phrase, 'End end') AS str_end, phrase
      FROM phrases
);
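
One possible refinement, sketched here under the assumption that "End end" could also appear before "Begin begin" (the sample string below is hypothetical): pass str_start as the third argument of INSTR() so the search for the end label only begins after the start label.

```sql
WITH phrases AS (
  -- Hypothetical input with a stray 'End end' before the start label.
  SELECT 'End end noise. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT SUBSTR(phrase, str_start,
              INSTR(phrase, 'End end', str_start) - str_start) AS extracted
  FROM (
    SELECT phrase
         , INSTR(phrase, 'Begin begin') + LENGTH('Begin begin') AS str_start
      FROM phrases
);
```

This sketch still assumes both labels are present. If either one is missing, INSTR() returns 0 and the arithmetic no longer points at the right positions, so you may want to guard against that case.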

Another is to combine INSTR() and SUBSTR() with a regular expression:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(SUBSTR(phrase, 1, INSTR(phrase, 'End end') + LENGTH('End end')), '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
  FROM phrases;
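
A regex-only variant that may be worth testing is sketched below. It assumes Oracle 11g or later, where REGEXP_SUBSTR() accepts a sixth argument that selects a subexpression. It is not the pattern from the question: the leading and trailing .+ are dropped, so the lazy (.+?) is the only quantifier in the pattern. Oracle is reported to let the first quantifier influence whether later ones behave greedily, which might explain why adding ? after a leading greedy .+ appeared to do nothing. Treat that explanation as an assumption and verify on your version.

```sql
WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT REGEXP_SUBSTR(phrase
                   , 'Begin begin(.+?)End end'
                   , 1    -- start at the first character
                   , 1    -- first occurrence of the pattern
                   , 'n'  -- let . match the newline character
                   , 1    -- return only the first captured group
                   ) AS extracted
  FROM phrases;
```

If the lazy quantifier behaves as described, the match should stop at the first "End end" and only the text between the two labels is returned.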
David Faber answered Sep 28 '22 06:09


Try this regex:

([[:print:]]+Begin begin)(.+?)(End end[[:print:]]+)

Sample usage:

SELECT regexp_replace(
         phrase ,
         '([[:print:]]+Begin begin)(.+?)(End end[[:print:]]+)',
         '\2',
         1,  -- Start at the beginning of the phrase
         0,  -- Replace ALL occurrences
         'n' -- Let the dot metacharacter match the newline character
)
FROM
  (SELECT 'stackoverflow is awesome. Begin beginHello, '
    || chr(10)
    || ' World!End end It has everything!' AS phrase
  FROM DUAL
  )

By default, the dot metacharacter (.) matches any character in the database character set except the newline character. For the dot to match newlines as well, the match_parameter argument of regexp_replace must contain the n switch.
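
The effect of the n switch can be seen in isolation with a small sketch like this one. The three-character test string and the column aliases are made up for illustration.

```sql
-- Replace every character matched by . with X, without and with the n switch.
SELECT REGEXP_REPLACE('a' || CHR(10) || 'b', '.', 'X')            AS without_n
     , REGEXP_REPLACE('a' || CHR(10) || 'b', '.', 'X', 1, 0, 'n') AS with_n
  FROM dual;
```

Without n, I would expect the newline to be left alone, so only the two letters are replaced. With n, all three characters should be replaced.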

Stephan answered Sep 28 '22 07:09