Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to extract two consecutive digits from a text field in MySQL?

I have a MySQL database and I have a query as:

SELECT `id`, `originaltext` FROM `source` WHERE `originaltext` regexp '[0-9][0-9]'

This detects all originaltexts which have numbers with 2 digits in it.

I need MySQL to return those numbers as a field, so i can manipulate them further.

Ideally, if I can add additional criteria that is should be > 20 would be great, but i can do that separately as well.

like image 298
Steve Avatar asked Mar 19 '11 10:03

Steve


4 Answers

If you want more regular expression power in your database, you can consider using LIB_MYSQLUDF_PREG. This is an open source library of MySQL user functions that imports the PCRE library. LIB_MYSQLUDF_PREG is delivered in source code form only. To use it, you'll need to be able to compile it and install it into your MySQL server. Installing this library does not change MySQL's built-in regex support in any way. It merely makes the following additional functions available:

PREG_CAPTURE extracts a regex match from a string. PREG_POSITION returns the position at which a regular expression matches a string. PREG_REPLACE performs a search-and-replace on a string. PREG_RLIKE tests whether a regex matches a string.

All these functions take a regular expression as their first parameter. This regular expression must be formatted like a Perl regular expression operator. E.g. to test if regex matches the subject case insensitively, you'd use the MySQL code PREG_RLIKE('/regex/i', subject). This is similar to PHP's preg functions, which also require the extra // delimiters for regular expressions inside the PHP string.

If you want something more simpler, you could alter this function to suit better your needs.

CREATE FUNCTION REGEXP_EXTRACT(string TEXT, exp TEXT)
-- Extract the first longest string that matches the regular expression
-- If the string is 'ABCD', check all strings and see what matches: 'ABCD', 'ABC', 'AB', 'A', 'BCD', 'BC', 'B', 'CD', 'C', 'D'
-- It's not smart enough to handle things like (A)|(BCD) correctly in that it will return the whole string, not just the matching token.

RETURNS TEXT
DETERMINISTIC
BEGIN
  DECLARE s INT DEFAULT 1;
  DECLARE e INT;
  DECLARE adjustStart TINYINT DEFAULT 1;
  DECLARE adjustEnd TINYINT DEFAULT 1;

  -- Because REGEXP matches anywhere in the string, and we only want the part that matches, adjust the expression to add '^' and '$'
  -- Of course, if those are already there, don't add them, but change the method of extraction accordingly.

  IF LEFT(exp, 1) = '^' THEN 
    SET adjustStart = 0;
  ELSE
    SET exp = CONCAT('^', exp);
  END IF;

  IF RIGHT(exp, 1) = '$' THEN
    SET adjustEnd = 0;
  ELSE
    SET exp = CONCAT(exp, '$');
  END IF;

  -- Loop through the string, moving the end pointer back towards the start pointer, then advance the start pointer and repeat
  -- Bail out of the loops early if the original expression started with '^' or ended with '$', since that means the pointers can't move
  WHILE (s <= LENGTH(string)) DO
    SET e = LENGTH(string);
    WHILE (e >= s) DO
      IF SUBSTRING(string, s, e) REGEXP exp THEN
        RETURN SUBSTRING(string, s, e);
      END IF;
      IF adjustEnd THEN
        SET e = e - 1;
      ELSE
        SET e = s - 1; -- ugh, such a hack to end it early
      END IF;
    END WHILE;
    IF adjustStart THEN
      SET s = s + 1;
    ELSE
      SET s = LENGTH(string) + 1; -- ugh, such a hack to end it early
    END IF;
  END WHILE;

  RETURN NULL;

END
like image 142
Pentium10 Avatar answered Nov 01 '22 12:11

Pentium10


There isn't any syntax in MySQL for extracting text using regular expressions. You can use the REGEXP to identify the rows containing two consecutive digits, but to extract them you have to use the ordinary string manipulation functions which is very difficult in this case.

Alternatives:

  • Select the entire value from the database then use a regular expression on the client.
  • Use a different database that has better support for the SQL standard (may not be an option, I know). Then you can use this: SUBSTRING(originaltext from '%#[0-9]{2}#%' for '#').
like image 31
Mark Byers Avatar answered Nov 01 '22 10:11

Mark Byers


I'm having the same issue, and this is the solution I found (but it won't work in all cases) :

  • use LOCATE() to find the beginning and the end of the string you wan't to match
  • use MID() to extract the substring in between...
  • keep the regexp to match only the rows where you are sure to find a match.
like image 2
Greg Avatar answered Nov 01 '22 10:11

Greg


I used my code as a Stored Procedure (Function), shall work to extract any number built from digits in a single block. This is a part of my wider library.

DELIMITER $$

--  2013.04 [email protected]
--  FindNumberInText("ab 234 95 cd", TRUE) => 234  
--  FindNumberInText("ab 234 95 cd", FALSE) => 95

DROP FUNCTION IF EXISTS FindNumberInText$$
CREATE FUNCTION FindNumberInText(_input VARCHAR(64), _fromLeft BOOLEAN) RETURNS VARCHAR(32)
BEGIN
  DECLARE _r              VARCHAR(32) DEFAULT '';
  DECLARE _i              INTEGER DEFAULT 1;
  DECLARE _start          INTEGER DEFAULT 0;
  DECLARE _IsCharNumeric  BOOLEAN;

  IF NOT _fromLeft THEN SET _input = REVERSE(_input); END IF;
  _loop: REPEAT
    SET _IsCharNumeric = LOCATE(MID(_input, _i, 1), "0123456789") > 0;
    IF _IsCharNumeric THEN
      IF _start = 0 THEN SET _start  = _i;  END IF;
    ELSE
      IF _start > 0 THEN LEAVE _loop;       END IF;
    END IF;
    SET _i = _i + 1;
  UNTIL _i > length(_input) END REPEAT;

  IF _start > 0 THEN
    SET _r = MID(_input, _start, _i - _start);
    IF NOT _fromLeft THEN SET _r = REVERSE(_r);  END IF;
  END IF;
  RETURN _r;
END$$
like image 2
m227 Avatar answered Nov 01 '22 11:11

m227