
Is there a simple way to extract numbers from a string following certain rules?

I need to pull numbers from a string and put them into a list. There are some rules to this, however, such as identifying whether the extracted number is an Integer or a Float.

The task sounds simple enough but I am finding myself more and more confused as time goes by and could really do with some guidance.


Take the following test string as an example:

There are test values: P7 45.826.53.91.7, .5, 66.. 4 and 5.40.3.

The rules to follow when parsing the string are as follows:

  • Numbers cannot be preceded by a letter.

  • If a number is found and is not followed by a decimal point, then the number is an Integer.

  • If a number is found and is followed by a decimal point, then the number is a Float, e.g. 5.

  • ~ If more digits follow the decimal point, then the number is still a Float, e.g. 5.40

  • ~ A further decimal point should then break up the number, e.g. 5.40.3 becomes (5.40 Float) and (3 Float)

  • If, for example, a letter follows a decimal point (e.g. 3.H), then still add 3. as a Float to the list (even though technically it is not valid)

Example 1

To make this a little clearer, taking the test string quoted above, the desired output should be as follows:

[image: the test string with the extracted numbers highlighted]

In the image above, light blue marks Float numbers and pale red marks single Integers (note also how Floats joined together are split into separate Floats).

  • 45.826 (Float)
  • 53.91 (Float)
  • 7 (Integer)
  • 5 (Integer)
  • 66 . (Float)
  • 4 (Integer)
  • 5.40 (Float)
  • 3 . (Float)

Note there are deliberate spaces between 66 . and 3 . above due to the way the numbers were formatted.

Example 2:

Anoth3r Te5.t string .4 abc 8.1Q 123.45.67.8.9

[image: the second test string with the extracted numbers highlighted]

  • 4 (Integer)
  • 8.1 (Float)
  • 123.45 (Float)
  • 67.8 (Float)
  • 9 (Integer)

To give a better idea, I created a new project whilst testing which looks like this:

[screenshot: the test project form]


Now onto the actual task. I thought maybe I could read each character from the string and identify what are valid numbers as per the rules above, and then pull them into a list.

To my ability, this was the best I could manage:

[screenshot: the test project showing desired and actual output]

The code is as follows:

unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, FileUtil, Forms, Controls, Graphics, Dialogs, StdCtrls;

type
  TForm1 = class(TForm)
    btnParseString: TButton;
    edtTestString: TEdit;
    Label1: TLabel;
    Label2: TLabel;
    Label3: TLabel;
    lstDesiredOutput: TListBox;
    lstActualOutput: TListBox;
    procedure btnParseStringClick(Sender: TObject);
  private
    FDone: Boolean;
    FIdx: Integer;
    procedure ParseString(const Str: string; var OutValue, OutKind: string);
  public
    { public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

procedure TForm1.ParseString(const Str: string; var OutValue, OutKind: string);
var
  CH1, CH2: Char;
begin
  Inc(FIdx);
  CH1 := Str[FIdx];

  case CH1 of
    '0'..'9': // Found a number
    begin
      CH2 := Str[FIdx - 1];
      if not (CH2 in ['A'..'Z']) then
      begin
        OutKind := 'Integer';

        // Try to determine float...

        //while (CH1 in ['0'..'9', '.']) do
        //begin
        //  case Str[FIdx] of
        //    '.':
        //    begin
        //      CH2 := Str[FIdx + 1];
        //      if not (CH2 in ['0'..'9']) then
        //      begin
        //        OutKind := 'Float';
        //        //Inc(FIdx);
        //      end;
        //    end;
        //  end;
        //end;
      end;
      OutValue := Str[FIdx];
    end;
  end;

  FDone := FIdx = Length(Str);
end;

procedure TForm1.btnParseStringClick(Sender: TObject);
var
  S, SKind: string;
begin
  lstActualOutput.Items.Clear;
  FDone := False;
  FIdx := 0;

  repeat
    ParseString(edtTestString.Text, S, SKind);
    if (S <> '') and (SKind <> '') then
    begin
      lstActualOutput.Items.Add(S + ' (' + SKind + ')');
    end;
  until
    FDone = True;
end;

end.

It clearly doesn't give the desired output (the failed code has been commented out) and my approach is likely wrong, but I feel I only need to make a few changes here and there for a working solution.

At this point I find myself rather confused and quite lost, despite thinking the answer is close. The task is becoming increasingly infuriating and I would really appreciate some help.


EDIT 1

Here I got a little closer, as there are no longer duplicate numbers, but the result is still clearly wrong.

[screenshot: the updated test project output]

unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, FileUtil, Forms, Controls, Graphics, Dialogs, StdCtrls;

type
  TForm1 = class(TForm)
    btnParseString: TButton;
    edtTestString: TEdit;
    Label1: TLabel;
    Label2: TLabel;
    Label3: TLabel;
    lstDesiredOutput: TListBox;
    lstActualOutput: TListBox;
    procedure btnParseStringClick(Sender: TObject);
  private
    FDone: Boolean;
    FIdx: Integer;
    procedure ParseString(const Str: string; var OutValue, OutKind: string);
  public
    { public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

// Prepare to pull hair out!
procedure TForm1.ParseString(const Str: string; var OutValue, OutKind: string);
var
  CH1, CH2: Char;
begin
  Inc(FIdx);
  CH1 := Str[FIdx];

  case CH1 of
    '0'..'9': // Found the start of a new number
    begin
      CH1 := Str[FIdx];

      // make sure previous character is not a letter
      CH2 := Str[FIdx - 1];
      if not (CH2 in ['A'..'Z']) then
      begin
        OutKind := 'Integer';

        // Try to determine float...
        //while (CH1 in ['0'..'9', '.']) do
        //begin
        //  OutKind := 'Float';
        //  case Str[FIdx] of
        //    '.':
        //    begin
        //      CH2 := Str[FIdx + 1];
        //      if not (CH2 in ['0'..'9']) then
        //      begin
        //        OutKind := 'Float';
        //        Break;
        //      end;
        //    end;
        //  end;
        //  Inc(FIdx);
        //  CH1 := Str[FIdx];
        //end;
      end;
      OutValue := Str[FIdx];
    end;
  end;

  OutValue := Str[FIdx];
  FDone := Str[FIdx] = #0;
end;

procedure TForm1.btnParseStringClick(Sender: TObject);
var
  S, SKind: string;
begin
  lstActualOutput.Items.Clear;
  FDone := False;
  FIdx := 0;

  repeat
    ParseString(edtTestString.Text, S, SKind);
    if (S <> '') and (SKind <> '') then
    begin
      lstActualOutput.Items.Add(S + ' (' + SKind + ')');
    end;
  until
    FDone = True;
end;

end.

My question is: how can I extract numbers from a string, add them to a list, and determine whether each number is an integer or a float?

The left pale green listbox (desired output) shows what the results should be; the right pale blue listbox (actual output) shows what we actually got.

Please advise. Thanks.

Note: I re-added the Delphi tag as I do use XE7, so please don't remove it. Although this particular problem is in Lazarus, my eventual solution should work for both XE7 and Lazarus.

Craig asked Oct 31 '16 13:10


4 Answers

Your rules are rather complex, so you can try to build a finite state machine (FSM; more precisely a DFA, deterministic finite automaton).

Every character causes a transition between states.

For example, when you are in the state "integer started" and meet a space character, you yield the integer value and the FSM goes into the state "anything wanted".

If you are in the state "integer started" and meet '.', the FSM goes into the state "float or integer list started", and so on.
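
The transitions described above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the state names (stIdle, stSkip, stInt, stFloat), the ExtractNumbers procedure and the Emit helper are all made up for this example, and it assumes that digits following a letter stay ignored until a character that is neither a letter nor a digit is seen.

```pascal
program FsmSketch;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

uses
  Classes, SysUtils;

type
  // Hypothetical state names, chosen for this sketch only
  TState = (stIdle, stSkip, stInt, stFloat);

const
  Letters = ['A'..'Z', 'a'..'z'];
  Digits  = ['0'..'9'];

procedure ExtractNumbers(const S: string; Dest: TStrings);
var
  State: TState;
  Token: string;
  i: Integer;
  Ch: Char;

  // Adds the pending token (if any) to the list, labelled by the current state
  procedure Emit;
  begin
    if State = stInt then
      Dest.Add(Token + ' (Integer)')
    else if State = stFloat then
      Dest.Add(Token + ' (Float)');
    Token := '';
  end;

begin
  State := stIdle;
  Token := '';
  for i := 1 to Length(S) do
  begin
    Ch := S[i];
    case State of
      stIdle, stSkip:
        if Ch in Letters then
          State := stSkip            // digits after a letter are ignored
        else if Ch in Digits then
        begin
          if State = stIdle then     // a number may only start when idle
          begin
            Token := Ch;
            State := stInt;
          end;
        end
        else
          State := stIdle;           // any other character ends the skip
      stInt, stFloat:
        if Ch in Digits then
          Token := Token + Ch
        else if (Ch = '.') and (State = stInt) then
        begin
          Token := Token + Ch;       // first dot turns the integer into a float
          State := stFloat;
        end
        else
        begin
          Emit;                      // a second dot or any other char ends the number
          if Ch in Letters then
            State := stSkip
          else
            State := stIdle;
        end;
    end;
  end;
  Emit; // flush a number that runs up to the end of the string
end;

var
  L: TStringList;
  Line: string;
begin
  L := TStringList.Create;
  try
    ExtractNumbers('There are test values: P7 45.826.53.91.7, .5, 66.. 4 and 5.40.3.', L);
    for Line in L do
      WriteLn(Line);
  finally
    L.Free;
  end;
end.
```

Traced by hand against the first test string, this should list 45.826, 53.91, 7, 5, 66., 4, 5.40 and 3. with the Integer/Float labels from the question. Keeping the whole transition logic in one case statement makes it easy to see which character classes move the machine from one state to another.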

MBo answered Oct 10 '22 06:10


There are so many basic errors in your code I decided to correct your homework, as it were. This is still not a good way to do it, but at least the basic errors are removed. Take care to read the comments!

procedure TForm1.ParseString(const Str: string; var OutValue,
  OutKind: string);
//var
//  CH1, CH2: Char;      <<<<<<<<<<<<<<<< Don't need these
begin
  (*************************************************
   *                                               *
   * This only corrects the 'silly' errors. It is  *
   * NOT being passed off as GOOD code!            *
   *                                               *
   *************************************************)

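  Inc(FIdx);
  // CH1 := Str[FIdx]; <<<<<<<<<<<<<<<<<< Not needed but OK to use. I removed them because they seemed to cause confusion...
  OutKind := 'None';
  OutValue := '';

  // >>>>>>>>>>> never index past the end of the string (this also covers an empty string)
  if FIdx > Length( Str ) then
  begin
    FDone := TRUE;
    exit;
  end;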

  try
  case Str[FIdx] of
    '0'..'9': // Found the start of a new number
    begin
      // CH1 := Str[FIdx]; <<<<<<<<<<<<<<<<<<<< Not needed

      // make sure previous character is not a letter
      // >>>>>>>>>>> make sure we are not at beginning of file
      if FIdx > 1 then
      begin
        //CH2 := Str[FIdx - 1];
        if (Str[FIdx - 1] in ['A'..'Z', 'a'..'z']) then // <<<<< don't forget lower case!
        begin
          exit; // <<<<<<<<<<<<<<
        end;
      end;
      // else we have a digit and it is not preceded by a letter, so must be at least integer
      OutKind := 'Integer';

      // <<<<<<<<<<<<<<<<<<<<< WHAT WE HAVE SO FAR >>>>>>>>>>>>>>
      OutValue := Str[FIdx];
      // <<<<<<<<<<<<< Carry on...
      inc( FIdx );
      // Try to determine float...

      while (FIdx <= Length( Str )) and  (Str[ FIdx ] in ['0'..'9', '.']) do // <<<<< note: not CH1!
      begin
        OutValue := Outvalue + Str[FIdx]; //<<<<<<<<<<<<<<<<<<<<<< Note you were storing just 1 char. EVER!
        //>>>>>>>>>>>>>>>>>>>>>>>>>  OutKind := 'Float';  ***** NO! *****
        case Str[FIdx] of
          '.':
          begin
            OutKind := 'Float';
            // now just copy any remaining integers - that is all rules ask for
            inc( FIdx );
            while (Fidx <= Length( Str )) and  (Str[ FIdx ] in ['0'..'9']) do // <<<<< note '.' excluded here!
            begin
              OutValue := Outvalue + Str[FIdx];
              inc( FIdx );
            end;
            exit;
          end;
            // >>>>>>>>>>>>>>>>>>> all the rest is unnecessary
            //CH2 := Str[FIdx + 1];
            //      if not (CH2 in ['0'..'9']) then
            //      begin
            //        OutKind := 'Float';
            //        Break;
            //      end;
            //    end;
            //  end;
            //  Inc(FIdx);
            //  CH1 := Str[FIdx];
            //end;

        end;
        inc( fIdx );
      end;

    end;
  end;

  // OutValue := Str[FIdx]; <<<<<<<<<<<<<<<<<<<<< NO! Only ever gives 1 char!
  // FDone := Str[FIdx] = #0; <<<<<<<<<<<<<<<<<<< NO! #0 does NOT terminate Delphi strings

  finally   // <<<<<<<<<<<<<<< Try.. finally clause added to make sure FDone is always evaluated.
            // <<<<<<<<<< Note there are better ways!
    if FIdx > Length( Str ) then
    begin
      FDone := TRUE;
    end;
  end;
end;

Dsm answered Oct 10 '22 08:10


You have got answers and comments that suggest using a state machine, and I support that fully. From the code you show in Edit1, I see that you still did not implement a state machine. From the comments I guess you don't know how to do that, so to push you in that direction here's one approach:

Define the states you need to work with:

type
  TReadState = (ReadingIdle, ReadingText, ReadingInt, ReadingFloat);
  // ReadingIdle, initial state or if no other state applies
  // ReadingText, needed to deal with strings that includes digits (P7..)
  // ReadingInt, state that collects the characters that form an integer
  // ReadingFloat, state that collects characters that form a float

Then define the skeleton of your state machine. To keep it as easy as possible I chose a straightforward procedural approach, with one main procedure and four subprocedures, one for each state.

procedure ParseString(const s: string; strings: TStrings);
var
  ix: integer;
  ch: Char;
  len: integer;
  str,           // to collect characters which form a value
  res: string;   // holds a final value if not empty
  State: TReadState;

  // subprocedures, one for each state
  procedure DoReadingIdle(ch: char; var str, res: string);
  procedure DoReadingText(ch: char; var str, res: string);
  procedure DoReadingInt(ch: char; var str, res: string);
  procedure DoReadingFloat(ch: char; var str, res: string);

begin
  State := ReadingIdle;
  len := Length(s);
  res := '';
  str := '';
  ix := 1;
  repeat
    ch := s[ix];
    case State of
      ReadingIdle:  DoReadingIdle(ch, str, res);
      ReadingText:  DoReadingText(ch, str, res);
      ReadingInt:   DoReadingInt(ch, str, res);
      ReadingFloat: DoReadingFloat(ch, str, res);
    end;
    if res <> '' then
    begin
      strings.Add(res);
      res := '';
    end;
    inc(ix);
  until ix > len;
  // if State is either ReadingInt or ReadingFloat, the input string
  // ended with a digit as final character of an integer, resp. float,
  // and we have a pending value to add to the list
  case State of
    ReadingInt: strings.Add(str + ' (integer)');
    ReadingFloat: strings.Add(str + ' (float)');
  end;
end;

That is the skeleton. The main logic is in the four state procedures.

  procedure DoReadingIdle(ch: char; var str, res: string);
  begin
    case ch of
      '0'..'9': begin
        str := ch;
        State := ReadingInt;
      end;
      ' ','.': begin
        str := '';
        // no state change
      end
      else begin
        str := ch;
        State := ReadingText;
      end;
    end;
  end;

  procedure DoReadingText(ch: char; var str, res: string);
  begin
    case ch of
      ' ','.': begin  // terminates ReadingText state
        str := '';
        State := ReadingIdle;
      end
      else begin
        str := str + ch;
        // no state change
      end;
    end;
  end;

  procedure DoReadingInt(ch: char; var str, res: string);
  begin
    case ch of
      '0'..'9': begin
        str := str + ch;
      end;
      '.': begin  // ok, seems we are reading a float
        str := str + ch;
        State := ReadingFloat;  // change state
      end;
      ' ',',': begin // end of int reading, set res
        res := str + ' (integer)';
        str := '';
        State := ReadingIdle;
      end;
    end;
  end;

  procedure DoReadingFloat(ch: char; var str, res: string);
  begin
    case ch of
      '0'..'9': begin
        str := str + ch;
      end;
      ' ','.',',': begin  // end of float reading, set res
        res := str + ' (float)';
        str := '';
        State := ReadingIdle;
      end;
    end;
  end;

The state procedures should be self-explanatory, but just ask if something is unclear.

Both your test strings result in the values listed as you specified. One of your rules was a little bit ambiguous and my interpretation might be wrong.

numbers cannot be preceded by a letter

The example you provided is "P7", and in your code you only checked the immediate previous character. But what if it would read "P71"? I interpreted it that "1" should be omitted just as the "7", even though the previous character of "1" is "7". This is the main reason for ReadingText state, which ends only on a space or period.

Tom Brunberg answered Oct 10 '22 06:10


Here's a solution using regex. I implemented it in Delphi (tested in 10.1, but it should also work with XE8). I'm sure you can adapt it for Lazarus; I'm just not sure which regex libraries work over there. The regex pattern uses alternation to match numbers as integers or floats following your rules:

Integer:

(\b\d+(?![.\d]))
  • started by a word boundary (so no letter, number or underscore before - if underscores are an issue you could use (?<![[:alnum:]]) instead)
  • then match one or more digits
  • that are neither followed by digit nor dot

Float:

(\b\d+(?:\.\d+)?)
  • started by a word boundary (so no letter, number or underscore before - if underscores are an issue you could use (?<![[:alnum:]]) instead)
  • then match one or more digits
  • optionally match dot followed by further digits

A simple console application looks like

program Test;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, RegularExpressions;

procedure ParseString(const Input: string);
var
  Match: TMatch;
begin
  WriteLn('---start---');
  Match := TRegex.Match(Input, '(\b\d+(?![.\d]))|(\b\d+(?:\.\d+)?)');
  while Match.Success do
  begin
    if Match.Groups[1].Value <> '' then
      writeln(Match.Groups[1].Value + '(Integer)')
    else
      writeln(Match.Groups[2].Value + '(Float)');
    Match := Match.NextMatch;
  end;
  WriteLn('---end---');
end;

begin
  ParseString('There are test values: P7 45.826.53.91.7, .5, 66.. 4 and 5.40.3.');
  ParseString('Anoth3r Te5.t string .4 abc 8.1Q 123.45.67.8.9');
  ReadLn;
end.
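
For Lazarus, one possible adaptation is sketched below, assuming the RegExpr unit (TRegExpr) that ships with Free Pascal is available. To avoid relying on lookahead support, it swaps the alternation pattern above for a simpler single pattern, \b\d+(\.\d*)?, and treats a match as a Float whenever the optional group captured something. That pattern and the program name are assumptions made for this sketch, not part of the Delphi solution above.

```pascal
program RegexSketch;

{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

procedure ParseString(const Input: string);
var
  Re: TRegExpr;
begin
  WriteLn('---start---');
  Re := TRegExpr.Create;
  try
    // Digits starting at a word boundary, optionally followed by a dot
    // and further digits. Group 1 is non-empty only when a dot was matched.
    Re.Expression := '\b\d+(\.\d*)?';
    if Re.Exec(Input) then
      repeat
        if Re.Match[1] <> '' then
          WriteLn(Re.Match[0], ' (Float)')
        else
          WriteLn(Re.Match[0], ' (Integer)');
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
  WriteLn('---end---');
end;

begin
  ParseString('There are test values: P7 45.826.53.91.7, .5, 66.. 4 and 5.40.3.');
  ParseString('Anoth3r Te5.t string .4 abc 8.1Q 123.45.67.8.9');
end.
```

Because the dot is part of the match here, a value such as 66. keeps its trailing dot and is reported as a Float, which is closer to the rule about a number followed by a bare decimal point.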

Sebastian Proske answered Oct 10 '22 07:10