
Replace chars in a HTML string - Except Tags

I need to go through an HTML string and replace every character with 0 (zero), except tags, spaces and line breaks. I wrote the code below, but it is too slow. Can someone please help me make it faster (optimize it)?

procedure TForm1.btn1Click(Sender: TObject);
var
  Txt: String;
  Idx: Integer;
  Tag: Boolean;
begin
  Tag := False;
  Txt := mem1.Text;
  For Idx := 1 to Length(Txt) Do // Delphi strings are 1-based
  Begin
    If (Txt[Idx] = '<') Then
      Tag := True Else
    If (Txt[Idx] = '>') Then
    Begin
      Tag := False;
      Continue;
    end;
    If Tag Then Continue;
    If (not (Txt[Idx] in [#10, #13, #32])) Then
      Txt[Idx] := '0';
  end;
  mem2.Text := Txt;
end;

The HTML text will never have "<" or ">" outside tags (in the middle of text), so I do not need to worry about this.

Thank you!

Guybrush asked Dec 27 '22 04:12



1 Answer

That looks pretty straightforward. It's hard to be sure without profiling the code against the data you're using (which is always a good idea; if you need to optimize Delphi code, try running it through Sampling Profiler first to get an idea of where you're actually spending your time), but if I had to make an educated guess, I'd say your bottleneck is this line:

Txt[Idx] := '0';

As part of the compiler's guarantee of safe copy-on-write semantics for the string type, every write to an individual element (character) of a string involves a hidden call to the UniqueString routine. This makes sure that you're not changing a string that something else, somewhere else, holds a reference to.
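To make the copy-on-write behaviour concrete, here is a minimal console sketch. The program and variable names are made up for illustration; the point is only that writing to one character of B forces a private copy, so A is left alone:

```pascal
program CowDemo; // hypothetical demo program, not part of the question's code
{$APPTYPE CONSOLE}
var
  A, B: string;
begin
  A := StringOfChar('a', 3); // heap-allocated, reference-counted string
  B := A;                    // B shares A's buffer; nothing is copied yet
  B[1] := 'x';               // compiler inserts a hidden UniqueString(B) here,
                             // which copies the buffer before the write
  Writeln(A);                // prints aaa
  Writeln(B);                // prints xaa
end.
```

That hidden call is cheap when the string is already unique, but in a loop over every character of a large string it still runs once per write.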

In this particular case, that's not necessary, because you got the string fresh at the start of this routine and you know it's unique. There's a way around it, if you're careful.

CLEAR AND UNAMBIGUOUS WARNING: Do not do what I'm about to explain without making sure you have a unique string first! The easiest way to accomplish this is to call UniqueString manually. Also, do not do anything during the loop that could assign this string to any other variable. While we're doing this, it's not being treated as a normal string. Failure to heed this warning can cause data corruption.

OK, now that that's been explained, you can use a pointer to access the characters of the string directly, and get around the compiler's safeguards, like so:

procedure TForm1.btn1Click(Sender: TObject);
var
  Txt: String;
  Idx: Integer;
  Tag: Boolean;
  current: PChar; //pointer to a character
begin
  Tag := False;
  Txt := mem1.Text;
  UniqueString(txt); //very important
  if length(txt) = 0 then
    Exit; //If you don't check this, the next line will raise an AV on a blank string
  current := @txt[1];
  dec(current); //you need to start before element 1, but the compiler won't let you
                //assign to element 0
  For Idx := 0 to Length(Txt) - 1 Do
  Begin
    inc(current); //put this at the top of the loop, to handle Continue cases correctly
    If (current^ = '<') Then
      Tag := True Else
    If (current^ = '>') Then
    Begin
      Tag := False;
      Continue;
    end;
    If Tag Then Continue;
    If (not (current^ in [#10, #13, #32])) Then
      current^ := '0';
  end;
  mem2.Text := Txt;
end;

This changes the metaphor. Instead of indexing into the string as an array, we're treating it like a tape, with the pointer as the head, moving forward one character at a time, scanning from beginning to end, and changing the character under it when appropriate. No redundant calls to UniqueString, and no repeatedly calculating offsets, which means this can be a lot faster.

Be very careful when using pointers like this. The compiler's safety checks are there for a good reason, and using pointers steps outside of them. But sometimes, they can really help speed things up in your code. And again, profile before trying anything like this. Make sure that you know what's slowing things down, instead of just thinking you know. If it turns out to be something else that's running slow, don't do this; find a solution to the real problem instead.
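If you don't have a profiler handy, a rough stopwatch measurement is a reasonable first check. The following is only a sketch under some assumptions: MaskText is a hypothetical helper that pulls the same logic out of the button handler so it can be timed without the UI, the test input is synthetic, and the unit names assume a Delphi version with namespaced units (System.SysUtils, System.StrUtils, System.Diagnostics):

```pascal
program MaskTiming; // hypothetical timing harness
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.StrUtils, System.Diagnostics;

// Hypothetical helper: same masking rules as the handler above, minus the memos.
function MaskText(const Source: string): string;
var
  Current: PChar;
  Idx: Integer;
  InTag: Boolean;
begin
  Result := Source;
  if Length(Result) = 0 then
    Exit;
  UniqueString(Result);     // make sure we own the buffer before writing through a pointer
  Current := PChar(Result); // points at the first character
  InTag := False;
  for Idx := 1 to Length(Result) do
  begin
    case Current^ of
      '<': InTag := True;   // tag delimiters are never replaced
      '>': InTag := False;
      #10, #13, #32: ;      // keep line breaks and spaces
    else
      if not InTag then
        Current^ := '0';
    end;
    Inc(Current);
  end;
end;

var
  Input, Output: string;
  Watch: TStopwatch;
begin
  // Synthetic input, roughly 2 million characters; use your real HTML instead.
  Input := DupeString('<p>Hello world</p>'#13#10, 100000);
  Watch := TStopwatch.StartNew;
  Output := MaskText(Input);
  Watch.Stop;
  Writeln(Format('%d chars in %d ms', [Length(Output), Watch.ElapsedMilliseconds]));
end.
```

Timing the indexed version the same way, and timing the mem1.Text read and mem2.Text write separately, should show whether the loop is really where the time goes.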

Mason Wheeler answered Jan 18 '23 20:01
