Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the best algorithm for arbitrary delimiter/escape character processing?

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.

Here's the rules:

  1. You are starting with delimited/escaped data to split into an array.
  2. The delimiter is one arbitrary character
  3. The escape character is one arbitrary character
  4. Both the delimiter and the escape character could occur in data
  5. Regex is fine, but a good-performance solution is best
  6. Edit: Empty elements (including leading or ending delimiters) can be ignored

The code signature (in C# would be, basically)

public static string[] smartSplit(
                         string delimitedData, 
                         char delimiter, 
                         char escape) {}

The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,

Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.

like image 671
danieltalsky Avatar asked Mar 20 '09 20:03

danieltalsky


People also ask

What is escape delimiter?

An escape delimiter is simply a sequence that is recognized and ignored during parsing. Its purpose is to allow the use of escape sequences to embed byte sequences in data that would otherwise be seen as delimiter occurrences.

What is the use of \f escape sequence?

\f (Form Feed) This escape sequence is used for a form feed. Its ASCII value is 012.


1 Answers

A simple state machine is usually the easiest and fastest way. Example in Python:

def extract(input, delim, escape):
  # states
  parsing = 0
  escaped = 1

  state = parsing
  found = []
  parsed = ""

  for c in input:
    if state == parsing:
      if c == delim:
        found.append(parsed)
        parsed = ""
      elif c == escape:
        state = escaped
      else:
        parsed += c
    else: # state == escaped
       parsed += c
       state = parsing

  if parsed:
    found.append(parsed)

  return found
like image 156
sth Avatar answered Oct 21 '22 07:10

sth