Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse semi-structured values

it's my first question here. I tried to find an answer but couldn't, honestly, figure out which terms should I use, so sorry if it has been asked before.

Here it goes: I have thousands of records in a .txt file, in this format:

(1, 3, 2, 1, 'John (Finances)'),
(2, 7, 2, 1, 'Mary Jane'),
(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),

... and so on. The first value is the PK, the other 3 are Foreign Keys, the 5th is a string.

I need to parse them as JSON (or something) in Javascript, but I'm having troubles because some strings have parentheses+comma (on 3rd record, "Janitor", e.g.), so I can't use substring... maybe trimming the right part, but I was wondering if there is some smarter way to parse it.

Any help would be really appreciated.

Thanks!

like image 406
Fabio Avatar asked Jun 16 '14 21:06

Fabio


2 Answers

You can't (read probably shouldn't) use a regular expression for this. What if the parentheses contain another pair or one is mismatched?

The good news is that you can easily construct a tokenizer/parser for this. The idea is to keep track of your current state and act accordingly.

Here is a sketch for a parser I've just written here, the point is to show you the general idea. Let me know if you have any conceptual questions about it.

It works demo here but I beg you not to use it in production before understanding and patching it.


How it works

So, how do we build a parser:

var State = { // remember which state the parser is at.
     BeforeRecord:0, // at the (
     DuringInts:1, // at one of the integers
     DuringString:2, // reading the name string
     AfterRecord:3 // after the )
};

We'll need to keep track of the output, and the current working object since we'll parse these one at a time.

var records = []; // to contain the results
var state = State.BeforeRecord;

Now, we iterate the string, keep progressing in it and read the next character

for(var i = 0;i < input.length; i++){
    if(state === State.BeforeRecord){
        // handle logic when in (
    }
    ...
    if(state === State.AfterRecord){
        // handle that state
    }
}

Now, all that's left is to consume it into the object at each state:

  • If it's at ( we start parsing and skip any whitespaces
  • Read all the integers and ditch the ,
  • After four integers, read the string from ' to the next ' reaching the end of it
  • After the string, read until the ) , store the object, and start the cycle again.

The implementation is not very difficult too.


The parser

var State = { // keep track of the state
     BeforeRecord:0,
     DuringInts:1,
     DuringString:2,
     AfterRecord:3
};
var records = []; // to contain the results
var state = State.BeforeRecord;
var input = " (1, 3, 2, 1, 'John (Finances)'), (2, 7, 2, 1, 'Mary Jane'), (3, 7, 3, 2, 'Gerald (Janitor), Broflowski')," // sample input

var workingRecord = {}; // what we're reading into.

for(var i = 0;i < input.length; i++){
    var token = input[i]; // read the current input
    if(state === State.BeforeRecord){ // before reading a record
        if(token === ' ') continue; // ignore whitespaces between records
        if(token === '('){ state = State.DuringInts; continue; }
        throw new Error("Expected ( before new record");
    }
    if(state === State.DuringInts){
        if(token === ' ') continue; // ignore whitespace
        for(var j = 0; j < 4; j++){
            if(token === ' ') {token = input[++i]; j--; continue;} // ignore whitespace 
             var curNum = '';
             while(token != ","){
                  if(!/[0-9]/.test(token)) throw new Error("Expected number, got " + token);
                  curNum += token;
                  token = input[++i]; // get the next token
             }
             workingRecord[j] = Number(curNum); // set the data on the record
             token = input[++i]; // remove the comma
        }
        state = State.DuringString;
        continue; // progress the loop
    }
    if(state === State.DuringString){
         if(token === ' ') continue; // skip whitespace
         if(token === "'"){
             var str = "";
             token = input[++i];
             var lenGuard = 1000;
             while(token !== "'"){
                 str+=token;
                 if(lenGuard-- === 0) throw new Error("Error, string length bounded by 1000");
                 token = input[++i];
             }
             workingRecord.str = str;
             token = input[++i]; // remove )
             state = State.AfterRecord;
             continue;
         }
    }
    if(state === State.AfterRecord){
        if(token === ' ') continue; // ignore whitespace
        if(token === ',') { // got the "," between records
            state = State.BeforeRecord;
            records.push(workingRecord);
            workingRecord = {}; // new record;
            continue;
        }
        throw new Error("Invalid token found " + token);
    }
}
console.log(records); // logs [Object, Object, Object]
                      // each object has four numbers and a string, for example
                      // records[0][0] is 1, records[0][1] is 3 and so on,
                      // records[0].str is "John (Finances)"
like image 184
Benjamin Gruenbaum Avatar answered Oct 24 '22 03:10

Benjamin Gruenbaum


I echo Ben's sentiments about regular expressions usually being bad for this, and I completely agree with him that tokenizers are the best tool here.

However, given a few caveats, you can use a regular expression here. This is because any ambiguities in your (, ), , and ' can be attributed (AFAIK) to your final column; as all of the other columns will always be integers.

So, given:

  1. The input is perfectly formed (with no unexpected (, ), , or ').
  2. Each record is on a new line, per your edit
  3. The only new lines in your input will be to break to the next record

... the following should work (Note "new lines" here are \n. If they're \r\n, change them accordingly):

var input = /* Your input */;
var output = input.split(/\n/g).map(function (cols) {
    cols = cols.match(/^\((\d+), (\d+), (\d+), (\d+), '(.*)'\)/).slice(1);

    return cols.slice(0, 4).map(Number).concat(cols[4]);
});

The code splits on new lines, then goes through row by row and splits into cells using a regular expression, which greedily attributes as much as it can to the final cell. It then turns the first 4 elements into integers, and sticks the 5th element (the string) onto the end.

This gives you an array of records, where each record is itself an array. The first 4 elements are your PK's (as integers) and your 5th element is the string.

For example, given your input, use output[0][4] to get "Gerald (Janitor), Broflowski", and output[1][0] to get the first PK 2 for the second record (don't forget JavaScript arrays are zero-indexed).

You can see it working here: http://jsfiddle.net/56ThR/

like image 35
Matt Avatar answered Oct 24 '22 02:10

Matt