it's my first question here. I tried to find an answer but couldn't, honestly, figure out which terms should I use, so sorry if it has been asked before.
Here it goes: I have thousands of records in a .txt file, in this format:
(1, 3, 2, 1, 'John (Finances)'),
(2, 7, 2, 1, 'Mary Jane'),
(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),
... and so on. The first value is the PK, the other 3 are Foreign Keys, the 5th is a string.
I need to parse them as JSON (or something) in Javascript, but I'm having troubles because some strings have parentheses+comma (on 3rd record, "Janitor", e.g.), so I can't use substring... maybe trimming the right part, but I was wondering if there is some smarter way to parse it.
Any help would be really appreciated.
Thanks!
You can't (read probably shouldn't) use a regular expression for this. What if the parentheses contain another pair or one is mismatched?
The good news is that you can easily construct a tokenizer/parser for this. The idea is to keep track of your current state and act accordingly.
Here is a sketch for a parser I've just written here, the point is to show you the general idea. Let me know if you have any conceptual questions about it.
It works demo here but I beg you not to use it in production before understanding and patching it.
So, how do we build a parser:
var State = { // remember which state the parser is at.
BeforeRecord:0, // at the (
DuringInts:1, // at one of the integers
DuringString:2, // reading the name string
AfterRecord:3 // after the )
};
We'll need to keep track of the output, and the current working object since we'll parse these one at a time.
var records = []; // to contain the results
var state = State.BeforeRecord;
Now, we iterate the string, keep progressing in it and read the next character
for(var i = 0;i < input.length; i++){
if(state === State.BeforeRecord){
// handle logic when in (
}
...
if(state === State.AfterRecord){
// handle that state
}
}
Now, all that's left is to consume it into the object at each state:
(
we start parsing and skip any whitespaces,
'
to the next '
reaching the end of it)
, store the object, and start the cycle again.The implementation is not very difficult too.
var State = { // keep track of the state
BeforeRecord:0,
DuringInts:1,
DuringString:2,
AfterRecord:3
};
var records = []; // to contain the results
var state = State.BeforeRecord;
var input = " (1, 3, 2, 1, 'John (Finances)'), (2, 7, 2, 1, 'Mary Jane'), (3, 7, 3, 2, 'Gerald (Janitor), Broflowski')," // sample input
var workingRecord = {}; // what we're reading into.
for(var i = 0;i < input.length; i++){
var token = input[i]; // read the current input
if(state === State.BeforeRecord){ // before reading a record
if(token === ' ') continue; // ignore whitespaces between records
if(token === '('){ state = State.DuringInts; continue; }
throw new Error("Expected ( before new record");
}
if(state === State.DuringInts){
if(token === ' ') continue; // ignore whitespace
for(var j = 0; j < 4; j++){
if(token === ' ') {token = input[++i]; j--; continue;} // ignore whitespace
var curNum = '';
while(token != ","){
if(!/[0-9]/.test(token)) throw new Error("Expected number, got " + token);
curNum += token;
token = input[++i]; // get the next token
}
workingRecord[j] = Number(curNum); // set the data on the record
token = input[++i]; // remove the comma
}
state = State.DuringString;
continue; // progress the loop
}
if(state === State.DuringString){
if(token === ' ') continue; // skip whitespace
if(token === "'"){
var str = "";
token = input[++i];
var lenGuard = 1000;
while(token !== "'"){
str+=token;
if(lenGuard-- === 0) throw new Error("Error, string length bounded by 1000");
token = input[++i];
}
workingRecord.str = str;
token = input[++i]; // remove )
state = State.AfterRecord;
continue;
}
}
if(state === State.AfterRecord){
if(token === ' ') continue; // ignore whitespace
if(token === ',') { // got the "," between records
state = State.BeforeRecord;
records.push(workingRecord);
workingRecord = {}; // new record;
continue;
}
throw new Error("Invalid token found " + token);
}
}
console.log(records); // logs [Object, Object, Object]
// each object has four numbers and a string, for example
// records[0][0] is 1, records[0][1] is 3 and so on,
// records[0].str is "John (Finances)"
I echo Ben's sentiments about regular expressions usually being bad for this, and I completely agree with him that tokenizers are the best tool here.
However, given a few caveats, you can use a regular expression here. This is because any ambiguities in your (
, )
, ,
and '
can be attributed (AFAIK) to your final column; as all of the other columns will always be integers.
So, given:
(
, )
, ,
or '
).... the following should work (Note "new lines" here are \n
. If they're \r\n
, change them accordingly):
var input = /* Your input */;
var output = input.split(/\n/g).map(function (cols) {
cols = cols.match(/^\((\d+), (\d+), (\d+), (\d+), '(.*)'\)/).slice(1);
return cols.slice(0, 4).map(Number).concat(cols[4]);
});
The code splits on new lines, then goes through row by row and splits into cells using a regular expression, which greedily attributes as much as it can to the final cell. It then turns the first 4 elements into integers, and sticks the 5th element (the string) onto the end.
This gives you an array of records, where each record is itself an array. The first 4 elements are your PK's (as integers) and your 5th element is the string.
For example, given your input, use output[0][4]
to get "Gerald (Janitor), Broflowski"
, and output[1][0]
to get the first PK 2
for the second record (don't forget JavaScript arrays are zero-indexed).
You can see it working here: http://jsfiddle.net/56ThR/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With