Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize String Parsing

I have a requirement to parse data files in "txf" format. The files may contain more than 1000 entries. Since the format is well defined like JSON, I wanted to make a generic parser like JSON, which can serialise and deserialise txf files.

On contrary to JSON, the mark up doesn't have a way to identify an object or an array. If an entry with same tag occurs, we need to consider it as an array.

  1. # Marks the start of an object.
  2. $ Marks the members of an object
  3. / Marks the end of an object

Following is a sample "txf" file

#Employees
$LastUpdated=2015-02-01 14:01:00
#Employee
$Id=1
$Name=Employee 01
#Departments
$LastUpdated=2015-02-01 14:01:00
#Department
$Id=1
$Name=Department Name
/Department
/Departments
/Employee
#Employee
/Employee
/Employees

I was able to create a generic TXF Parser using NSScanner. But with more entries the performance needs more tweaking.

I wrote the foundation object obtained as plist and compared its performance again the parser I wrote. My parser is around 10 times slower than plist parser.

While plist file size is 5 times more than txf and has more markup characters, I feel that there is a lot of room for optimization.

Any help in that direction is highly appreciated.

EDIT : Including the parsing code

static NSString *const kArray    = @"TXFArray";
static NSString *const kBodyText = @"TXFText";

@interface TXFParser ()

/*Temporary variable to hold values of an object*/
@property (nonatomic, strong) NSMutableDictionary *dict;

/*An array to hold the hierarchial data of all nodes encountered while parsing*/
@property (nonatomic, strong) NSMutableArray *stack;

@end

@implementation TXFParser

#pragma mark - Getters

- (NSMutableArray *)stack{
    if (!_stack) {
        _stack = [NSMutableArray new];
    }return _stack;
}

#pragma mark -

- (id)objectFromString:(NSString *)txfString{
    [txfString enumerateLinesUsingBlock:^(NSString *string, BOOL *stop) {
        if ([string hasPrefix:@"#"]) {
            [self didStartParsingTag:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"$"]){
            [self didFindKeyValuePair:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"/"]){
            [self didEndParsingTag:[string substringFromIndex:1]];
        }else{
            //[self didFindBodyValue:string];
        }
    }]; return self.dict;
}

#pragma mark -

- (void)didStartParsingTag:(NSString *)tag{
    [self parserFoundObjectStartForKey:tag];
}

- (void)didFindKeyValuePair:(NSString *)tag{
    NSArray *components = [tag componentsSeparatedByString:@"="];
    NSString *key = [components firstObject];
    NSString *value = [components lastObject];

    if (key.length) {
        self.dict[key] = value?:@"";
    }
}

- (void)didFindBodyValue:(NSString *)bodyString{
    if (!bodyString.length) return;
    bodyString = [bodyString stringByTrimmingCharactersInSet:[NSCharacterSet illegalCharacterSet]];
    if (!bodyString.length) return;

    self.dict[kBodyText] = bodyString;
}

- (void)didEndParsingTag:(NSString *)tag{
    [self parserFoundObjectEndForKey:tag];
}

#pragma mark -

- (void)parserFoundObjectStartForKey:(NSString *)key{
    self.dict = [NSMutableDictionary new];
    [self.stack addObject:self.dict];
}

- (void)parserFoundObjectEndForKey:(NSString *)key{
    NSDictionary *dict = self.dict;

    //Remove the last value of stack
    [self.stack removeLastObject];

    //Load the previous object as dict
    self.dict = [self.stack lastObject];

    //The stack has contents, then we need to append objects
    if ([self.stack count]) {
        [self addObject:dict forKey:key];
    }else{
        //This is root object,wrap with key and assign output
        self.dict = (NSMutableDictionary *)[self wrapObject:dict withKey:key];
    }
}

#pragma mark - Add Objects after finding end tag

- (void)addObject:(id)dict forKey:(NSString *)key{
    //If there is no value, bailout
    if (!dict) return;

    //Check if the dict already has a value for key array.
    NSMutableArray *array =  self.dict[kArray];

    //If array key is not found look for another object with same key
    if (array) {
        //Array found add current object after wrapping with key
        NSDictionary *currentDict = [self wrapObject:dict withKey:key];
        [array addObject:currentDict];
    }else{
        id prevObj = self.dict[key];
        if (prevObj) {
            /*
             There is a prev value for the same key. That means we need to wrap that object in a collection.
             1. Remove the object from dictionary,
             2. Wrap it with its key
             3. Add the prev and current value to array
             4. Save the array back to dict
             */
            [self.dict removeObjectForKey:key];
            NSDictionary *prevDict = [self wrapObject:prevObj withKey:key];
            NSDictionary *currentDict = [self wrapObject:dict withKey:key];
            self.dict[kArray] = [@[prevDict,currentDict] mutableCopy];

        }else{
            //Simply add object to dict
            self.dict[key] = dict;
        }
    }
}

/*Wraps Object with a key for the serializer to generate txf tag*/
- (NSDictionary *)wrapObject:(id)obj withKey:(NSString *)key{
    if (!key ||!obj) {
        return @{};
    }
    return @{key:obj};
}

EDIT 2:

A sample TXF file with more than 1000 entries.

like image 448
Anupdas Avatar asked Feb 01 '15 08:02

Anupdas


1 Answers

Have you considered using pull-style reads & recursive processing? That would eliminate reading the whole file into memory and also eliminate managing some own stack to keep track how deep you're parsing.

Below an example in Swift. The example works with your sample "txf", but not with the dropbox version; some of your "members" span over multiple lines. If this is a requirement, it can easily be implemented into switch/case "$" section. However, I don't see your own code handling this either. Also, the example doesn't follow the correct Swift error handling yet (the parse method would need an additional NSError parameter)

import Foundation

extension String
{
    public func indexOfCharacter(char: Character) -> Int? {
        if let idx = find(self, char) {
            return distance(self.startIndex, idx)
        }
        return nil
    }

    func substringToIndex(index:Int) -> String {
        return self.substringToIndex(advance(self.startIndex, index))
    }
    func substringFromIndex(index:Int) -> String {
        return self.substringFromIndex(advance(self.startIndex, index))
    }
}


func parse(aStreamReader:StreamReader, parentTagName:String) -> Dictionary<String,AnyObject> {
    var dict = Dictionary<String,AnyObject>()

    while let line = aStreamReader.nextLine() {

        let firstChar = first(line)
        let theRest = dropFirst(line)

        switch firstChar! {
        case "$":
            if let idx = theRest.indexOfCharacter("=") {
                let key = theRest.substringToIndex(idx)
                let value = theRest.substringFromIndex(idx+1)

                dict[key] = value
            } else {
                println("no = sign")
            }
        case "#":
            let subDict = parse(aStreamReader,theRest)

            var list = dict[theRest] as? [Dictionary<String,AnyObject>]
            if list == nil {
                dict[theRest] = [subDict]
            } else {
                list!.append(subDict)
            }
        case "/":
            if theRest != parentTagName {
                println("mismatch... [\(theRest)] != [\(parentTagName)]")
            } else {
                return dict
            }
        default:
            println("mismatch... [\(line)]")
        }
    }

    println("shouldn't be here...")
    return dict
}


var data : Dictionary<String,AnyObject>?

if let aStreamReader = StreamReader(path: "/Users/taoufik/Desktop/QuickParser/QuickParser/file.txf") {

    if var line = aStreamReader.nextLine() {
        let tagName = line.substringFromIndex(advance(line.startIndex, 1))
        data = parse(aStreamReader, tagName)
    }

    aStreamReader.close()
}

println(JSON(data!))

And the StreamReader was borrowed from https://stackoverflow.com/a/24648951/95976

Edit

  • see full code https://github.com/tofi9/QuickParser
  • pull-style line-by-line read in objective-c: How to read data from NSFileHandle line by line?

Edit 2

I rewrote the above in C++11 and got it to run in less than 0.05 seconds (release mode) on a 2012 MBA I5 using the updated file on dropbox. I suspect NSDictionary and NSArray must have some penalty. The code below can be compiled into an objective-c project (file needs have extension .mm):

#include <iostream>
#include <sstream>
#include <string>
#include <fstream>
#include <map>
#include <vector>

using namespace std;


class benchmark {

private:
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::milliseconds milliseconds;

    clock::time_point start;

public:
    benchmark(bool startCounting = true) {
        if(startCounting)
            start = clock::now();
    }

    void reset() {
        start = clock::now();
    }

    double elapsed() {
        milliseconds ms = std::chrono::duration_cast<milliseconds>(clock::now() - start);
        double elapsed_secs = ms.count() / 1000.0;
        return elapsed_secs;
    }
};

struct obj {
    map<string,string> properties;
    map<string,vector<obj>> subObjects;
};

obj parse(ifstream& stream, string& parentTagName) {
    obj obj;
    string line;
    while (getline(stream, line))
    {
        auto firstChar = line[0];
        auto rest = line.substr(1);

        switch (firstChar) {
            case '$': {
                auto idx = rest.find_first_of('=');

                if (idx == -1) {
                    ostringstream o;
                    o << "no = sign: " << line;
                    throw o.str();
                }
                auto key = rest.substr(0,idx);
                auto value = rest.substr(idx+1);
                obj.properties[key] = value;
                break;
            }
            case '#': {
                auto subObj = parse(stream, rest);
                obj.subObjects[rest].push_back(subObj);
                break;
            }
            case '/':
                if(rest != parentTagName) {
                    ostringstream o;
                    o << "mismatch end of object " << rest << " != " << parentTagName;
                    throw o.str();
                } else {
                    return obj;
                }
                break;
            default:
                ostringstream o;
                o << "mismatch line " << line;
                throw o.str();
                break;
        }

    }

    throw "I don't know why I'm here. Probably because the file is missing an end of object marker";
}


void visualise(obj& obj, int indent = 0) {
    for(auto& property : obj.properties) {
        cout << string(indent, '\t') << property.first << " = " << property.second << endl;
    }

    for(auto& subObjects : obj.subObjects) {
        for(auto& subObject : subObjects.second) {
            cout << string(indent, '\t') << subObjects.first << ": " << endl;
            visualise(subObject, indent + 1);
        }
    }
}

int main(int argc, const char * argv[]) {
    try {
        obj result;

        benchmark b;
        ifstream stream("/Users/taoufik/Desktop/QuickParser/QuickParser/Members.txf");
        string line;
        if (getline(stream, line))
        {
            string tagName = line.substr(1);
            result = parse(stream, tagName);
        }

        cout << "elapsed " << b.elapsed() <<  " ms" << endl;

        visualise(result);

    }catch(string s) {
        cout << "error " << s;
    }

    return 0;
}

Edit 3

See link for full code C++: https://github.com/tofi9/TxfParser

like image 165
tofi9 Avatar answered Oct 07 '22 15:10

tofi9