Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to load a large CSV file into core data

Conclusion
Problem closed, I think.
Looks like the problem had nothing to do with the methodology, but that the XCode did not clean the project correctly in between builds.
It looks like after all those tests, the sqlite file that was being used was still the very first one that wasn't indexed......
Beware of XCode 4.3.2, I have nothing but problems with Clean not cleaning, or adding files to project not automatically being added to the bundle resources...
Thanks for the different answers..

Update 3
Since I invite anybody to just try the same steps to see if they get the same results, let me detail what I did:
I start with blank project
I defined a datamodel with one Entity, 3 attributes (2 strings, 1 float)
The first string is indexed
enter image description here

In did finishLaunchingWithOptions, I am calling:

[self performSelectorInBackground:@selector(populateDB) withObject:nil];

The code for populateDb is below:

-(void)populateDB{
NSLog(@"start");
NSPersistentStoreCoordinator *coordinator = [self persistentStoreCoordinator];
NSManagedObjectContext *context;
if (coordinator != nil) {
    context = [[NSManagedObjectContext alloc] init];
    [context setPersistentStoreCoordinator:coordinator];
}

NSString *filePath = [[NSBundle mainBundle] pathForResource:@"input" ofType:@"txt"];  
if (filePath) {  
    NSString * myText = [[NSString alloc]
                               initWithContentsOfFile:filePath
                               encoding:NSUTF8StringEncoding
                               error:nil];
    if (myText) {
        __block int count = 0;


        [myText enumerateLinesUsingBlock:^(NSString * line, BOOL * stop) {
            line=[line stringByReplacingOccurrencesOfString:@"\t" withString:@" "];
            NSArray *lineComponents=[line componentsSeparatedByString:@" "];
            if(lineComponents){
                if([lineComponents count]==3){
                    float f=[[lineComponents objectAtIndex:0] floatValue];
                    NSNumber *number=[NSNumber numberWithFloat:f];
                    NSString *string1=[lineComponents objectAtIndex:1];
                    NSString *string2=[lineComponents objectAtIndex:2];
                    NSManagedObject *object=[NSEntityDescription insertNewObjectForEntityForName:@"Bigram" inManagedObjectContext:context];
                    [object setValue:number forKey:@"number"];
                    [object setValue:string1 forKey:@"string1"];
                    [object setValue:string2 forKey:@"string2"];
                    NSError *error;
                    count++;
                    if(count>=1000){
                        if (![context save:&error]) {
                            NSLog(@"Whoops, couldn't save: %@", [error localizedDescription]);
                        }
                        count=0;

                    }
                }
            }



        }];
        NSLog(@"done importing");
        NSError *error;
        if (![context save:&error]) {
            NSLog(@"Whoops, couldn't save: %@", [error localizedDescription]);
        }

    }  
}
NSLog(@"end");
}

Everything else is default core data code, nothing added.
I run that in the simulator.
I go to ~/Library/Application Support/iPhone Simulator/5.1/Applications//Documents
There is the sqlite file that is generated

I take that and I copy it in my bundle

I comment out the call to populateDb

I edit persistentStoreCoordinator to copy the sqlite file from bundle to documents at first run

- (NSPersistentStoreCoordinator *)persistentStoreCoordinator 
{
@synchronized (self)
{
    if (__persistentStoreCoordinator != nil)
        return __persistentStoreCoordinator;

    NSString *defaultStorePath = [[NSBundle mainBundle] pathForResource:@"myProject" ofType:@"sqlite"];
    NSString *storePath = [[[self applicationDocumentsDirectory] path] stringByAppendingPathComponent: @"myProject.sqlite"];

    NSError *error;
    if (![[NSFileManager defaultManager] fileExistsAtPath:storePath]) 
    {
        if ([[NSFileManager defaultManager] copyItemAtPath:defaultStorePath toPath:storePath error:&error])
            NSLog(@"Copied starting data to %@", storePath);
        else 
            NSLog(@"Error copying default DB to %@ (%@)", storePath, error);
    }

    NSURL *storeURL = [NSURL fileURLWithPath:storePath];

    __persistentStoreCoordinator = [[NSPersistentStoreCoordinator alloc] initWithManagedObjectModel:[self managedObjectModel]];

    NSDictionary *options = [NSDictionary dictionaryWithObjectsAndKeys:
                             [NSNumber numberWithBool:YES], NSMigratePersistentStoresAutomaticallyOption,
                             [NSNumber numberWithBool:YES], NSInferMappingModelAutomaticallyOption, nil];

    if (![__persistentStoreCoordinator addPersistentStoreWithType:NSSQLiteStoreType configuration:nil URL:storeURL options:options error:&error]) 
    {

        NSLog(@"Unresolved error %@, %@", error, [error userInfo]);
        abort();
    }    

    return __persistentStoreCoordinator;
}    
}


I remove the app from the simulator, I check that ~/Library/Application Support/iPhone Simulator/5.1/Applications/ is now removed
I rebuild and launch again
As expected, the sqlite file is copied over to ~/Library/Application Support/iPhone Simulator/5.1/Applications//Documents

However the size of the file is smaller than in the bundle, significantly! Also, doing a simple query with a predicate like this predicate = [NSPredicate predicateWithFormat:@"string1 == %@", string1]; clearly shows that string1 is not indexed anymore

Following that, I create a new version of the datamodel, with a meaningless update, just to force a lightweight migration
If run on the simulator, the migration takes a few seconds, the database doubles in size and the same query now takes less than a second to return instead of minutes.
This would solve my problem, force a migration, but that same migration takes 3 minutes on the iPad and happens in the foreground.
So hat's where I am at right now, the best solution for me would still be to prevent the indexes to be removed, any other importing solution at launch time just takes too much time.
Let me know if you need more clarifications...

Update 2
So the best result I have had so far is to seed the core data database with the sqlite file produced from a quick tool with similar data model, but without the indexes set when producing the sqlite file. Then, I import this sqlite file in the core data app with the indexes set, and allowing for a lightweight migration. For 2 millions record on the new iPad, this migration stills take 3 minutes. The final app should have 5 times this number of records, so we're still looking at a long long processing time. If I go that route, the new question would be: can a lightweight migration be performed in the background?

Update
My question is NOT how to create a tool to populate a Core Data database, and then import the sqlite file into my app.
I know how to do this, I have done it countless times.
But until now, I had not realized that such method could have some side effect: in my case, an indexed attribute in the resulting database clearly got 'unindexed' when importing the sqlite file that way.
If you were able to verify that any indexed data is still indexed after such transfer, I am interested to know how you proceed, or otherwise what would be the best strategy to seed such database efficiently.

Original

I have a large CSV file (millions of lines) with 4 columns, strings and floats. This is for an iOS app.

I need this to be loaded into core data the first time the app is loaded.

The app is pretty much non functional until the data is available, so loading time matters, as a first time user obviously does not want the app to take 20 minutes to load before being able to run it.

Right now, my current code takes 20 min on the new iPad to process a 2 millions line csv file.

I am using a background context to not lock the UI, and save the context every 1,000 records

The first idea I had was to generate the database on the simulator, then to copy/paste it in the document folder at first launch, as this is the common non official way of seeding a large database. Unfortunately, the indexes don't seem to survive such a transfer, and although the database was available after just a few seconds, performance is terrible because my indexes were lost. I posted a question about the indexes already, but there doesn't seem to be a good answer to that.

So what I am looking for, either:

  • a way to improve performance on loading millions of records in core data
  • if the database is pre-loaded and moved at first startup, a way to keep my indexes
  • best practices for handling this kind of scenario. I don't remember using any app that requires me to wait for x minutes before first use (but maybe The Daily, and that was a terrible experience).
  • Any creative way to make the user wait without him realizing it: background import while going through tutorial, etc...
  • Not Using Core Data?
  • ...
like image 851
JP Hribovsek Avatar asked May 04 '12 06:05

JP Hribovsek


1 Answers

Pre-generate your database using an offline application (say, a command-line utility) written in Cocoa, that runs on OS X, and uses the same Core Data framework that iOS uses. You don't need to worry about "indexes surviving" or anything -- the output is a Core Data-generated .sqlite database file, directly and immediately usable by an iOS app.

As long as you can do the DB generation off-line, it's the best solution by far. I have successfully used this technique to pre-generated databases for iOS deployment myself. Check my previous questions/answers for a bit more detail.

like image 106
Shaggy Frog Avatar answered Oct 27 '22 07:10

Shaggy Frog