I am bulk inserting into Core Data. I have a Person object, and this object has a to-many relationship called "otherPeople" that is an NSSet of people. When bulk inserting data from a download, everything is fast until about 10,000 people have been read in, at which point the insert speed slows to a crawl. I save and reset my NSManagedObjectContext every 500 inserts.
If I comment out the part that inserts the "otherPerson" relationships, the bulk insert stays fast through the entire download. parseJSON is called with batches of 500 JSONKit dictionaries.
Any ideas what might be causing this? Are there possible solutions?
Code:
- (NSArray*) getPeople:(NSArray*)ids
{
    NSFetchRequest* request = [[[NSFetchRequest alloc] init] autorelease];
    NSEntityDescription* entityDescription = [NSEntityDescription entityForName:@"Person" inManagedObjectContext:context];
    [request setEntity:entityDescription];
    [request setFetchBatchSize:ids.count];
    // Filter by array of ids
    NSPredicate* predicate = [NSPredicate predicateWithFormat:@"externalId IN %@", ids];
    [request setPredicate:predicate];
    // Initialize the error pointer so it never holds garbage
    NSError* error = nil;
    NSArray* people = [context executeFetchRequest:request error:&error];
    if (people == nil)
    {
        NSLog(@"Fetching people failed: %@", error);
    }
    return people;
}
- (void) parseJSON:(NSArray*)people
{
    NSAutoreleasePool* pool = [[NSAutoreleasePool alloc] init];
    // an array, to match the declared type and the NSArray* parameter of getPeople:
    NSMutableArray* idsToFetch = [NSMutableArray arrayWithCapacity:CHUNK_SIZE * 3];
    NSMutableDictionary* existingPeople = [NSMutableDictionary dictionaryWithCapacity:CHUNK_SIZE * 3];
    // populate the existing people dictionary first, that way we know who is already in the context
    // without having to do a fetch for each person in the array (externalId IS indexed)
    for (NSDictionary* personDictionary in people)
    {
        // uses JSON kit to parse out all the external ids...
        [PersonJSON addExternalIdsToArray:idsToFetch fromDictionary:personDictionary];
    }
    // see above code for getPeople implementation...
    NSArray* existingPeopleArray = [self getPeople:idsToFetch];
    for (Person* p in existingPeopleArray)
    {
        [existingPeople setObject:p forKey:p.externalId];
    }
    for (NSDictionary* personDictionary in people)
    {
        NSString* externalId = [personDictionary objectForKey:@"PersonId"];
        Person* person = [existingPeople objectForKey:externalId];
        if (person == nil)
        {
            // the person was not in the context, make a new person in the context
            person = [[self newPerson] autorelease];
            // set externalId (not ancestryId) so the dictionary key below is never nil
            person.externalId = externalId;
            [existingPeople setObject:person forKey:person.externalId];
        }
        // use JSON kit to populate the core data object...
        [PersonJSON populatePerson:person withDictionary:personDictionary inContext:[self context]];
        // these are just objects that contain an externalId, showing that the link hasn't been setup yet
        for (UnresolvedOtherPerson* other in person.unresolvedOtherPeople)
        {
            Person* relatedPerson = [existingPeople objectForKey:other.externalId];
            if (relatedPerson == nil)
            {
                relatedPerson = [[self newPerson] autorelease];
                relatedPerson.externalId = other.externalId;
                [existingPeople setObject:relatedPerson forKey:relatedPerson.externalId];
            }
            // add link - if I comment out this line, everything runs very fast
            // if I don't comment out, things slow down gradually and then exponentially
            [person addOtherPersonsObject:relatedPerson];
        }
        self.downloaded++;
    }
    [pool drain];
}
Adding an object to a relationship causes both sides of the relationship to fire. So if you have A <<->> B and you add a freshly created A object to a B object that already has a relationship with 100,000 A objects, Core Data will fetch those 100,000 objects from the store to fulfill the relationship before adding the new one.
Because you reset the NSManagedObjectContext every so often, all 100,000 objects that Core Data loaded to fulfill the relationship have to be reloaded after each reset, which makes the process extremely slow.
One way to work around this problem is a two-step import process. First, load all the objects into the database without establishing any relationships, but keep track of which relationships need to be added. Once this fast import is done, go back to the database, add the relationships, and clear the context in a way that keeps Core Data from having to reload the relationships too often. As a concrete example, if you need to import 1 million As that need to be associated with 100 Bs, first import all the As; then, for each of the hundred Bs, load the relationship once, add all of its As to it, clear the context, and move on to the next B. The key is to prevent the context from resetting away the records that it just painfully loaded.
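The second pass of the two-step import described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the entity names "A" and "B", the "externalId" attribute on both, the to-many relationship key "as" on B, the helper name LinkPendingRelationships, and the shape of pendingLinks (a B externalId mapped to an array of A externalIds recorded during the first pass) are all hypothetical and not taken from the question's model.

```objc
#import <CoreData/CoreData.h>

// Hypothetical second pass: pendingLinks maps each B externalId to an
// NSArray of the A externalIds that should be linked to that B.
static BOOL LinkPendingRelationships(NSManagedObjectContext *context,
                                     NSDictionary *pendingLinks,
                                     NSError **error)
{
    for (NSString *bId in pendingLinks) {
        // Fetch the single B that owns this batch of links.
        NSFetchRequest *bRequest = [NSFetchRequest fetchRequestWithEntityName:@"B"];
        bRequest.predicate = [NSPredicate predicateWithFormat:@"externalId == %@", bId];
        bRequest.fetchLimit = 1;
        NSArray *bs = [context executeFetchRequest:bRequest error:error];
        if (bs == nil) {
            return NO; // fetch failed, *error describes why
        }
        NSManagedObject *b = [bs lastObject];
        if (b == nil) {
            continue; // no such B was imported in the first pass
        }

        // Fetch all As for this B with one request.
        NSFetchRequest *aRequest = [NSFetchRequest fetchRequestWithEntityName:@"A"];
        aRequest.predicate = [NSPredicate predicateWithFormat:@"externalId IN %@",
                              [pendingLinks objectForKey:bId]];
        NSArray *as = [context executeFetchRequest:aRequest error:error];
        if (as == nil) {
            return NO;
        }

        // The relationship is loaded once for this B, and every A is added in one go.
        [[b mutableSetValueForKey:@"as"] addObjectsFromArray:as];

        // Save and reset only after this B is completely linked, so its
        // relationship never has to be reloaded for a later batch.
        if (![context save:error]) {
            return NO;
        }
        [context reset];
    }
    return YES;
}
```

The point of the design is where the reset happens: once per B, after all of its As are attached, rather than every 500 inserts.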
Another workaround is, instead of resetting the whole context at regular intervals, to refresh only the objects you want to get rid of.
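A minimal sketch of that selective refresh, assuming the importer keeps an array of the objects created in the current batch (the helper name SaveAndReleaseImported and the importedObjects parameter are hypothetical):

```objc
#import <CoreData/CoreData.h>

// Hypothetical helper: saves the context, then turns only the given objects
// back into faults instead of resetting the whole context.
static BOOL SaveAndReleaseImported(NSManagedObjectContext *context,
                                   NSArray *importedObjects,
                                   NSError **error)
{
    // Save first: refreshing with mergeChanges:NO discards unsaved changes.
    if (![context save:error]) {
        return NO;
    }
    for (NSManagedObject *object in importedObjects) {
        // mergeChanges:NO turns the object into a fault and drops its cached data.
        [context refreshObject:object mergeChanges:NO];
    }
    return YES;
}
```

Objects that are not passed in, such as the one holding the large relationship, stay loaded in the context, so their relationship does not have to be fetched again for the next batch.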
One more thing: you could also consider making the relationship one-way in Core Data and using an explicit fetch to get the other side of the relationship.
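As a rough illustration of the explicit fetch, assuming Person keeps a to-many "otherPeople" relationship with no inverse and that the person passed in has already been saved (the helper name PeopleLinkingTo is hypothetical):

```objc
#import <CoreData/CoreData.h>

// Hypothetical helper: finds every Person whose one-way "otherPeople"
// relationship contains the given person, replacing the missing inverse.
static NSArray *PeopleLinkingTo(NSManagedObject *person, NSError **error)
{
    NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Person"];
    // ANY matches when at least one object in the to-many relationship equals person.
    request.predicate = [NSPredicate predicateWithFormat:@"ANY otherPeople == %@", person];
    // Returns nil on failure, with *error describing the problem.
    return [[person managedObjectContext] executeFetchRequest:request error:error];
}
```

The trade-off is that Core Data no longer keeps both sides consistent for you; the other side is computed only when this fetch is run.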
EDIT:
I think I found a workaround. You need to call the primitive accessors, so something like
[self.primitiveTags addObject:tag];
Preliminary tests seem to show that this does not force the other side of the relationship to fire.