 

CoreData and RestKit performance while importing very large datasets

I am using RestKit to fetch JSON data from various endpoints (on the iOS platform).

There are several questions on SO that point in the same direction, like this one:

Importing large datasets on iPhone using CoreData

But my question is still a different one, because I already know that if the JSON file gets too large, I have to cut it into chunks. I'll do that!

How exactly is the importing done with CoreData in RestKit?
It seems there is a parent/child context setup, which is very inefficient when importing large datasets in the shortest possible amount of time (maybe all at once at launch - no batch/lazy importing!).

See this post from Florian Kugler on performant importing in CoreData (Stacks)

My question is: Can I set up a different context, apart from the parent/child context setup RestKit already uses, run an RKManagedObjectRequestOperation that imports completely asynchronously on that other context, and then merge that context into the mainContext for fetching?

I really want to stick with CoreData instead of switching to plain SQLite, and get the best possible performance out of the combination of CoreData and RestKit.

I look forward to your professional answers. Maybe Blake could answer this question directly, too.

asked Jul 26 '13 by Fab1n


1 Answer

Well, first off, if you want maximum performance, and if you really need that, don't use RestKit, don't use AFNetworking and don't use NSJSONSerialization. They all suffer from design choices that don't play well with large datasets when your goal is to maintain a moderately low memory footprint and high performance.

You should have a very large single JSON (likely a JSON Array whose elements are JSON Objects) as the body of a single connection to get superior performance. Alternatively, you can have a custom transport format which sends multiple JSONs within one connection (say, a series of JSON Objects, separated by a "white space").
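For illustration only - this is an assumed, hypothetical wire format, not something RestKit or any library defines - such a transport could simply concatenate independent JSON documents separated by whitespace, so a chunk-capable parser can emit each document as soon as it completes:

```
{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
{"id": 3, "name": "Carol"}
```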

Having a large number of connections is definitely slow.

When you strive to achieve the fastest performance, you should simultaneously download, parse the JSON, create the representation and save it to the persistent store.

Note:

When doing this all in parallel, you are especially vulnerable to connection errors, and keeping a consistent and logically correct data set may become a challenge. Thus, if your connection suffers from bad quality and frequent interruptions, you may first download and save the JSON to a temporary file (ideally also supporting HTTP range headers, so you can suspend and resume a download). Sure, your performance decreases - but under these conditions you can't make it fast anyway.
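A minimal sketch of the resume case, assuming the server supports byte-range requests (the URL and offset below are placeholders):

```objc
#import <Foundation/Foundation.h>

// Placeholder URL and offset - replace with your endpoint and the number of
// bytes already written to the temporary file.
NSURL *url = [NSURL URLWithString:@"https://example.com/large-dataset.json"];
unsigned long long resumeOffset = 123456;

NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
[request setValue:[NSString stringWithFormat:@"bytes=%llu-", resumeOffset]
   forHTTPHeaderField:@"Range"];
// Start the connection with this request and append the received data to the
// temporary file; parse the file once the download has completed.
```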

So again, when your goal is maximum performance, you should utilize all of the CPU's capabilities, that is, run as much in parallel as makes sense - and this is especially the case when the connection is fast.

The JSON parser should also be able to parse "chunks" - that is, partial JSON contained in an NSData object, since this is what we get from connection:didReceiveData:.

When you receive the JSON data, you need to "map" this into a suitable representation. Usually, the known JSON parsers create a "Foundation representation". However, the faster approach is to directly create the eventual desired kind of objects from the JSON. This requires a "SAX style API" - that is basically a simplified version of a parser which sends "parse events" to a delegate or client - for example "got JSON-Array begin" or "got JSON Boolean False", etc. and custom code that receives these events and constructs the desired object on the fly.
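To make the idea concrete, a SAX-style interface might look roughly like this. The protocol and class names are hypothetical - this is not the API of NSJSONSerialization or of any particular third-party parser:

```objc
#import <Foundation/Foundation.h>

// Hypothetical SAX-style parse events; the client builds its model objects
// directly from these callbacks instead of an intermediate Foundation tree.
@protocol SAXJSONParserDelegate <NSObject>
- (void)parserDidStartArray;
- (void)parserDidEndArray;
- (void)parserDidStartObject;
- (void)parserDidEndObject;
- (void)parserFoundKey:(NSString *)key;
- (void)parserFoundString:(NSString *)string;
- (void)parserFoundNumber:(NSNumber *)number;
- (void)parserFoundBool:(BOOL)value;
- (void)parserFoundNull;
@end

// Example client: constructs the desired objects on the fly as events arrive.
@interface EmployeeBuilder : NSObject <SAXJSONParserDelegate>
@property (nonatomic, readonly) NSArray *employees; // built incrementally
@end
```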

This all requires a JSON parser having features you won't find in NSJSONSerialization: a SAX-Style API, "chunk parsing", or parsing input which is a series of JSON documents.

In order to maximize the utilization of CPU, disk and network, you divide your "tasks" into CPU-bound, I/O-bound and network-bound operations, and create and run as many of them in parallel as is sane for the system. These tasks all run asynchronously; each takes an input, processes it, and produces an output which is the input of the next asynchronous task. The first task notifies the next one when it is finished, for example via completion handlers (blocks), and passes its output via parameters.
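A rough sketch of such a pipeline, where each stage hands its output to the next via a completion block (the method names are placeholders for your own networking, parsing and persistence code, not any library's API):

```objc
// Network-bound -> CPU-bound -> I/O-bound, each stage asynchronous on its own queue.
[self downloadNextChunkWithCompletion:^(NSData *chunk, NSError *error) {
    if (error) { /* handle or retry */ return; }
    [self parseChunk:chunk completion:^(NSArray *objects, NSError *parseError) {
        if (parseError) { /* handle */ return; }
        [self saveObjects:objects completion:^(NSError *saveError) {
            // report progress, then request the next chunk
        }];
    }];
}];
```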

Processing incoming "chunks" of JSON data, that is parsing and creating the representation , is a CPU-bound operation. This is usually quite fast however, and I don't think that it is worth the effort to dispatch these CPU-bound tasks on all available CPUs by means of a concurrent queue.

Processing incoming "chunks" of JSON data can be implemented in basically two approaches, again with pros and cons:

Processing Partial JSON Data Asynchronously

When you get a "chunk" in connection:didReceiveData: you can asynchronously schedule this onto a different queue for processing (that is parsing and creating the representation) running on a different thread than the delegate.

Pros: the delegate immediately returns, thereby NOT blocking the delegate thread, which in turn results in fastest reading of incoming network data and moderately small network buffers. The connection is finished in the shortest possible duration.

Cons: if processing is slow compared to receiving the data, you may queue up a large number of NSData objects in blocks waiting to be executed on a serial dispatch queue. Each of those blocks keeps its NSData object allocated - and system RAM may eventually become exhausted, and you may get memory warnings or crashes unless you take appropriate action.
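A sketch of this asynchronous variant; self.parseQueue is assumed to be a serial dispatch queue created once, and self.parser stands for an assumed chunk-capable JSON parser (not any specific library's API):

```objc
- (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data
{
    // Return immediately so the delegate thread keeps draining the socket.
    dispatch_async(self.parseQueue, ^{
        [self.parser parseChunk:data]; // CPU-bound work off the delegate thread
    });
    // Caveat from above: if parsing is slower than the network, the queued
    // NSData objects accumulate and memory grows - add back-pressure if needed.
}
```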

Processing Partial JSON Data Synchronously

When receiving a chunk of the JSON, the parser will be invoked synchronously with respect to the delegate's thread.

Pros: This avoids the memory issue when the data processing is slow compared to receiving the data. However, this may eventually stall the reading of data from the network (once the internal receive-buffer is full).

Cons: If the processing is slow and the internal network buffers become full, this will increase the time the connection is active and thus increase the probability that the connection will be dropped.
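And the synchronous variant, again with self.parser as an assumed chunk-capable parser:

```objc
- (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data
{
    // Parse on the delegate thread; this returns only after the chunk is processed.
    // No NSData objects pile up, but slow parsing backs up the socket's
    // receive buffer and prolongs the connection.
    [self.parser parseChunk:data];
}
```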

Both approaches benefit from a fast parser/representation-generator, and require a parser which can process "chunks" of JSON as NSData objects and asynchronously notify a client when it is finished with the representation. Optionally, it should also have a "SAX style" API. There are two third-party JSON parsers I know of that fulfill these requirements:

jsonlite and this

JPJson

Both are very fast (faster than JSONKit and NSJSONSerialization), support SAX style parsing and can process JSON in chunks as NSData objects. JPJson additionally can process a file containing multiple JSONs.

(Disclosure: I'm the author of JPJson)

When a representation is created, the next step is to create and initialize the managed objects (unless the parser directly generates managed objects) and save them to the persistent store. This is an I/O- and CPU-bound operation - but likely more CPU-bound when SSD storage is used. I would schedule this process onto a separate queue and examine how it works in conjunction with the other CPU-bound operations. Depending on the speed of the network, the whole process becomes more CPU-bound as bandwidth increases.
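A minimal sketch of that last stage - and roughly what the question asks about: import on a private-queue context and merge the saves into the main context for fetching. The persistent store coordinator and main context are assumed to exist already; whether this fits cleanly alongside RestKit's own parent/child setup is a separate question:

```objc
// Background import context backed directly by the persistent store coordinator.
NSManagedObjectContext *importContext =
    [[NSManagedObjectContext alloc] initWithConcurrencyType:NSPrivateQueueConcurrencyType];
importContext.persistentStoreCoordinator = self.persistentStoreCoordinator;

// Merge each background save into the main context so fetches see the new data.
[[NSNotificationCenter defaultCenter]
    addObserverForName:NSManagedObjectContextDidSaveNotification
                object:importContext
                 queue:[NSOperationQueue mainQueue]
            usingBlock:^(NSNotification *note) {
                [self.mainContext mergeChangesFromContextDidSaveNotification:note];
            }];

[importContext performBlock:^{
    // Create and initialize managed objects from the parsed representation here,
    // then save in batches to keep memory bounded.
    NSError *error = nil;
    if (![importContext save:&error]) {
        NSLog(@"Import save failed: %@", error);
    }
}];
```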

A scalable approach which takes bad and good connections into account, strives to maintain a low memory footprint, and maximizes performance is quite difficult to achieve, though - and a challenging programming task. Have fun! ;)

answered Nov 04 '22 by CouchDeveloper