How do professionals handle thousands, hundreds-of-thousands, or potentially millions of JSON objects?
I recently completed a small app that requested a dozen or so JSON objects (movie objects I generated myself). Because I was working with so few objects, there was no need to code in an efficient manner when I wanted to parse and search the JSON for specific objects.
But if I were working on a real professional app and it received 100k JSON objects, how would I even handle them?
For example, let's say those 100k objects were movies, each with a type and a list of actors. Would I really parse those 100k objects into an array and then loop through it to find the objects of interest?
What if instead of 100k we have a million movie JSON objects in the back-end? A million-entry array, or frequently scanning all million objects, seems really inefficient and could slow the front-end.
For my small app I just saved all 22 JSON objects to a global array called "allMovies" and did a simple linear search to find what I needed, but again, if I have a million movie objects I do not see how my app (in its current state) would scale.
I'm still very new to this, but that's essentially my question: how to efficiently store a large number of JSON objects (that the back-end has received) and search them efficiently. I'm looking for guidance on techniques or data structures I could implement.
The little app I made was in node.js.
Professionals use a database.
The first thing to realize is that you are not working with JSON objects. You are working with data. JSON just happens to be the format you receive the data in, but it could have been XML or CSV or ASN.1 or Bencoding or Protobuf - the format of the data doesn't matter, only the content matters.
Now, what type of database to use depends on the data, the rate at which you receive the data and what you want to do with the data. Sometimes you will be forced to use more than one type of database.
SQL/relational databases excel when the data is structured or has complex relationships. A properly designed SQL database separates different parts of the data into different tables and then defines the relationships between those tables - for example, you would have an actors table to store all actors, a movies table to store all movies, and a cast table linking actors to movies. This avoids duplicating data, which matters especially with huge datasets.
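The normalized actors/movies/cast layout can be sketched in plain JavaScript (the table and field names here are illustrative, not a real schema): each actor's name is stored exactly once, and the cast table holds only id pairs.

```javascript
// Three "tables" modeled as arrays with id-based references (sample data).
const actors = [
  { id: 1, name: "Actor A" },
  { id: 2, name: "Actor B" },
];
const movies = [
  { id: 10, title: "Movie X", type: "drama" },
  { id: 11, title: "Movie Y", type: "comedy" },
];
// The cast table links actors to movies; no names or titles are duplicated.
const cast = [
  { actorId: 1, movieId: 10 },
  { actorId: 1, movieId: 11 },
  { actorId: 2, movieId: 10 },
];

// A "join": find the titles of all movies featuring a given actor.
function moviesFor(actorName) {
  const actor = actors.find((a) => a.name === actorName);
  return cast
    .filter((c) => c.actorId === actor.id)
    .map((c) => movies.find((m) => m.id === c.movieId).title);
}

console.log(moviesFor("Actor A")); // -> ["Movie X", "Movie Y"]
```

In a real SQL database the join above is a single query over indexed columns, so it stays fast even with millions of rows.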
Hierarchical databases such as LDAP offer very fast lookups especially when implemented on massively parallel clusters. This is because the lookup routing can take advantage of the data hierarchy. Telephone systems have standardized on hierarchical databases because of this.
Document databases such as MongoDB and Elasticsearch (built on Lucene) excel at very fast data inserts and relatively fast queries. In the simplest case the database simply saves your JSON data directly to a new file (yes, most document databases are JSON based). However, there is generally no de-duplication of data, so if you have a database of movies, the actors' names will be duplicated in every movie they appear in. On the other hand, if you have a database of actors, the movie titles will be duplicated. This also illustrates that you need to design the structure of a document database carefully and choose the correct root object to represent all the data.
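To make the duplication trade-off concrete, here is a hypothetical document-database shape (sample data, not from the question): each movie document embeds its full cast, so a shared actor appears in every document they act in, but queries are simple scans over whole documents.

```javascript
// Movie-rooted documents: "Actor A" is stored twice because they appear
// in two movies - the price of a document model with no de-duplication.
const movieDocs = [
  { title: "Movie X", type: "drama", actors: ["Actor A", "Actor B"] },
  { title: "Movie Y", type: "comedy", actors: ["Actor A"] }, // "Actor A" again
];

// A query needs no joins; it is a filter over self-contained documents.
const dramas = movieDocs.filter((m) => m.type === "drama");
console.log(dramas.map((m) => m.title)); // -> ["Movie X"]
```

If you instead rooted the data at actors, each actor document would embed its movie titles and the duplication would flip, which is why choosing the root object matters.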
There are other database types but they tend to be more esoteric and used in very specific use-cases such as caching, logging etc.
Interesting question. There is no single correct answer to this. Each of us will provide an answer based on how well a particular solution has worked for us.
Let me try to provide a solution, along with a set of steps you can take to arrive at your own.
Here are some characteristics of your data that we can use for our solution: it should be highly searchable in real time, it consists mostly of JSON objects, and it is dynamic and subject to change. Given that, we can go with Elasticsearch, MongoDB, or any other NoSQL database that supports text search.
Now that we have a database decided we can proceed to design the data flow.
An important step here is database design - how you effectively create references between entities. The only person who can do this is you, as you have the best understanding of the domain.
Step 1: The movie objects first have to be parsed and inserted into the database and/or Elasticsearch indexes. This, I guess, is something you have already done on a smaller scale by storing the objects in the allMovies array, which can act as a client-side buffer. When the buffer is full, you can offload the allMovies array to the backend by making a REST API call (via XHR/AJAX) from your app.
// Incoming movie from a form, another source, etc.
const newMovie = { /* your data */ };
allMovies.push(newMovie);

if (allMovies.length >= 20) {
    // make an API call to the backend with the buffered movies,
    // then empty the buffer
    allMovies = [];
}
// else wait for the next movie
Step 2: On the backend, just store the data in your database and index the most frequently searched fields.
Note that in Elasticsearch, indexing a document also inserts it, whereas MongoDB has separate insert and index operations.
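What "index the most searched fields" buys you can be sketched with an in-memory stand-in (illustrative data and names; a real database index works similarly but on disk): group the movies by a frequently queried field once, so lookups no longer scan the whole collection.

```javascript
// Sample collection - imagine a million of these instead of three.
const movies = [
  { title: "Movie X", type: "drama" },
  { title: "Movie Y", type: "comedy" },
  { title: "Movie Z", type: "drama" },
];

// Build the "index" once: type -> list of movies with that type.
const byType = new Map();
for (const m of movies) {
  if (!byType.has(m.type)) byType.set(m.type, []);
  byType.get(m.type).push(m);
}

// A lookup is now a single map access instead of a linear scan.
console.log(byType.get("drama").map((m) => m.title)); // -> ["Movie X", "Movie Z"]
```

This is exactly the inefficiency you worried about with the million-entry array: without an index every query touches every object, with one it touches only the matches.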
Step 3: This is the part where you show your movie database and let users search through it. Here you will have to create a new API that lets your users (and thus the app frontend) perform custom searches. There can be multiple API endpoints, or a single endpoint that accepts multiple parameters such as search and sort. For example, an Elasticsearch URI search looks like:
GET /twitter/_search?q=tag:wow
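The core of such a single search endpoint can be sketched like this (the handler, parameter names, and data are hypothetical, not from the question; a real version would query the database instead of an array):

```javascript
// Sample data standing in for the database.
const movies = [
  { title: "Movie Y", type: "comedy" },
  { title: "Movie X", type: "drama" },
];

// e.g. GET /movies?search=movie&sort=title would end up calling this:
function searchMovies({ search = "", sort } = {}) {
  const results = movies.filter((m) =>
    m.title.toLowerCase().includes(search.toLowerCase())
  );
  if (sort === "title") {
    results.sort((a, b) => a.title.localeCompare(b.title));
  }
  return results;
}

console.log(searchMovies({ search: "movie", sort: "title" }).map((m) => m.title));
// -> ["Movie X", "Movie Y"]
```

Keeping search and sort as query parameters on one endpoint keeps the frontend simple and lets you add filters later without new routes.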
The final step is to integrate all of this so that results are delivered to your app in a seamless fashion. Your mileage may vary depending on your requirements.
