 

How can you store and modify large datasets in node.js?

Basics

So basically I have written a program which generates test data for MongoDB in Node.

I have put a bounty on this question to receive more answers. To check the current solution, please scroll down to the Temporary Solution section below!

The problem

The program reads a schema file and generates a specified amount of test data from it. The problem is that this data can eventually become quite big (think about creating 1M users with all the properties they need, plus 20M chat messages with userFrom and userTo), and the program has to keep all of it in RAM to modify/transform/map it before saving it to a file.

How it works

The program works like this:

  1. Read schema file
  2. Create test data from the schema and store it in a structure (look down below for the structure)
  3. Run through this structure and link every field's referenceTo to a random object with a matching referenceKey
  4. Transform the object structure into a string[] of MongoDB insert statements (a sketch of steps 3 and 4 follows the interface definitions below)
  5. Store that string[] in a file.

This is the structure of the generated test data:

export interface IGeneratedCollection {
    dbName: string,                 // Name of the database
    collectionName: string,         // Name of the collection
    documents: IGeneratedDocument[] // One collection has many documents
}

export interface IGeneratedDocument {
    documentFields: IGeneratedField[] // One document has many fields (which are recursive, because of nested documents)
}

export interface IGeneratedField {
    fieldName: string, // Name of the property
    fieldValue: any,   // Value of the property (can also be IGeneratedField, IGeneratedField[], ...)
    fieldNeedsQuotations?: boolean, // If the value needs to be saved with " ... "
    fieldIsObject?: boolean,        // If the value is an object (stored as IGeneratedField[]) (to handle it differently when transforming to MongoDB inserts)
    fieldIsJsonObject?: boolean,    // If the value is a plain JSON object
    fieldIsArray?: boolean,         // If the value is an array of objects (stored as an array of IGeneratedField[])
    referenceKey?: number,          // Field flagged to be a key
    referenceTo?: number            // Value gets set to a random object with matching referenceKey
}
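To make steps 3 and 4 above more concrete, here is a rough sketch written against these interfaces. It is not the actual implementation: the walk helper, the random pick and especially the mongo-shell style insert format are assumptions.

function walkFields(fields: IGeneratedField[], visit: (f: IGeneratedField) => void): void {
    for (const f of fields) {
        visit(f);
        if (f.fieldIsObject) walkFields(f.fieldValue as IGeneratedField[], visit);
        if (f.fieldIsArray) (f.fieldValue as IGeneratedField[][]).forEach(nested => walkFields(nested, visit));
    }
}

// Step 3: set every referenceTo field to the value of a random field
// that carries a matching referenceKey
function resolveReferences(collections: IGeneratedCollection[]): void {
    const byKey = new Map<number, IGeneratedField[]>();
    const pending: IGeneratedField[] = [];

    for (const c of collections)
        for (const d of c.documents)
            walkFields(d.documentFields, f => {
                if (f.referenceKey !== undefined) {
                    const list = byKey.get(f.referenceKey) ?? [];
                    list.push(f);
                    byKey.set(f.referenceKey, list);
                }
                if (f.referenceTo !== undefined) pending.push(f);
            });

    for (const f of pending) {
        const candidates = byKey.get(f.referenceTo!) ?? [];
        if (candidates.length > 0) {
            f.fieldValue = candidates[Math.floor(Math.random() * candidates.length)].fieldValue;
        }
    }
}

// Step 4: render one document as a mongo-shell style insert statement
function renderFields(fields: IGeneratedField[]): string {
    return `{ ${fields.map(f => `"${f.fieldName}": ${renderValue(f)}`).join(", ")} }`;
}

function renderValue(f: IGeneratedField): string {
    if (f.fieldIsArray) return `[${(f.fieldValue as IGeneratedField[][]).map(renderFields).join(", ")}]`;
    if (f.fieldIsObject) return renderFields(f.fieldValue as IGeneratedField[]);
    if (f.fieldIsJsonObject) return JSON.stringify(f.fieldValue);
    return f.fieldNeedsQuotations ? `"${f.fieldValue}"` : String(f.fieldValue);
}

function toInsertStatement(c: IGeneratedCollection, doc: IGeneratedDocument): string {
    return `db.getSiblingDB("${c.dbName}").${c.collectionName}.insert(${renderFields(doc.documentFields)});`;
}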

Actual data

So in the example with 1M Users and 20M messages it would look like this:

  • 1x IGeneratedCollection (collectionName = "users")
    • 1Mx IGeneratedDocument
      • 10x IGeneratedField (For example each user has 10 fields)
  • 1x IGeneratedCollection (collectionName = "messages")
    • 20Mx IGeneratedDocument
      • 3x IGeneratedField (message, userFrom, userTo)

Which would result in 190M instances of IGeneratedField (1x1Mx10 + 1x20Mx3x3 = 190M).

Conclusion

This is obviously a lot to handle for the RAM as it needs to store all of that at the same time.
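To put that in perspective: assuming something on the order of 100 bytes of V8 object overhead per IGeneratedField (a rough guess, not a measurement), 190M instances alone would take roughly 19 GB, before even counting the strings held in fieldName and fieldValue.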

Temporary Solution

It now works like this:

  1. Generate 500 documents (rows in SQL) at a time (see the sketch after this list)
  2. JSON.stringify those 500 documents and put them in a SQLite table with the schema (dbName STRING, collectionName STRING, value JSON)
  3. Remove those 500 documents from JS and let the Garbage Collector do its thing
  4. Repeat until all data is generated and in the SQLite table
  5. Take one of the rows (each containing 500 documents) at a time, apply JSON.parse and search for keys in them
  6. Repeat until all data is queried and all keys retrieved
  7. Take one of the rows at a time, apply JSON.parse and search for key references in them
  8. Apply JSON.stringify and update the row if necessary (if key references found and resolved)
  9. Repeat until all data is queried and all keys are resolved
  10. Take one of the rows at a time, apply JSON.parse and transform the documents to valid sql/mongodb inserts
  11. Add the insert (string) in a SQLite table with the schema (singleInsert STRING)
  12. Remove the old and now unused row from the SQLite table
  13. Write all inserts to file (if run from the command line) or return a dataHandle to query the data in the SQLite table (if run from other node app)
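A minimal sketch of the batching part (steps 1 to 3 and the read-back used by the later passes), assuming the better-sqlite3 package, which may differ from the SQLite binding actually used; the table layout and the generate callback are placeholders, and IGeneratedDocument is the interface defined above.

import Database from "better-sqlite3";

const BATCH_SIZE = 500;

const db = new Database("generated.db");
db.exec("CREATE TABLE IF NOT EXISTS batches (dbName TEXT, collectionName TEXT, value TEXT)");

const insertBatch = db.prepare("INSERT INTO batches (dbName, collectionName, value) VALUES (?, ?, ?)");

// Steps 1-3: generate 500 documents at a time, stringify the whole batch and
// hand it to SQLite so the JS objects can be garbage collected right away.
// The generate callback stands in for the project's own document generator.
function storeCollection(
    dbName: string,
    collectionName: string,
    total: number,
    generate: (count: number) => IGeneratedDocument[]
): void {
    for (let done = 0; done < total; done += BATCH_SIZE) {
        const docs = generate(Math.min(BATCH_SIZE, total - done));
        insertBatch.run(dbName, collectionName, JSON.stringify(docs));
    }
}

// The later passes (steps 5-10) can then stream the batches back out one row
// at a time, so only 500 documents ever need to be parsed into memory at once.
function* readBatches(): Generator<IGeneratedDocument[]> {
    for (const row of db.prepare("SELECT value FROM batches").iterate()) {
        yield JSON.parse((row as { value: string }).value);
    }
}

Because each row holds one whole batch as a single JSON string, only 500 documents ever need to be parsed back into memory at once.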

This solution does handle the RAM problem, because SQLite keeps the data on the hard drive and only holds a limited cache in RAM.

BUT

As you can see, there are a lot of JSON.parse and JSON.stringify calls involved, which slows down the whole process drastically.

What I have thought:

Maybe I should modify the IGeneratedField to only use shortened names for its properties (fieldName -> fn, fieldValue -> fv, fieldIsObject -> fio, fieldIsArray -> fia, ...)

This would make the needed storage in the SQLite table smaller, BUT it would also make the code harder to read
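One way to get the smaller storage without making the rest of the code harder to read would be to confine the short names to serialization, for example with a JSON.stringify replacer and a JSON.parse reviver. This is only a sketch of the idea; the short names below are the ones suggested above plus guesses for the remaining properties.

const LONG_TO_SHORT: Record<string, string> = {
    fieldName: "fn",
    fieldValue: "fv",
    fieldNeedsQuotations: "fnq",   // guessed short name
    fieldIsObject: "fio",
    fieldIsJsonObject: "fijo",     // guessed short name
    fieldIsArray: "fia",
    referenceKey: "rk",            // guessed short name
    referenceTo: "rt",             // guessed short name
};

const SHORT_TO_LONG: Record<string, string> = Object.fromEntries(
    Object.entries(LONG_TO_SHORT).map(([long, short]) => [short, long])
);

function renameKeys(value: unknown, mapping: Record<string, string>): unknown {
    if (value === null || typeof value !== "object" || Array.isArray(value)) return value;
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value)) out[mapping[k] ?? k] = v;
    return out;
}

// The replacer/reviver recurse through nested objects, so the rest of the
// code never sees the short names - only the stored JSON does.
const toShortJson = (value: unknown) =>
    JSON.stringify(value, (_key, val) => renameKeys(val, LONG_TO_SHORT));

const fromShortJson = (json: string) =>
    JSON.parse(json, (_key, val) => renameKeys(val, SHORT_TO_LONG));

Only the JSON stored in SQLite would contain fn/fv/...; every object the rest of the program works with keeps the full property names.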

Use a document-oriented database (but I have not really found a fitting one) to handle the JSON data better.

The Question

Is there any better solution to handle big objects like this in node?

Is my temporary solution OK? What is bad about it? Can it be changed to perform better?

Asked Jun 10 '19 by MauriceNino

1 Answer

Conceptually, generate items in a stream.

You don't need all 1M users in the db at once. You could add 10k at a time.

For the messages, randomly sample 2n users from the db and have them send messages to each other. Repeat until satisfied.

Example:

// Assume Users and Messages are both db.collections
// Assume functions generateUser() and generateMessage(u1, u2) exist.
const _ = require('lodash'); // provides _.range and _.chunk

const desiredUsers = 10000;
const desiredMessages = 5000000;
const blockSize = 1000;


(async () => {

    // Insert users one block at a time instead of keeping all of them in memory
    for (const i of _.range(desiredUsers / blockSize)) {
        const users = _.range(blockSize).map(generateUser);
        await Users.insertMany(users);
    }

    // Sample 2 * blockSize random users from the db and pair them up into messages
    for (const i of _.range(desiredMessages / blockSize)) {
        const users = await Users.aggregate([ { $sample: { size: 2 * blockSize } } ]).toArray();
        const messages = _.chunk(users, 2).map( (usr) => generateMessage(usr[0], usr[1]));
        await Messages.insertMany(messages);
    }

})();

Depending on how you tweak the stream, you get a different distribution. The example above gives a uniform distribution. You can get a more long-tailed distribution by interleaving the user and message generation, which you might want for something like message boards.
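A sketch of that interleaving idea, reusing the same assumptions as the example above (lodash, the Users and Messages collections, generateUser and generateMessage); the round sizes are made up. Users inserted in early rounds take part in every later sampling round, so they accumulate far more messages than late arrivals.

// Reuses the assumptions from the example above:
// Users and Messages are db.collections, generateUser() and generateMessage(u1, u2) exist.
const _ = require('lodash');

const rounds = 100;
const usersPerRound = 100;
const messagesPerRound = 1000;

(async () => {

    for (const round of _.range(rounds)) {
        // Add a small block of new users each round
        const users = _.range(usersPerRound).map(generateUser);
        await Users.insertMany(users);

        // Sample pairs from everyone inserted so far, not just this round's users,
        // so early users keep collecting messages round after round
        const sampled = await Users.aggregate([ { $sample: { size: 2 * messagesPerRound } } ]).toArray();
        const messages = _.chunk(sampled, 2).map((pair) => generateMessage(pair[0], pair[1]));
        await Messages.insertMany(messages);
    }

})();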

Memory usage

Memory usage went to 200 MB after I switched the blockSize to 1000.
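The answer does not show how this was measured; one simple way (an assumption, not necessarily what was done here) is to log process.memoryUsage() after each insertMany block:

// Could be called after each insertMany block to watch memory while generating
function logMemory(label: string): void {
    const { rss, heapUsed } = process.memoryUsage();
    const toMb = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);
    console.log(`${label}: rss=${toMb(rss)} MB, heapUsed=${toMb(heapUsed)} MB`);
}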

Timing

Answered Oct 14 '22 by blackening