 

Reading large JSON file in Deno

Tags:

deno

I often find myself reading a large JSON file (usually an array of objects), manipulating each object, and then writing the results back to a new file.

To achieve this in Node (at least the reading portion), I usually do something like the following, using the stream-json module.

const fs = require('fs');
const StreamArray = require('stream-json/streamers/StreamArray');

const pipeline = fs.createReadStream('sample.json')
  .pipe(StreamArray.withParser());

pipeline.on('data', data => {
    //do something with each object in file
});

I've recently discovered Deno and would love to be able to do this workflow with Deno.

It looks like the readJSON method from the Standard Library reads the entire contents of the file into memory so I don't know if it would be a good fit for processing a large file.

Is there a way this can be done by streaming the data from the file using some of the lower level methods that are built into Deno?
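For reference, the non-streaming approach would be to read and parse the whole file in one go, which is what I'd like to avoid for very large inputs. A minimal sketch using current Deno APIs (the filename is a placeholder):

// Whole-file approach: the entire array is parsed and held in memory at once.
const objects = JSON.parse(await Deno.readTextFile("sample.json"));

for (const object of objects) {
    // do something with each object
}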

asked Sep 23 '19 by Billy Kirk



3 Answers

Circling back on this now that Deno 1.0 is out, in case anyone else is interested in doing something like this. I was able to piece together a small class that works for my use case. It's not nearly as robust as something like the stream-json package, but it handles large JSON arrays just fine.

import { EventEmitter } from "https://deno.land/std/node/events.ts";

export class JSONStream extends EventEmitter {

    private openBraceCount = 0;
    private tempUint8Array: number[] = [];
    private decoder = new TextDecoder();

    constructor (private filepath: string) {
        super();
        this.stream();
    }

    async stream() {
        console.time("Run Time");
        let file = await Deno.open(this.filepath);
        //creates iterator from reader, default buffer size is 32kb
        for await (const buffer of Deno.iter(file)) {

            for (let i = 0, len = buffer.length; i < len; i++) {
                const uint8 = buffer[ i ];

                //remove whitespace
                if (uint8 === 10 || uint8 === 13 || uint8 === 32) continue;

                //open brace
                if (uint8 === 123) {
                    if (this.openBraceCount === 0) this.tempUint8Array = [];
                    this.openBraceCount++;
                }

                this.tempUint8Array.push(uint8);

                //close brace
                if (uint8 === 125) {
                    this.openBraceCount--;
                    if (this.openBraceCount === 0) {
                        const uint8Ary = new Uint8Array(this.tempUint8Array);
                        const jsonString = this.decoder.decode(uint8Ary);
                        const object = JSON.parse(jsonString);
                        this.emit('object', object);
                    }
                }
            }
        }
        file.close();
        console.timeEnd("Run Time");
    }
}

Example usage

const stream = new JSONStream('test.json');

stream.on('object', (object: any) => {
    // do something with each object
});
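To round out the original workflow (manipulate each object, then write to a new file), here is a rough sketch building on the class above. It assumes the class is saved as ./json_stream.ts; the output filename and the transform are placeholders. Since the class only emits 'object' events, this writes newline-delimited JSON incrementally rather than waiting for an end-of-stream signal.

import { JSONStream } from "./json_stream.ts"; // assumed path for the class above

const encoder = new TextEncoder();
// Open (or create) the output file for writing.
const out = Deno.openSync("output.ndjson", { write: true, create: true, truncate: true });

const stream = new JSONStream("test.json");

stream.on("object", (object: any) => {
    // Placeholder transform: tag each object, then append it as one JSON line.
    const transformed = { ...object, processed: true };
    out.writeSync(encoder.encode(JSON.stringify(transformed) + "\n"));
});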

Processing a ~4.8 MB JSON file with ~20,000 small objects in it:

[
    {
      "id": 1,
      "title": "in voluptate sit officia non nesciunt quis",
      "urls": {
         "main": "https://www.placeholder.com/600/1b9d08",
         "thumbnail": "https://www.placeholder.com/150/1b9d08"
      }
    },
    {
      "id": 2,
      "title": "error quasi sunt cupiditate voluptate ea odit beatae",
      "urls": {
          "main": "https://www.placeholder.com/600/1b9d08",
          "thumbnail": "https://www.placeholder.com/150/1b9d08"
      }
    }
    ...
]

Took 127 ms.

❯ deno run -A parser.ts
Run Time: 127ms
answered Oct 17 '22 by Billy Kirk


I think that a package like stream-json would be as useful on Deno as it is on Node.js, so one way to go would be to grab the source code of that package and make it work on Deno. (This answer will probably be outdated soon, because plenty of people do this kind of porting, and it won't take long until someone – maybe you – publishes a port that can be imported into any Deno script.)

Alternatively, although this doesn't directly answer your question, a common pattern for handling large JSON data sets is to use files that contain one JSON object per line (newline-delimited JSON). Hadoop, Spark, AWS S3 Select, and probably many others use this format. If you can get your input data into that format, a lot more tools become available to you, and you can then stream the data with the readString('\n') method from Deno's standard library: https://github.com/denoland/deno_std/blob/master/io/bufio.ts

This also has the advantage of fewer third-party dependencies. Example code:

import { BufReader } from "https://deno.land/std/io/bufio.ts";

async function stream_file(filename: string) {
    const file = await Deno.open(filename);
    const bufReader = new BufReader(file);
    console.log('Reading data...');
    // readString resolves to the next line, or Deno.EOF at the end of the file.
    let line: string | typeof Deno.EOF;
    let lineCount = 0;
    while ((line = await bufReader.readString('\n')) != Deno.EOF) {
        lineCount++;
        // do something with `line`.
    }
    file.close();
    console.log(`${lineCount} lines read.`);
}
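If the input starts out as one big JSON array, a one-off conversion to the newline-delimited format could look something like the sketch below (not part of the answer above; it loads the whole file into memory, so it only suits a one-time conversion of a file that still fits in RAM, and the filenames are placeholders):

const objects: unknown[] = JSON.parse(await Deno.readTextFile("input.json"));
const ndjson = objects.map((o) => JSON.stringify(o)).join("\n") + "\n";
await Deno.writeTextFile("output.ndjson", ndjson);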
answered Oct 17 '22 by Robert Jack Will


This is the code I used for a file with 13,147,089 lines of text. Notice it's the same as Robert's code but uses readLine() instead of readString('\n'); readLine() is a low-level line-reading primitive, and most callers should use readString('\n') or a Scanner instead.

import { BufReader } from "https://deno.land/std/io/bufio.ts";

export async function stream_file(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  // readLine() yields the raw bytes of each line (not a string), or Deno.EOF.
  let line;
  let lineCount = 0;
  while ((line = await bufReader.readLine()) != Deno.EOF) {
    lineCount++;
    // do something with `line`.
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}
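One difference worth noting: because readLine() hands back the raw bytes of each line rather than a string, a TextDecoder is needed to get text back. A small sketch under that assumption (the filename is a placeholder):

import { BufReader } from "https://deno.land/std/io/bufio.ts";

const decoder = new TextDecoder();
const file = await Deno.open("data.txt"); // placeholder filename
const bufReader = new BufReader(file);

let result;
while ((result = await bufReader.readLine()) != Deno.EOF) {
    // result.line is a Uint8Array holding the bytes of the line.
    const text = decoder.decode(result.line);
    // do something with `text`
}
file.close();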
answered Oct 17 '22 by Saeid Ostad