I often find myself reading a large JSON file (usually an array of objects), manipulating each object, and writing the results back to a new file.
To achieve this in Node (at least the reading portion), I usually do something like the following using the stream-json module:
const fs = require('fs');
const StreamArray = require('stream-json/streamers/StreamArray');

const pipeline = fs.createReadStream('sample.json')
  .pipe(StreamArray.withParser());

pipeline.on('data', data => {
  // do something with each object in the file
});
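For the write-back half of that workflow, a minimal sketch on top of the same pipeline might look like this (transform() and out.json are placeholders of mine, not part of the original setup; stream-json's StreamArray emits { key, value } pairs):

const out = fs.createWriteStream('out.json');
let first = true;
out.write('[\n');
pipeline.on('data', ({ value }) => {
  // transform() stands in for whatever per-object manipulation is needed
  const transformed = transform(value);
  out.write((first ? '' : ',\n') + JSON.stringify(transformed));
  first = false;
});
// close the JSON array and the output stream once the input is exhausted
pipeline.on('end', () => out.end('\n]\n'));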
I've recently discovered Deno and would love to be able to use this workflow there.
It looks like the readJSON method from the Standard Library reads the entire contents of the file into memory, so I don't know whether it would be a good fit for processing a large file.
Is there a way this can be done by streaming the data from the file using some of the lower level methods that are built into Deno?
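For reference, the kind of lower-level access that is built in looks roughly like this: a sketch (assuming Deno 1.x's Deno.open and Deno.File.read; the file name and buffer size are arbitrary) that reads the file in fixed-size chunks without loading all of it:

const file = await Deno.open("sample.json");
const buf = new Uint8Array(32 * 1024); // 32 KB read buffer
while (true) {
  const bytesRead = await file.read(buf);
  if (bytesRead === null) break; // null signals end of file
  const chunk = buf.subarray(0, bytesRead);
  // feed `chunk` into an incremental JSON parser here
}
file.close();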
Circling back on this now that Deno 1.0 is out, in case anyone else is interested in doing something like this. I was able to piece together a small class that works for my use case. It's not nearly as robust as something like the stream-json package, but it handles large JSON arrays just fine.
import { EventEmitter } from "https://deno.land/std/node/events.ts";

export class JSONStream extends EventEmitter {
  private openBraceCount = 0;
  private tempUint8Array: number[] = [];
  private decoder = new TextDecoder();

  constructor(private filepath: string) {
    super();
    this.stream();
  }

  async stream() {
    console.time("Run Time");
    const file = await Deno.open(this.filepath);
    // creates an async iterator from the reader; default buffer size is 32 KB
    for await (const buffer of Deno.iter(file)) {
      for (let i = 0, len = buffer.length; i < len; i++) {
        const uint8 = buffer[i];
        // skip whitespace (\n, \r, space)
        if (uint8 === 10 || uint8 === 13 || uint8 === 32) continue;
        // open brace
        if (uint8 === 123) {
          if (this.openBraceCount === 0) this.tempUint8Array = [];
          this.openBraceCount++;
        }
        this.tempUint8Array.push(uint8);
        // close brace
        if (uint8 === 125) {
          this.openBraceCount--;
          if (this.openBraceCount === 0) {
            const uint8Ary = new Uint8Array(this.tempUint8Array);
            const jsonString = this.decoder.decode(uint8Ary);
            const object = JSON.parse(jsonString);
            this.emit("object", object);
          }
        }
      }
    }
    file.close();
    console.timeEnd("Run Time");
  }
}
Example usage:
const stream = new JSONStream('test.json');
stream.on('object', (object: any) => {
  // do something with each object
});
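To complete the original read → transform → write workflow, one possible sketch on top of this class (transform(), the file names, and the newline-delimited output format are assumptions of mine, not part of the class; the class emits no end event, so closing the output file cleanly is left out of the sketch):

const out = await Deno.open("out.ndjson", { write: true, create: true, truncate: true });
const encoder = new TextEncoder();
// transform() stands in for whatever per-object manipulation is needed
const transform = (o: any) => ({ ...o, processed: true });
const parser = new JSONStream("test.json");
parser.on("object", async (object: any) => {
  // write each transformed object as one line of newline-delimited JSON
  await out.write(encoder.encode(JSON.stringify(transform(object)) + "\n"));
});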
Processing a ~4.8 MB JSON file with ~20,000 small objects in it:
[
  {
    "id": 1,
    "title": "in voluptate sit officia non nesciunt quis",
    "urls": {
      "main": "https://www.placeholder.com/600/1b9d08",
      "thumbnail": "https://www.placeholder.com/150/1b9d08"
    }
  },
  {
    "id": 2,
    "title": "error quasi sunt cupiditate voluptate ea odit beatae",
    "urls": {
      "main": "https://www.placeholder.com/600/1b9d08",
      "thumbnail": "https://www.placeholder.com/150/1b9d08"
    }
  },
  ...
]
Took 127 ms.
❯ deno run -A parser.ts
Run Time: 127ms
I think that a package like stream-json would be as useful on Deno as it is on Node.js, so one way to go might be to grab the source code of that package and make it work on Deno. (This answer will probably be outdated soon, because there are lots of people out there who do such things, and it won't take long until someone (maybe you) makes their result public and importable into any Deno script.)
Alternatively, although this doesn't directly answer your question, a common pattern for handling large JSON data sets is to use files that contain JSON objects separated by newlines (one JSON object per line). Hadoop and Spark, AWS S3 Select, and probably many others use this format. If you can get your input data into that format, it might let you use a lot more tools, and you could then stream the data with the readString('\n') method in Deno's standard library: https://github.com/denoland/deno_std/blob/master/io/bufio.ts
This also has the advantage of depending less on third-party packages. Example code:
import { BufReader } from "https://deno.land/std/io/bufio.ts";

async function stream_file(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  let line: string | null;
  let lineCount = 0;
  // readString resolves to null once the end of the file is reached
  while ((line = await bufReader.readString("\n")) !== null) {
    lineCount++;
    // do something with `line`.
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}
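If the source data starts out as a single large JSON array, a one-off conversion to the newline-delimited format could look like the sketch below (placeholder file names; it reads the whole source file into memory once, so it only makes sense as an ahead-of-time step on a machine that can afford that):

const objects: unknown[] = JSON.parse(await Deno.readTextFile("sample.json"));
// one JSON object per line, so the result can be streamed line by line afterwards
const ndjson = objects.map((obj) => JSON.stringify(obj)).join("\n") + "\n";
await Deno.writeTextFile("sample.ndjson", ndjson);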
This is the code I used for a file with 13,147,089 lines of text.
Notice it's the same as Roberts's code but uses readLine() instead of readString('\n').
readLine() is a low-level line-reading primitive. Most callers should use readString('\n') instead, or use a Scanner.
import { BufReader, ReadLineResult } from "https://deno.land/std/io/bufio.ts";

export async function stream_file(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  let line: ReadLineResult | null;
  let lineCount = 0;
  // readLine resolves to null at end of file; each result's `line` field is a Uint8Array
  while ((line = await bufReader.readLine()) !== null) {
    lineCount++;
    // do something with `line.line`.
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}
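If each of those lines is itself a JSON object (newline-delimited JSON), the raw bytes from readLine() still need to be decoded before parsing. A hedged variant of the function above, with the handler and file name as placeholders:

import { BufReader } from "https://deno.land/std/io/bufio.ts";

export async function stream_ndjson(filename: string, onObject: (obj: unknown) => void) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  const decoder = new TextDecoder();
  let result = await bufReader.readLine();
  while (result !== null) {
    // result.line is a Uint8Array without the trailing newline
    // (the `more` flag, set for lines longer than the buffer, is ignored in this sketch)
    onObject(JSON.parse(decoder.decode(result.line)));
    result = await bufReader.readLine();
  }
  file.close();
}

// Example: stream_ndjson("test.ndjson", (obj) => console.log(obj));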