 

Streamed JSON decoding using preferably Circe and Akka Streams

My use case is similar to this entry: I want to read a huge inner array (multiple gigabytes as text) from within a JSON object such as:

{ "a": "...",   // root level fields to be read, separately
  ...
  "bs": [       // the huge array, most of the payload (can be multiple GBs)
    {...},
    ...
  ]
}

The input is available as a Source[ByteString,_] (Akka Streams), and I already use Circe for JSON decoding elsewhere.

I can see two challenges:

  1. Reading the bs array in a streamed fashion (getting a Source[B,_] for consuming it).

  2. Splitting the original stream into two, so I can read and analyse the root level fields before the array begins.

Do you have any pointers for solving such a use case? So far I have looked at akka-stream-json and circe-iteratee.

akka-stream-json looks like the right fit, but it does not appear to be actively maintained. circe-iteratee does not seem to integrate with Akka Streams.

akauppi asked Apr 27 '26 19:04



1 Answer

Jawn has an async parser: https://github.com/non/jawn/blob/master/parser/src/main/scala/jawn/AsyncParser.scala

But it is hard to write an efficient async parser for JSON, because the format is inherently sequential.

If you can switch to synchronous parsing, you can use jsoniter-scala-core and write a simple custom codec that skips every key/value pair except "bs" and then parses the required data very quickly, without holding the array content in memory.
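A minimal sketch of such a codec follows, under a few assumptions: the element type `B(id: Int)` is hypothetical (the question does not give the shape of the "bs" items), its codec is derived with `JsonCodecMaker.make` from the separate jsoniter-scala-macros module, and `BsScanner`, `scanner` and `demo` are illustrative names. The codec returns no value; it only drives the reader over the root object, skips every other key, and hands each decoded "bs" element to a callback.

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.mutable.ArrayBuffer

// Hypothetical element type; the real shape of the "bs" items is an assumption.
final case class B(id: Int)

object BsScanner {
  // Element codec derived by the macros module (assumed to be on the classpath).
  implicit val bCodec: JsonValueCodec[B] = JsonCodecMaker.make

  // Decode-only codec: walks the root object and passes each "bs" element
  // to `onElement` as soon as it has been parsed.
  def scanner(onElement: B => Unit): JsonValueCodec[Unit] = new JsonValueCodec[Unit] {
    def nullValue: Unit = ()

    def encodeValue(x: Unit, out: JsonWriter): Unit =
      throw new UnsupportedOperationException("this codec only decodes")

    def decodeValue(in: JsonReader, default: Unit): Unit =
      if (in.isNextToken('{')) {
        if (!in.isNextToken('}')) {
          in.rollbackToken()
          while ({
            // Stream the "bs" array; skip the value of any other key.
            if (in.readKeyAsString() == "bs") readArray(in) else in.skip()
            in.isNextToken(',')
          }) ()
          if (!in.isCurrentToken('}')) in.objectEndOrCommaError()
        }
      } else in.decodeError("expected '{' at the root")

    private def readArray(in: JsonReader): Unit =
      if (in.isNextToken('[')) {
        if (!in.isNextToken(']')) {
          in.rollbackToken()
          while ({
            onElement(bCodec.decodeValue(in, bCodec.nullValue))
            in.isNextToken(',')
          }) ()
          if (!in.isCurrentToken(']')) in.arrayEndOrCommaError()
        }
      } else in.decodeError("expected '[' as the value of \"bs\"")
  }

  // Small in-memory demonstration; a real input would be a large InputStream.
  def demo(): List[B] = {
    val json = """{"a":"x","bs":[{"id":1},{"id":2}],"c":true}"""
    val seen = ArrayBuffer.empty[B]
    val codec = scanner { b => seen += b; () }
    readFromStream(new ByteArrayInputStream(json.getBytes(UTF_8)))(codec)
    seen.toList
  }
}
```

To feed it from the question's Source[ByteString,_], one option is to run the source into `StreamConverters.asInputStream()` and pass the resulting InputStream to `readFromStream`. That read is blocking, so it probably belongs on a dedicated dispatcher. The root level fields could be captured in the same loop by reading their values instead of calling `skip()`.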

Andriy Plokhotnyuk answered Apr 30 '26 10:04
