
In Julia, is it worth type-narrowing a dictionary returned by `JSON.parsefile`?


I’m writing Julia code that reads JSON files as input, performs analysis (in the field of mathematical finance) and writes the results as JSON. The code is a port from R, made in the hope of a performance improvement.

I parse the input files using JSON.parsefile. This returns a Dict in which I observe that all vectors are of type Array{Any,1}. As it happens, I know that the input file will never contain vectors of mixed type, such as some Strings and some Numbers. So I wrote the following code, which seems to work well and is “safe” in the sense that if the calls to convert fail then a vector continues to have type Array{Any,1}.

function typenarrow!(d::Dict)
    for k in keys(d)
        if d[k] isa Array{Any,1}
            d[k] = typenarrow(d[k])
        elseif d[k] isa Dict
            typenarrow!(d[k])
        end
    end
end

function typenarrow(v::Array{Any,1})
    for T in [String,Int64,Float64,Bool,Vector{Float64}]
        try
            return(convert(Vector{T},v))
        catch; end        
    end
    return(v)
end

My question is: Is this worth doing? Can I expect code that processes the contents of the Dict to execute faster if I do this type narrowing? I think the answer is yes in that the Julia performance tips recommend to “Annotate values taken from untyped locations” and this approach ensures there are no “untyped locations”.

Philip Swannell asked Feb 15 '19 12:02


1 Answer

There are two levels to the answer to this question:

Level 1

Yes, it will help the performance of the code. See for instance the following benchmark:

julia> using BenchmarkTools

julia> x = Any[1 for i in 1:10^6];

julia> y = [1 for i in 1:10^6];

julia> @btime sum($x)
  26.507 ms (477759 allocations: 7.29 MiB)
1000000

julia> @btime sum($y)
  226.184 μs (0 allocations: 0 bytes)
1000000

You can also write your typenarrow function in a somewhat simpler way:

typenarrow(x) = [v for v in x]

because a comprehension produces a vector with a concrete element type (assuming your source vector is homogeneous).
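A minimal sketch of how the comprehension behaves on a few hand-made inputs (the variable names are illustrative and not from the question). It also shows the caveat behind the homogeneity assumption: a vector that mixes integers and floats, as a JSON array like [1, 2.5] would, is not narrowed to a concrete element type.

```julia
# Comprehension-based narrowing, as suggested above.
typenarrow(x) = [v for v in x]

ints   = typenarrow(Any[1, 2, 3])     # homogeneous Int input
floats = typenarrow(Any[1.5, 2.5])    # homogeneous Float64 input
nums   = typenarrow(Any[1, 2.5])      # Int mixed with Float64
mixed  = typenarrow(Any[1, "a"])      # Int mixed with String

println(typeof(ints))
println(typeof(floats))
println(isconcretetype(eltype(nums)))   # element type stays abstract
println(isconcretetype(eltype(mixed)))  # element type stays abstract
```

If mixed integer/float arrays can occur in the input, the convert-based loop from the question may be the better fit, since it can fall back to Float64.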

Level 2

This is still not fully optimal. The remaining problem is that your Dict is a container with an abstract type parameter (see https://docs.julialang.org/en/latest/manual/performance-tips/#Avoid-containers-with-abstract-type-parameters-1). Therefore, for the computations to be fast, you have to either use a barrier function (see https://docs.julialang.org/en/latest/manual/performance-tips/#kernel-functions-1) or add type annotations to the variables you introduce (see https://docs.julialang.org/en/v1/manual/types/index.html#Type-Declarations-1).
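A rough sketch of both options, assuming a Dict shaped like narrowed JSON output. The function `total`, the keys and the data are hypothetical and only serve as illustration.

```julia
# Kernel ("barrier") function: Julia compiles a specialized version
# for each concrete argument type it is called with.
function total(v::AbstractVector)
    s = zero(eltype(v))
    for x in v
        s += x
    end
    return s
end

# Hypothetical data: the vectors are concretely typed, but the Dict's value type is Any.
d = Dict{String,Any}("v3" => [1, 2, 3], "w" => [1.5, 2.5])

# Option 1: the lookup d["v3"] cannot be inferred, but the call dispatches on the
# runtime type, so the loop inside `total` runs on a Vector{Int}.
println(total(d["v3"]))

# Option 2: annotate the value where it is taken out of the untyped container.
w = d["w"]::Vector{Float64}
println(sum(w))
```

The annotation in option 2 throws a TypeError if the stored value has a different type, so it also acts as a check on the input.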

In an ideal world your Dict would have keys and values of homogeneous types, and everything would then be maximally fast, but if I understand your code correctly the values in your case are not homogeneous.

EDIT

To solve the Level 2 issue you can convert the Dict into a NamedTuple like this (this is a minimal example that assumes Dicts are nested only directly inside other Dicts, but it should be easy enough to extend if you want more flexibility).

First, the function performing the conversion looks like:

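function typenarrow!(d::Dict)
    for k in keys(d)
        if d[k] isa Array{Any,1}
            # rebuild the vector so its element type is inferred from its contents
            d[k] = [v for v in d[k]]
        elseif d[k] isa Dict
            # recurse: a nested Dict is replaced by a nested NamedTuple
            d[k] = typenarrow!(d[k])
        end
    end
    # keys become field names; the value types become part of the NamedTuple type
    NamedTuple{Tuple(Symbol.(keys(d)))}(values(d))
end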

Now a minimal working example of its use:

julia> using JSON

julia> x = """
       {
         "name": "John",
         "age": 27,
         "values": {
           "v1": [1,2,3],
           "v2": [1.5,2.5,3.5]
         },
         "v3": [1,2,3]
       }
       """;

julia> j1 = JSON.parse(x)
Dict{String,Any} with 4 entries:
  "name"   => "John"
  "values" => Dict{String,Any}("v2"=>Any[1.5, 2.5, 3.5],"v1"=>Any[1, 2, 3])
  "age"    => 27
  "v3"     => Any[1, 2, 3]

julia> j2 = typenarrow!(j1)
(name = "John", values = (v2 = [1.5, 2.5, 3.5], v1 = [1, 2, 3]), age = 27, v3 = [1, 2, 3])

julia> dump(j2)
NamedTuple{(:name, :values, :age, :v3),Tuple{String,NamedTuple{(:v2, :v1),Tuple{Array{Float64,1},Array{Int64,1}}},Int64,Array{Int64,1}}}
  name: String "John"
  values: NamedTuple{(:v2, :v1),Tuple{Array{Float64,1},Array{Int64,1}}}
    v2: Array{Float64}((3,)) [1.5, 2.5, 3.5]
    v1: Array{Int64}((3,)) [1, 2, 3]
  age: Int64 27
  v3: Array{Int64}((3,)) [1, 2, 3]

The beauty of this approach is that Julia knows all the types in j2, so if you pass j2 to any function as an argument, all calculations inside that function will be fast.
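As a small illustration, here is a hand-written NamedTuple with the same shape as j2 and a hypothetical function `summarize` that consumes it. The names are assumptions made for this sketch.

```julia
# Hypothetical NamedTuple with the same shape as j2 above.
j = (name = "John",
     values = (v2 = [1.5, 2.5, 3.5], v1 = [1, 2, 3]),
     age = 27,
     v3 = [1, 2, 3])

# The field types are part of typeof(j), so accesses such as j.values.v2
# are known to be Vector{Float64} when `summarize` is compiled.
summarize(j) = sum(j.values.v2) + sum(j.v3)

println(summarize(j))
println(fieldtype(typeof(j), :v3))
```

Running `@code_warntype summarize(j)` in the REPL is one way to check that no `Any` appears in the inferred types.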

The downside of this approach is that any function taking j2 has to be compiled specifically for the type of j2, which might be problematic if the structure of j2 is huge (as the resulting NamedTuple type is then complex) and the amount of work your function does is relatively small. But for small JSON documents (small in the sense of structure, as the vectors held in them can be large; their size does not add to the complexity) this approach has proven to be efficient in several applications I have developed.

Bogumił Kamiński answered Sep 18 '22 22:09