I’m writing Julia code whose inputs are json files, that performs analysis in (the field of mathematical finance) and writes results as json. The code is a port from R in the hope of performance improvement.
I parse the input files using JSON.parsefile
. This returns a Dict in which I observe that all vectors are of type Array{Any,1}
. As it happens, I know that the input file will never contain vectors of mixed type, such as some String
s and some Number
s.
So I wrote the following code, which seems to work well and is “safe” in the sense that if the calls to convert
fail then a vector continues to have type Array{Any,1}
.
function typenarrow!(d::Dict)
for k in keys(d)
if d[k] isa Array{Any,1}
d[k] = typenarrow(d[k])
elseif d[k] isa Dict
typenarrow!(d[k])
end
end
end
function typenarrow(v::Array{Any,1})
for T in [String,Int64,Float64,Bool,Vector{Float64}]
try
return(convert(Vector{T},v))
catch; end
end
return(v)
end
My question is: Is this worth doing? Can I expect code that processes the contents of the Dict
to execute faster if I do this type narrowing? I think the answer is yes in that the Julia performance tips recommend to “Annotate values taken from untyped locations” and this approach ensures there are no “untyped locations”.
There are two levels of the answer to this question:
Level 1
Yes, it will help the performance of the code. See for instance the following benchmark:
julia> using BenchmarkTools
julia> x = Any[1 for i in 1:10^6];
julia> y = [1 for i in 1:10^6];
julia> @btime sum($x)
26.507 ms (477759 allocations: 7.29 MiB)
1000000
julia> @btime sum($y)
226.184 μs (0 allocations: 0 bytes)
1000000
You can write your typenarrow
function using a bit simpler approach like this:
typenarrow(x) = [v for v in x]
as using the comprehension will produce a vector of concrete type (assuming your source vector is homogeneous)
Level 2
This is not fully optimal. The problem that is still left is that you have a Dict
that is a container with abstract type parameter (see https://docs.julialang.org/en/latest/manual/performance-tips/#Avoid-containers-with-abstract-type-parameters-1). Therefore in order for the computations to be fast you have to use a barrier function (see https://docs.julialang.org/en/latest/manual/performance-tips/#kernel-functions-1) or use type annotation for variables you introduce (see https://docs.julialang.org/en/v1/manual/types/index.html#Type-Declarations-1).
In the ideal world your Dict
would have keys and values of homogeneous types and all would be maximally fast then, but if I understand your code correctly values in your case are not homogeneous.
EDIT
In order to solve the Level 2 isuue you can convert Dict
into NamedTuple
like this (this is a minimal example assuming that Dict
s only nest in Dict
s directly, but it should be easy enough to extend if you want more flexibility).
First, the function performing the conversion looks like:
function typenarrow!(d::Dict)
for k in keys(d)
if d[k] isa Array{Any,1}
d[k] = [v for v in d[k]]
elseif d[k] isa Dict
d[k] = typenarrow!(d[k])
end
end
NamedTuple{Tuple(Symbol.(keys(d)))}(values(d))
end
Now a MWE of its use:
julia> using JSON
julia> x = """
{
"name": "John",
"age": 27,
"values": {
"v1": [1,2,3],
"v2": [1.5,2.5,3.5]
},
"v3": [1,2,3]
}
""";
julia> j1 = JSON.parse(x)
Dict{String,Any} with 4 entries:
"name" => "John"
"values" => Dict{String,Any}("v2"=>Any[1.5, 2.5, 3.5],"v1"=>Any[1, 2, 3])
"age" => 27
"v3" => Any[1, 2, 3]
julia> j2 = typenarrow!(j1)
(name = "John", values = (v2 = [1.5, 2.5, 3.5], v1 = [1, 2, 3]), age = 27, v3 = [1, 2, 3])
julia> dump(j2)
NamedTuple{(:name, :values, :age, :v3),Tuple{String,NamedTuple{(:v2, :v1),Tuple{Array{Float64,1},Array{Int64,1}}},Int64,Array{Int64,1}}}
name: String "John"
values: NamedTuple{(:v2, :v1),Tuple{Array{Float64,1},Array{Int64,1}}}
v2: Array{Float64}((3,)) [1.5, 2.5, 3.5]
v1: Array{Int64}((3,)) [1, 2, 3]
age: Int64 27
v3: Array{Int64}((3,)) [1, 2, 3]
The beauty of this approach is that Julia will know all types in j2
, so if you pass j2
to any function as a parameter all calculations inside this function will be fast.
The downside of this approach is that a function taking j2
has to be pre-compiled, which might be problematic if j2
structure is huge (as then the structure of resulting NamedTuple
is complex) and the amount of work your function does is relatively small. But for small JSON-s (small in the sense of structure, as vectors held in them can be large - their size does not add to the complexity) this approach has proven to be efficient in several applications I have developed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With