I have noticed that loading a CSV file using CSV.read
is quite slow.
For reference, I am attaching one example of time benchmark:
using CSV, DataFrames
file = download("https://github.com/foursquare/twofishes")
@time CSV.read(file, DataFrame)
Output:
9.450861 seconds (22.77 M allocations: 960.541 MiB, 5.48% gc time)
297 rows × 2 columns
This is a random dataset, and a Python equivalent of this operation completes in a fraction of the time compared to Julia. Since Julia is supposed to be faster than Python, why does this operation take so long? Moreover, is there a faster alternative, or a way to reduce the compile time?
You are measuring compilation time together with runtime.
One correct way to measure the time would be:
@time CSV.read(file, DataFrame)
@time CSV.read(file, DataFrame)
On the first run the function compiles; on the second run you measure only the execution time.
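To make the warm-up effect concrete, here is a minimal self-contained sketch. It writes a tiny CSV to a temporary file (instead of downloading one, so the example runs anywhere) and times two consecutive calls with `@elapsed`:

```julia
using CSV, DataFrames

# Write a small CSV to a temporary file so the example is self-contained.
path = tempname() * ".csv"
write(path, "a,b\n1,2\n3,4\n")

t_first  = @elapsed CSV.read(path, DataFrame)  # includes JIT compilation
t_second = @elapsed CSV.read(path, DataFrame)  # reuses the compiled code
println("first run: $(t_first)s, second run: $(t_second)s")
```

The second timing reflects the actual parsing cost; the difference between the two is dominated by compilation.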
Another option is to use BenchmarkTools.jl, which runs the expression many times and reports the minimum, so compilation overhead is excluded:
using BenchmarkTools
@btime CSV.read($file, DataFrame)
(Interpolating the global variable with `$` gives more accurate timings, since `@btime` otherwise benchmarks access to an untyped global.)
Normally one uses Julia to work with huge datasets or long-running computations, so a single initial compile time is not important. However, it is possible to compile CSV and DataFrames into Julia's system image and get fast execution from the first run; for instructions see here: Why julia takes long time to import a package? (This is more advanced; usually one does not need it.)
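One way to build such a system image is with PackageCompiler.jl. A minimal sketch, assuming PackageCompiler is already installed (`] add PackageCompiler`) and using an arbitrary output file name:

```julia
using PackageCompiler

# Bake CSV and DataFrames (with their compiled code) into a custom
# system image, so they load fast and run fast from the first call.
create_sysimage([:CSV, :DataFrames]; sysimage_path="csv_sysimage.so")
```

You would then start Julia with `julia --sysimage csv_sysimage.so` to use it. Note that the image must be rebuilt whenever you update these packages.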
You also have yet another option: reducing the compiler's optimization level. This is for scenarios where your workload is small and restarted frequently, and you do not want the complexity that comes with building a system image. In this case you would run Julia as:
julia --optimize=0 my_code.jl
Finally, as mentioned by @Oscar Smith, in the forthcoming Julia 1.6 the compile times will be slightly shorter.