Read CSV files faster in Julia

I have noticed that loading a CSV file with CSV.read is quite slow. For reference, here is one timing example:

using CSV, DataFrames
file = download("https://github.com/foursquare/twofishes")
@time CSV.read(file, DataFrame)

Output: 
9.450861 seconds (22.77 M allocations: 960.541 MiB, 5.48% gc time)
297 rows × 2 columns

This is a random dataset, and the Python equivalent of this operation finishes in a fraction of the time Julia takes. Since Julia is faster than Python, why does this operation take so long? Moreover, is there a faster alternative that reduces the compile time?

asked Jan 11 '21 01:01

Mohammad Saad

1 Answer

You are measuring compile time together with run time.

One correct way to measure the time would be:

@time CSV.read(file, DataFrame)  # first call: the timing includes compilation
@time CSV.read(file, DataFrame)  # second call: reuses the compiled code

On the first run the function is compiled; on the second run the compiled code is reused, so the second timing reflects the actual run time.
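The same effect can be seen without any packages. The following is a minimal sketch, assuming a small stand-in function (`sum_of_squares` is a hypothetical name, not part of CSV.jl); it records the elapsed time of the first and second call with `@elapsed` from Base:

```julia
# Stand-in function for illustration; any function not yet called would do.
sum_of_squares(xs) = sum(x -> x^2, xs)

data = rand(1_000)

# First call: the elapsed time includes JIT compilation for Vector{Float64}.
first_call = @elapsed sum_of_squares(data)

# Second call with the same argument types: the compiled code is reused.
second_call = @elapsed sum_of_squares(data)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

On a typical machine the first number should be noticeably larger than the second, although the exact values depend on the Julia version and hardware.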

Another option is using BenchmarkTools:

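using BenchmarkTools
# Interpolate the global `file` with $ so that global-variable access is not part of the measurement
@btime CSV.read($file, DataFrame)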

Normally, one uses Julia to work with huge datasets, so the one-time compile cost is not important. However, it is possible to compile CSV and DataFrames into Julia's system image and get fast execution from the first run; for instructions see: Why julia takes long time to import a package? (This is more advanced, and usually one does not need it.)
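As a rough illustration of the system image route, here is a sketch using the PackageCompiler package. It assumes that PackageCompiler, CSV and DataFrames are all installed in the active environment; the output file name `sys_csv.so` is a hypothetical choice, and the build can take several minutes:

```julia
# Sketch only: assumes PackageCompiler, CSV and DataFrames are installed
# in the active environment. "sys_csv.so" is a hypothetical file name.
using PackageCompiler

# Build a custom system image with both packages baked in.
create_sysimage([:CSV, :DataFrames]; sysimage_path = "sys_csv.so")
```

Julia can then be started with `julia --sysimage sys_csv.so my_code.jl`. This mainly shortens package loading. To also cut the first-call compilation of `CSV.read`, `create_sysimage` accepts a `precompile_execution_file` keyword pointing to a script that exercises the calls you care about.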

Yet another option is to reduce the compiler's optimization level. This suits scenarios where the workload is small and restarted frequently, and you do not want the complexity that comes with building a system image. In this case you would run Julia as:

julia --optimize=0 my_code.jl

Finally, as mentioned by @Oscar Smith, compile times will be slightly shorter in the forthcoming Julia 1.6.

answered Oct 24 '22 05:10

Przemyslaw Szufel

Tags: performance, time, csv, benchmarking, julia
