 

Reading CSV in Julia is slow compared to Python

Tags:

julia

Reading large text/CSV files in Julia takes a long time compared to Python. Here are the times to read a file that is 486.6 MB and has 153,895 rows and 644 columns.

Python 3.3 example

    import pandas as pd
    import time

    start = time.time()
    myData = pd.read_csv("C:\\myFile.txt", sep="|", header=None, low_memory=False)
    print(time.time() - start)

Output: 19.90

R 3.0.2 example

    system.time(myData <- read.delim("C:/myFile.txt", sep="|", header=F,
        stringsAsFactors=F, na.strings=""))

Output:

       User  System Elapsed
     181.13    1.07  182.32

Julia 0.2.0 (Julia Studio 0.4.4) example # 1

    using DataFrames
    timing = @time myData = readtable("C:/myFile.txt", separator='|', header=false)

Output: elapsed time: 80.35 seconds (10319624244 bytes allocated)

Julia 0.2.0 (Julia Studio 0.4.4) example # 2

    timing = @time myData = readdlm("C:/myFile.txt", '|', header=false)

Output: elapsed time: 65.96 seconds (9087413564 bytes allocated)
  1. Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?

  2. A separate issue is that the in-memory size is 18× the on-disk file size in Julia, but only 2.5× in Python. In MATLAB, which I have found to be the most memory-efficient for large files, it is 2× the on-disk size. Is there a particular reason for the large in-memory size in Julia?

asked Feb 19 '14 by uday




2 Answers

The best answer is probably that I'm not as good a programmer as Wes.

In general, the code in DataFrames is much less well-optimized than the code in Pandas. I'm confident that we can catch up, but it will take some time as there's a lot of basic functionality that we need to implement first. Since there's so much that needs to be built in Julia, I tend to focus on doing things in three parts: (1) build any version, (2) build a correct version, (3) build a fast, correct version. For the work I do, Julia often doesn't offer any versions of essential functionality, so my work gets focused on (1) and (2). As more of the tools I need get built, it'll be easier to focus on performance.

As for memory usage, I think the answer is that we use a set of data structures when parsing tabular data that's much less efficient than those used by Pandas. If I knew the internals of Pandas better, I could list off places where we're less efficient, but for now I'll just speculate that one obvious failing is that we're reading the whole dataset into memory rather than grabbing chunks from disk. This certainly can be avoided and there are issues open for doing so. It's just a matter of time.
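As a rough illustration of the storage point (a sketch in a modern Julia, where `readdlm` lives in the `DelimitedFiles` standard library rather than in Base as in 0.2): asking the parser for a concrete element type lets it fill a `Matrix{Float64}` instead of potentially boxing cells into a `Matrix{Any}`, which reduces both memory footprint and parse time on numeric data.

```julia
using DelimitedFiles

# Build a small pipe-delimited file so the example is self-contained.
path = tempname()
open(path, "w") do io
    write(io, "1.0|2.0|3.0\n4.0|5.0|6.0\n")
end

# Untyped read: the element type is inferred, and mixed-type files
# fall back to a Matrix{Any} of boxed values.
untyped = readdlm(path, '|')

# Typed read: a concrete Matrix{Float64}, much cheaper to store.
typed = readdlm(path, '|', Float64)

println(eltype(typed))   # Float64
rm(path)
```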

On that note, the readtable code is fairly easy to read. The most certain way to get readtable to be faster is to whip out the Julia profiler and start fixing the performance flaws it uncovers.
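For reference, the profiling workflow looks roughly like this in a modern Julia, where the sampling profiler lives in the `Profile` standard library (the details differ slightly from 0.2-era Julia):

```julia
using Profile, DelimitedFiles

# A throwaway file large enough for the profiler to collect samples.
path = tempname()
open(path, "w") do io
    for _ in 1:10_000
        write(io, "1.0|2.0|3.0\n")
    end
end

data = readdlm(path, '|')     # first call also compiles; keep it out of the profile
Profile.clear()
@profile readdlm(path, '|')   # sample while parsing
Profile.print(maxdepth = 10)  # tree of where the time went
rm(path)
```

The hot entries in the printed tree are the natural places to start optimizing.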

answered Sep 24 '22 by John Myles White


There is a relatively new Julia package called CSV.jl, by Jacob Quinn, that provides a much faster CSV parser, in many cases on par with pandas: https://github.com/JuliaData/CSV.jl
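A minimal sketch of using it, assuming the CSV and DataFrames packages have been installed via the package manager (the keyword names below follow the current CSV.jl API, which differs from the package's early releases):

```julia
using CSV, DataFrames

# Small pipe-delimited file, mirroring the question's format.
path = tempname()
open(path, "w") do io
    write(io, "1|2|3\n4|5|6\n")
end

# header=false auto-generates column names (Column1, Column2, ...).
df = CSV.read(path, DataFrame; delim = '|', header = false)

println(size(df))   # (2, 3)
rm(path)
```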

answered Sep 26 '22 by Jeff Bezanson