Are JuliaDB or DataFrame faster than plain Array?

Question

I wonder whether there is a performance difference between a plain Array and JuliaDB or DataFrame when doing calculations on huge data sets (large, but still fitting in memory).

I can use plain arrays and algorithms to do sorting, grouping, reducing etc. So why do I need JuliaDB or DataFrame?

I kinda understand why Python needs Pandas: it translates slow Python into fast C. But why does Julia need JuliaDB or DataFrame when Julia is already fast?

Bogumił Kamiński · Accepted Answer

This is a potentially broad topic. Let me highlight the features that are key in my opinion.

What are the benefits of DataFrames.jl or JuliaDB.jl over standard arrays

- They allow you to store columns of data with different types. You can do the same with arrays, but then they generally have to be arrays of Any, which are slower and use more memory than columns with concrete types.
- You can access columns by name. However, this is a secondary feature; e.g. NamedArrays.jl provides an array-like type with named dimensions.
An additional benefit is that there is an ecosystem built on the fact that columns have names (e.g. joining two DataFrames or building a GLM model using GLM.jl).

This type of storage (heterogeneous columns with names) is how a table is represented in relational databases.
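To illustrate the point about element types, here is a minimal sketch using only Base Julia. The NamedTuple of vectors is just a stand-in for a column-oriented table, not how DataFrames.jl or JuliaDB.jl store data internally, and the names `rows` and `cols` are made up for the example.

```julia
# Mixed-type data in a plain matrix: the element type widens to Any,
# so every element is stored as a boxed value of unknown type.
rows = ["alice" 30 1.5;
        "bob"   25 2.0]
println(eltype(rows))

# Column-oriented storage: one concretely typed vector per named column.
# (Hypothetical stand-in for a table type, for illustration only.)
cols = (name = ["alice", "bob"], age = [30, 25], score = [1.5, 2.0])
println(eltype(cols.age))
println(eltype(cols.score))

# Working on a concretely typed column lets the compiler emit
# specialized code without per-element type checks.
total_age = sum(cols.age)
println(total_age)
```

The matrix has element type Any, while each column of `cols` keeps its own concrete element type, which is the property the table packages rely on.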

What is the difference between DataFrames.jl and JuliaDB.jl

- JuliaDB.jl supports distributed parallelism. Normal use of DataFrames.jl assumes that the data fits into memory (you can work around this using SharedArray, but this is not part of the design), and if you want to parallelise computations you have to do it manually.
- JuliaDB.jl supports indexing, while DataFrames.jl currently does not.
- Column types in JuliaDB.jl are stable, while in DataFrames.jl they currently are not. The consequences are:
  - when using JuliaDB.jl, each time a new type of data structure is created, all functions applied to this type have to be recompiled (this can be ignored for large data sets, but can have a visible performance impact when working with many small heterogeneous data sets);
  - when using DataFrames.jl, you have to use special techniques that ensure type inference in order to achieve high performance in some situations (most notably barrier functions as discussed here).
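The barrier function technique mentioned above can be sketched with Base Julia alone. The `Dict{Symbol,AbstractVector}` below is only an assumed stand-in for a container whose column types are hidden from the compiler; it is not the DataFrames.jl implementation, and `sum_squares` and `column_sum_squares` are hypothetical names.

```julia
# A container whose column types the compiler cannot infer
# (illustrative stand-in for a table with type-unstable columns).
table = Dict{Symbol,AbstractVector}(
    :x => rand(1_000),
    :n => collect(1:1_000),
)

# Kernel function: Julia compiles a specialized version for each
# concrete vector type it is called with, so the loop runs fast.
function sum_squares(v::AbstractVector)
    s = zero(eltype(v))
    for x in v
        s += x * x
    end
    return s
end

# Outer function: the type of `table[col]` is unknown at compile time,
# but passing it to `sum_squares` costs a single dynamic dispatch,
# after which the type-stable kernel does all the work.
column_sum_squares(table, col) = sum_squares(table[col])

println(column_sum_squares(table, :n))
```

Writing the loop directly inside `column_sum_squares` would instead force the compiler to treat every element as being of unknown type, which is the slowdown the barrier avoids.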

Tags: julia

Alex Craft
