I wonder if there is a performance difference between a plain Array and JuliaDB or DataFrame for calculations on huge data sets (large, but still fitting in memory)?
I can use plain arrays and algorithms to do sorting, grouping, reducing, etc. So why do I need JuliaDB or DataFrame?
I kinda understand why Python needs Pandas - it translates slow Python into fast C. But why does Julia need JuliaDB or DataFrame - Julia is already fast.
This is a potentially broad topic. Let me highlight the features that are key in my opinion.
What are the benefits of DataFrames.jl or JuliaDB.jl over standard arrays
- They allow you to store columns of data having different types. You can do the same in arrays, but then they have to be arrays of `Any` in general, which will be slower and use more memory than columns with concrete element types.
- You can access columns using names. However, this is a secondary feature - e.g. NamedArrays.jl provides an array-like type with named dimensions.
- The additional benefit is that there is an ecosystem built on the fact that columns have names (e.g. joining two `DataFrame`s or building a GLM model using GLM.jl).
This type of storage (heterogeneous columns with names) is a representation of a table in a relational database.
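To make the points above concrete, here is a minimal sketch using the current DataFrames.jl API (`DataFrame` construction from keyword arguments, column access by name, and `innerjoin`); the variable names and data are illustrative:

```julia
using DataFrames

# Each column keeps a concrete element type (Int, String), unlike a Matrix{Any}.
df = DataFrame(id = [1, 2, 3], name = ["a", "b", "c"])
eltype(df.id)    # a concrete integer type, so iteration compiles to fast code

# Storing the same data in a plain array forces element type Any:
m = Any[1 "a"; 2 "b"; 3 "c"]
eltype(m)        # Any - every element access is dynamically typed

# Column names drive higher-level operations such as joins:
other = DataFrame(id = [1, 2], score = [10.5, 20.0])
joined = innerjoin(df, other, on = :id)  # 2 rows: ids 1 and 2, with name and score
```

Note that the join matches rows purely by the shared column name `:id`, which is exactly the kind of operation that is awkward to express with positional array indexing.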
What is the difference between DataFrames.jl and JuliaDB.jl
- JuliaDB.jl supports distributed parallelism, while normal use of DataFrames.jl assumes that the data fits into memory (you can work around this using `SharedArray`s, but this is not part of the design), and if you want to parallelize computations you have to do it manually;
- JuliaDB.jl supports indexing while DataFrames.jl currently does not;
- Column types of JuliaDB.jl tables are part of the table's type (type stable), while for DataFrames.jl they currently are not. The consequences are:
- when using JuliaDB.jl, each time a data structure with a new combination of column types is created, all functions applied to it have to be recompiled (this can be ignored for a few large data sets, but can have a visible performance impact when working with many small heterogeneous data sets);
- when using DataFrames.jl, you have to use special techniques ensuring type inference to achieve high performance in some situations (most notably barrier functions, as discussed here).
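The barrier-function pattern mentioned above can be sketched as follows; this is an illustrative example (function names `mean_direct`, `mean_barrier` are my own), not code from the linked discussion:

```julia
using DataFrames

df = DataFrame(x = rand(10^6))

# Type-unstable: inside this function the compiler only knows that df.x is
# some AbstractVector, so the accumulation loop cannot be specialized.
function mean_direct(df)
    s = 0.0
    for v in df.x
        s += v
    end
    return s / nrow(df)
end

# Barrier pattern: extract the column once and pass it to an inner function.
# The inner function is then compiled for the concrete Vector{Float64} type.
_mean(v::AbstractVector) = sum(v) / length(v)
mean_barrier(df) = _mean(df.x)
```

Both functions return the same value, but the barrier version lets the hot loop (here hidden inside `sum`) run on a concretely typed vector instead of going through dynamic dispatch on every element.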