
Julia: create an empty DataFrame and append rows to it

I am trying out the Julia DataFrames module. I am interested in it so I can use it to plot simple simulations in Gadfly. I want to be able to iteratively add rows to the dataframe and I want to initialize it as empty.

The tutorials and documentation on how to do this are sparse (most of the documentation describes how to analyse imported data).

To append to a nonempty dataframe is straightforward:

df = DataFrame(A = [1, 2], B = [4, 5])
push!(df, [3 6])

This returns:

3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1   | 1 | 4 |
| 2   | 2 | 5 |
| 3   | 3 | 6 |

But when I initialize the DataFrame as empty, I get errors:

df = DataFrame(A = [], B = [])
push!(df, [3, 6])

Error message:

ArgumentError("Error adding 3 to column :A. Possible type mis-match.")
while loading In[220], in expression starting on line 2

What is the best way to initialize an empty Julia DataFrame such that you can iteratively add items to it later in a for loop?

cantdutchthis asked Oct 05 '14 08:10


3 Answers

A zero-length array defined using only [] lacks sufficient type information.

julia> typeof([])
Array{None,1}

To avoid that problem, simply specify the element type.

julia> typeof(Int64[])
Array{Int64,1}

You can apply that to your DataFrame problem:

julia> df = DataFrame(A = Int64[], B = Int64[])
0x2 DataFrame

julia> push!(df, [3 6])

julia> df
1x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1   | 3 | 6 |
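
Applied to the loop from the question, a minimal sketch might look like the following. The column names `step` and `value` and the toy computation are placeholders invented for illustration. Each row is pushed as a NamedTuple so that values are matched to columns by name.

```julia
using DataFrames

# Start with zero rows but concrete column types.
df = DataFrame(step = Int64[], value = Float64[])

# Hypothetical simulation loop: one row is appended per iteration.
for step in 1:5
    value = step^2 / 2                       # stand-in for a real simulation result
    push!(df, (step = step, value = value))  # NamedTuple fields are matched by column name
end
```

Because the columns were created with concrete element types, every pushed value is checked against them, which is what the untyped `[]` columns could not offer.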
waTeim answered Oct 11 '22 13:10


using Pkg, CSV, DataFrames

iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"))

new_iris = similar(iris, nrow(iris))

head(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1   │ missing     │ missing    │ missing     │ missing    │ missing │
# │ 2   │ missing     │ missing    │ missing     │ missing    │ missing │

for (i, row) in enumerate(eachrow(iris))
    new_iris[i, :] = row[:]
end

head(new_iris, 2)

# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa  │
# │ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │
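
The same preallocate-then-fill pattern can be tried without the CSV file. This is only a sketch, assuming a recent DataFrames.jl. The small `src` table is made up for illustration, and the contents of the preallocated copy should be treated as unspecified until each row has been assigned.

```julia
using DataFrames

# Made-up source table standing in for the iris data.
src = DataFrame(A = [1, 2, 3], B = [4.0, 5.0, 6.0])

# Same column names and element types as `src`, with room for nrow(src) rows.
dst = similar(src, nrow(src))

# Fill the preallocated rows one at a time.
for (i, row) in enumerate(eachrow(src))
    dst[i, :] = row
end
```

Preallocating only helps when the final number of rows is known in advance; otherwise pushing rows onto a typed empty DataFrame is the simpler route.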
The Unfun Cat answered Oct 11 '22 13:10


The answer from @waTeim already answers the initial question. But what if I want to create an empty DataFrame dynamically and append rows to it, for example when I don't want hard-coded column names?

In this case, df = DataFrame(A = Int64[], B = Int64[]) is not sufficient. The NamedTuple (A = Int64[], B = Int64[]) needs to be created dynamically.

Let's assume we have a vector of column names col_names and a vector of column types col_types from which to create an empty DataFrame.

col_names = [:A, :B] # needs to be a vector of Symbols
col_types = [Int64, Float64]
# Create a NamedTuple (A=Int64[], ....) by doing
named_tuple = (; zip(col_names, (type[] for type in col_types))...)

df = DataFrame(named_tuple) # 0×2 DataFrame

Alternatively, the NamedTuple could be created with

named_tuple = NamedTuple{Tuple(col_names)}(type[] for type in col_types)
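
To see the dynamic construction and the appending step together, here is a small sketch. The helper name `empty_dataframe` is hypothetical (it is not part of DataFrames.jl), and the pushed values are arbitrary examples.

```julia
using DataFrames

# Hypothetical helper: build a zero-row DataFrame from a vector of column
# names and a matching vector of element types.
function empty_dataframe(col_names::Vector{Symbol}, col_types::Vector{DataType})
    columns = (T[] for T in col_types)   # one empty, typed vector per column
    return DataFrame(NamedTuple{Tuple(col_names)}(columns))
end

df = empty_dataframe([:A, :B], [Int64, Float64])

# Rows can then be appended exactly as with hard-coded column names.
push!(df, (A = 1, B = 2.5))
push!(df, (A = 2, B = 3.5))
```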
wueli answered Oct 11 '22 15:10