Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to convert an IndexedTable to a DataFrame in Julia?

Tags:

In a quick explanatory work, IndexedTables seem much faster than DataFrames to work on individual elements (e.g. select or "update"), but DataFrames have a nicer ecosystem of functionalities, e.g. plotting, exporting..

So, at a certain point of the workflow, I would like to convert the IndexedTable to a DataFrame, e.g.

using DataFrames, IndexedTables, IndexedTables.Table

tn = Table(
    Columns(
        param  = String["price","price","price","price","waterContent","waterContent"],
        item   = String["banana","banana","apple","apple","banana", "apple"],
        region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
    ),
    Columns(
       value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
       value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
    )
)

to >>

df_tn = DataFrame(
    param     = String["price","price","price","price","waterContent","waterContent"],
    item      = String["banana","banana","apple","apple","banana", "apple"],
    region    = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
    value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
    value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)

or

t = Table(
    Columns(
        String["price","price","price","price","waterContent","waterContent"],
        String["banana","banana","apple","apple","banana", "apple"],
        Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
    ),
    Columns(
       Float64[2.8,2.7,1.1,0.8,0.2,0.7],
       Float64[3.2,2.9,1.2,0.8,0.2,0.8],
    )
)

to >>

df_t = DataFrame(
    x1 = String["price","price","price","price","waterContent","waterContent"],
    x2 = String["banana","banana","apple","apple","banana", "apple"],
    x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
    x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
    x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8]
)

I can find the individual "row" values interacting over the table with pair():

for (i,pair) in enumerate(pairs(tn))
    rowValues = []
    for (j,section) in enumerate(pair)
        for item in section
            push!(rowValues,item)
        end
    end
    println(rowValues)
end

I can't however get the columns names and types, and I guess working by column would instead be more efficient.

EDIT : I did manage to get the "column" types with the above code, I just need now to get the column names, if any:

colTypes = Union{Union,DataType}[]

for item in tn.index.columns
  push!(colTypes, eltype(item))
end
for item in tn.data.columns
  push!(colTypes, eltype(item))
end

EDIT2: As requested, this is an example of an IndexedTable that would fail conversion of columns names using (current) Dan Getz answer, as the "index" column(s) are named tuple but the "data" column(s) are normal tuples:

t_named_idx = Table(
    Columns(
        param  = String["price","price","price","price","waterContent","waterContent"],
        item   = String["banana","banana","apple","apple","banana", "apple"],
        region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
    ),
    Columns(
       Float64[2.8,2.7,1.1,0.8,0.2,0.7],
    )
)

The problem seems to be in IndexedTable API, and specifically in columns(t) function, that doesn't distinguish between index and values.

like image 334
Antonello Avatar asked Oct 20 '17 07:10

Antonello


People also ask

How do you create a DataFrame in Julia?

Steps to Create a DataFrame in Julia from Scratch You can then use the following template to create a DataFrame in Julia: using DataFrames df = DataFrame(column_1 = ["value_1", "value_2", "value_3", ...], column_2 = ["value_1", "value_2", "value_3", ...], column_3 = ["value_1", "value_2", "value_3", ...], ... )

Does Julia have DataFrames?

Julia has a library to handle tabular data, in a way similar to R or Pandas dataframes. The name is, no surprises, DataFrames. The approach and the function names are similar, although the way of actually accessing the API may be a bit different. For complex analysis, DataFramesMeta adds some helper macros.

What is a DataFrame in Julia?

DataFrame is a 2 dimensional mutable data structure, that is used for handling tabular data. Unlike Arrays and Matrices, a DataFrame can hold columns of different data types. The DataFrames package in Julia provides the DataFrame object which is used to hold and manipulate tabular data in a flexible and convenient way.


2 Answers

The following conversion functions:

toDataFrame(cols::Tuple, prefix="x") = 
  DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...)

toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") = 
  DataFrame(;(c => cols[c] for c in fieldnames(cols))...)

toDataFrame(t::IndexedTable) = toDataFrame(columns(t))

give (on Julia 0.6 with tn and t defined as in the question):

julia> tn
param           item      region │ value2000  value2010
─────────────────────────────────┼─────────────────────
"price"         "apple"   "FR"   │ 1.1        1.2
"price"         "apple"   "UK"   │ 0.8        0.8
"price"         "banana"  "FR"   │ 2.8        3.2
"price"         "banana"  "UK"   │ 2.7        2.9
"waterContent"  "apple"   NA     │ 0.7        0.8
"waterContent"  "banana"  NA     │ 0.2        0.2

julia> df_tn = toDataFrame(tn)
6×5 DataFrames.DataFrame
│ Row │ param          │ item     │ region │ value2000 │ value2010 │
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤
│ 1   │ "price"        │ "apple"  │ "FR"   │ 1.1       │ 1.2       │
│ 2   │ "price"        │ "apple"  │ "UK"   │ 0.8       │ 0.8       │
│ 3   │ "price"        │ "banana" │ "FR"   │ 2.8       │ 3.2       │
│ 4   │ "price"        │ "banana" │ "UK"   │ 2.7       │ 2.9       │
│ 5   │ "waterContent" │ "apple"  │ NA     │ 0.7       │ 0.8       │
│ 6   │ "waterContent" │ "banana" │ NA     │ 0.2       │ 0.2       │

Type information is mostly retained:

julia> typeof(df_tn[:,1])
DataArrays.DataArray{String,1}

julia> typeof(df_tn[:,4])
DataArrays.DataArray{Float64,1}

And for unnamed columns:

julia> t
───────────────────────────────┬─────────
"price"         "apple"   "FR" │ 1.1  1.2
"price"         "apple"   "UK" │ 0.8  0.8
"price"         "banana"  "FR" │ 2.8  3.2
"price"         "banana"  "UK" │ 2.7  2.9
"waterContent"  "apple"   NA   │ 0.7  0.8
"waterContent"  "banana"  NA   │ 0.2  0.2

julia> df_t = toDataFrame(t)
6×5 DataFrames.DataFrame
│ Row │ x1             │ x2       │ x3   │ x4  │ x5  │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1   │ "price"        │ "apple"  │ "FR" │ 1.1 │ 1.2 │
│ 2   │ "price"        │ "apple"  │ "UK" │ 0.8 │ 0.8 │
│ 3   │ "price"        │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4   │ "price"        │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5   │ "waterContent" │ "apple"  │ NA   │ 0.7 │ 0.8 │
│ 6   │ "waterContent" │ "banana" │ NA   │ 0.2 │ 0.2 │

EDIT: As noted by @Antonello the case for mixed named and unnamed tuples is not handled correctly. To handle it correctly, we can define:

toDataFrame(t::IndexedTable) = 
  hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t))))

And then, the mixed case gives a result like:

julia> toDataFrame(tn2)
6×5 DataFrames.DataFrame
│ Row │ param          │ item     │ region │ x1  │ x2  │
├─────┼────────────────┼──────────┼────────┼─────┼─────┤
│ 1   │ "price"        │ "apple"  │ "FR"   │ 1.1 │ 1.2 │
│ 2   │ "price"        │ "apple"  │ "UK"   │ 0.8 │ 0.8 │
│ 3   │ "price"        │ "banana" │ "FR"   │ 2.8 │ 3.2 │
│ 4   │ "price"        │ "banana" │ "UK"   │ 2.7 │ 2.9 │
│ 5   │ "waterContent" │ "apple"  │ NA     │ 0.7 │ 0.8 │
│ 6   │ "waterContent" │ "banana" │ NA     │ 0.2 │ 0.2 │
like image 124
Dan Getz Avatar answered Oct 04 '22 14:10

Dan Getz


Ugly, quick and dirty "solution" (I hope it is doable in other way):

julia> df = DataFrame(
         permutedims(  # <- structural transpose
           vcat(
             reshape([j for i in keys(t) for j in i], :, length(t)) , 
             reshape([j for i in t       for j in i], :, length(t))
           ), 
           (2,1)
         )
       )
6×5 DataFrames.DataFrame
│ Row │ x1             │ x2       │ x3   │ x4  │ x5  │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1   │ "price"        │ "apple"  │ "FR" │ 1.1 │ 1.2 │
│ 2   │ "price"        │ "apple"  │ "UK" │ 0.8 │ 0.8 │
│ 3   │ "price"        │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4   │ "price"        │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5   │ "waterContent" │ "apple"  │ NA   │ 0.7 │ 0.8 │
│ 6   │ "waterContent" │ "banana" │ NA   │ 0.2 │ 0.2 │
like image 43
Liso Avatar answered Oct 04 '22 16:10

Liso