How to convert an IndexedTable to a DataFrame in Julia?

Tags:

In a quick explanatory work, IndexedTables seem much faster than DataFrames to work on individual elements (e.g. select or "update"), but DataFrames have a nicer ecosystem of functionalities, e.g. plotting, exporting..

So, at a certain point of the workflow, I would like to convert the IndexedTable to a DataFrame, e.g.

using DataFrames, IndexedTables, IndexedTables.Table

tn = Table(
    Columns(
        param  = String["price","price","price","price","waterContent","waterContent"],
        item   = String["banana","banana","apple","apple","banana", "apple"],
        region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
    ),
    Columns(
       value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
       value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
    )
)

to >>

df_tn = DataFrame(
    param     = String["price","price","price","price","waterContent","waterContent"],
    item      = String["banana","banana","apple","apple","banana", "apple"],
    region    = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
    value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
    value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)

t = Table(
    Columns(
        String["price","price","price","price","waterContent","waterContent"],
        String["banana","banana","apple","apple","banana", "apple"],
        Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
    ),
    Columns(
       Float64[2.8,2.7,1.1,0.8,0.2,0.7],
       Float64[3.2,2.9,1.2,0.8,0.2,0.8],
    )
)

to >>

df_t = DataFrame(
    x1 = String["price","price","price","price","waterContent","waterContent"],
    x2 = String["banana","banana","apple","apple","banana", "apple"],
    x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
    x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
    x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8]
)

I can find the individual "row" values interacting over the table with pair():

for (i,pair) in enumerate(pairs(tn))
    rowValues = []
    for (j,section) in enumerate(pair)
        for item in section
            push!(rowValues,item)
        end
    end
    println(rowValues)
end

I can't however get the columns names and types, and I guess working by column would instead be more efficient.

EDIT : I did manage to get the "column" types with the above code, I just need now to get the column names, if any:

colTypes = Union{Union,DataType}[]

for item in tn.index.columns
  push!(colTypes, eltype(item))
end
for item in tn.data.columns
  push!(colTypes, eltype(item))
end

EDIT2: As requested, this is an example of an IndexedTable that would fail conversion of columns names using (current) Dan Getz answer, as the "index" column(s) are named tuple but the "data" column(s) are normal tuples:

t_named_idx = Table(
    Columns(
        param  = String["price","price","price","price","waterContent","waterContent"],
        item   = String["banana","banana","apple","apple","banana", "apple"],
        region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
    ),
    Columns(
       Float64[2.8,2.7,1.1,0.8,0.2,0.7],
    )
)

The problem seems to be in IndexedTable API, and specifically in columns(t) function, that doesn't distinguish between index and values.

334

asked Oct 20 '17 07:10

Antonello

2 Answers

The following conversion functions:

toDataFrame(cols::Tuple, prefix="x") = 
  DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...)

toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") = 
  DataFrame(;(c => cols[c] for c in fieldnames(cols))...)

toDataFrame(t::IndexedTable) = toDataFrame(columns(t))

give (on Julia 0.6 with tn and t defined as in the question):

julia> tn
param           item      region │ value2000  value2010
─────────────────────────────────┼─────────────────────
"price"         "apple"   "FR"   │ 1.1        1.2
"price"         "apple"   "UK"   │ 0.8        0.8
"price"         "banana"  "FR"   │ 2.8        3.2
"price"         "banana"  "UK"   │ 2.7        2.9
"waterContent"  "apple"   NA     │ 0.7        0.8
"waterContent"  "banana"  NA     │ 0.2        0.2

julia> df_tn = toDataFrame(tn)
6×5 DataFrames.DataFrame
│ Row │ param          │ item     │ region │ value2000 │ value2010 │
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤
│ 1   │ "price"        │ "apple"  │ "FR"   │ 1.1       │ 1.2       │
│ 2   │ "price"        │ "apple"  │ "UK"   │ 0.8       │ 0.8       │
│ 3   │ "price"        │ "banana" │ "FR"   │ 2.8       │ 3.2       │
│ 4   │ "price"        │ "banana" │ "UK"   │ 2.7       │ 2.9       │
│ 5   │ "waterContent" │ "apple"  │ NA     │ 0.7       │ 0.8       │
│ 6   │ "waterContent" │ "banana" │ NA     │ 0.2       │ 0.2       │

Type information is mostly retained:

julia> typeof(df_tn[:,1])
DataArrays.DataArray{String,1}

julia> typeof(df_tn[:,4])
DataArrays.DataArray{Float64,1}

And for unnamed columns:

julia> t
───────────────────────────────┬─────────
"price"         "apple"   "FR" │ 1.1  1.2
"price"         "apple"   "UK" │ 0.8  0.8
"price"         "banana"  "FR" │ 2.8  3.2
"price"         "banana"  "UK" │ 2.7  2.9
"waterContent"  "apple"   NA   │ 0.7  0.8
"waterContent"  "banana"  NA   │ 0.2  0.2

julia> df_t = toDataFrame(t)
6×5 DataFrames.DataFrame
│ Row │ x1             │ x2       │ x3   │ x4  │ x5  │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1   │ "price"        │ "apple"  │ "FR" │ 1.1 │ 1.2 │
│ 2   │ "price"        │ "apple"  │ "UK" │ 0.8 │ 0.8 │
│ 3   │ "price"        │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4   │ "price"        │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5   │ "waterContent" │ "apple"  │ NA   │ 0.7 │ 0.8 │
│ 6   │ "waterContent" │ "banana" │ NA   │ 0.2 │ 0.2 │

EDIT: As noted by @Antonello the case for mixed named and unnamed tuples is not handled correctly. To handle it correctly, we can define:

toDataFrame(t::IndexedTable) = 
  hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t))))

And then, the mixed case gives a result like:

julia> toDataFrame(tn2)
6×5 DataFrames.DataFrame
│ Row │ param          │ item     │ region │ x1  │ x2  │
├─────┼────────────────┼──────────┼────────┼─────┼─────┤
│ 1   │ "price"        │ "apple"  │ "FR"   │ 1.1 │ 1.2 │
│ 2   │ "price"        │ "apple"  │ "UK"   │ 0.8 │ 0.8 │
│ 3   │ "price"        │ "banana" │ "FR"   │ 2.8 │ 3.2 │
│ 4   │ "price"        │ "banana" │ "UK"   │ 2.7 │ 2.9 │
│ 5   │ "waterContent" │ "apple"  │ NA     │ 0.7 │ 0.8 │
│ 6   │ "waterContent" │ "banana" │ NA     │ 0.2 │ 0.2 │

124

answered Oct 04 '22 14:10

Dan Getz

Ugly, quick and dirty "solution" (I hope it is doable in other way):

julia> df = DataFrame(
         permutedims(  # <- structural transpose
           vcat(
             reshape([j for i in keys(t) for j in i], :, length(t)) , 
             reshape([j for i in t       for j in i], :, length(t))
           ), 
           (2,1)
         )
       )
6×5 DataFrames.DataFrame
│ Row │ x1             │ x2       │ x3   │ x4  │ x5  │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1   │ "price"        │ "apple"  │ "FR" │ 1.1 │ 1.2 │
│ 2   │ "price"        │ "apple"  │ "UK" │ 0.8 │ 0.8 │
│ 3   │ "price"        │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4   │ "price"        │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5   │ "waterContent" │ "apple"  │ NA   │ 0.7 │ 0.8 │
│ 6   │ "waterContent" │ "banana" │ NA   │ 0.2 │ 0.2 │

answered Oct 04 '22 16:10

Liso

Related questions
                            
                                Mailgun API: "'from' parameter is not a valid address. please check documentation"
                            
                                Request method 'GET' not supported in Spring Boot Controller
                            
                                Center only one of two elements in a div? [duplicate]
                            
                                No supported authentication methods available (server sent: publickey)
                            
                                How to read the output of show operator back to a Dataset?
                            
                                Inserting encapsuled object in firestore and goods practices
                            
                                Alternative to scipy.cluster.hierarchy.cut_tree()
                            
                                UDP socket via JavaScript in a browser
                            
                                Slow rendering of the most basic layout
                            
                                enzyme shallow rendering just one node in redux components
                            
                                Spyder cannot launch: spyder: command not found
                            
                                How to check if a python class or object is user defined (not a builtin)?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With