dataframes
in pandas are indexed in one or more numerical and/or string columns. Particularly, after a groupby
operation, the output is a dataframe where the new index is given by the groups.
Similarly, julia dataframes always have a column named Row
which I think is equivalent to the index in pandas. However, after groupby operations, julia dataframes don't use the groups as the new index. Here is a working example:
using RDatasets;
using DataFrames;
using StatsBase;
df = dataset("Ecdat","Cigarette");
gdf = groupby(df, "Year");
combine(gdf, "Income" => mean)
Output:
11×2 DataFrame
│ Row │ Year │ Income_mean │
│ │ Int32 │ Float64 │
├─────┼───────┼─────────────┤
│ 1 │ 1985 │ 7.20845e7 │
│ 2 │ 1986 │ 7.61923e7 │
│ 3 │ 1987 │ 8.13253e7 │
│ 4 │ 1988 │ 8.77016e7 │
│ 5 │ 1989 │ 9.44374e7 │
│ 6 │ 1990 │ 1.00666e8 │
│ 7 │ 1991 │ 1.04361e8 │
│ 8 │ 1992 │ 1.10775e8 │
│ 9 │ 1993 │ 1.1534e8 │
│ 10 │ 1994 │ 1.21145e8 │
│ 11 │ 1995 │ 1.27673e8 │
Even if the creation of the new index isn't done automatically, I wonder if there is a way to manually set a chosen column as index. I discover the method setindex!
reading the documentation. However, I wasn't able to use this method. I tried:
#create new df
income = combine(gdf, "Income" => mean)
#set index
setindex!(income, "Year")
which gives the error:
ERROR: LoadError: MethodError: no method matching setindex!(::DataFrame, ::String)
I think that I have misused the command. What am I doing wrong here? Is it possible to manually set an index in a julia dataframe using one or more chosen columns?
To set a column as index for a DataFrame, use DataFrame. set_index() function, with the column name passed as argument. You can also setup MultiIndex with multiple columns in the index. In this case, pass the array of column names required for index, to set_index() method.
iloc selects rows based on an integer index. So, if you want to select the 5th row in a DataFrame, you would use df. iloc[[4]] since the first row is at index 0, the second row is at index 1, and so on.
Index is like an address, that's how any data point across the dataframe or series can be accessed. Rows and columns both have indexes, rows indices are called as index and for columns its general column names.
DataFrames.jl does not currently allow specifying an index for a data frame. The Row
column is just there for printing---it's not actually part of the data frame.
However, DataFrames.jl provides all the usual table operations, such as joins, transformations, filters, aggregations, and pivots. Support for these operations does not require having a table index. A table index is a structure used by databases (and by Pandas) to speed up certain table operations, at the cost of additional memory usage and the cost of creating the index.
The setindex!
function you discovered is actually a method from Base Julia that is used to customize the indexing behavior for custom types. For example, x[1] = 42
is equivalent to setindex!(x, 42, 1)
. Overloading this method allows you to customize the indexing behavior for types that you create.
The docstrings for Base.setindex!
can be found here and here.
If you really need a table with an index, you could try IndexedTables.jl.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With