I was following a brief tutorial on LinkedIn regarding multiindexed pandas dataframes where I was unable to reproduce a seemingly very basic operation (at 3:00). You DO NOT have to watch the video to grasp the problem.
The following snippet that uses a dataset from seaborn will show that I'm unable to add a column to a multiindexed pandas dataframe using the technique shown in the video, and also described in an SO post here.
Here we go:
import pandas as pd
import seaborn as sns
flights = sns.load_dataset('flights')
flights.head()
flights_indexed = flights.set_index(['year', 'month'])
flights_unstack = flights_indexed.unstack()
print(flights_unstack)
Output:
passengers
month January February March April May June July August September October November December
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 305 336
1957 315 301 356 348 355 422 465 467 404 347 310 337
1958 340 318 362 348 363 435 491 505 404 359 362 405
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
And now I'd like to append a column that shows the total over all months for each year, using
flights_unstack.sum(axis = 1)
Output:
year
1949 1520
1950 1676
1951 2042
1952 2364
1953 2700
1954 2867
1955 3408
1956 3939
1957 4421
1958 4572
1959 5140
1960 5714
The two sources mentioned above demonstrate this by using something as simple as:
flights_unstack['passengers', 'total'] = flights_unstack.sum(axis = 1)
Here, 'total' should appear as a new column under the existing indexes. But I'm getting this:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
I'm using Python 3, and so is the author in the video from 2015.
What's going on here?
I've made a bunch of attempts using only the values from the series above, as well as reshaping, transposing, merging and joining the data both as pd.Series and pd.DataFrame, and resetting the indexes. I may have overlooked some important detail, and now I'm hoping for a suggestion from some of you.
EDIT 1 - Here's an attempt after the first suggestion from jezrael:
import pandas as pd
import seaborn as sns
flights = sns.load_dataset('flights')
flights.head()
flights_indexed = flights.set_index(['year', 'month'])
flights_unstack = flights_indexed['passengers'].unstack()
flights_unstack['total'] = flights_unstack.sum(axis = 1)
Output:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
Change:
flights_unstack = flights_indexed.unstack()
to:
flights_unstack = flights_indexed['passengers'].unstack()
to remove the MultiIndex in the columns.
Finally, the new column name has to be added to the column categories with add_categories:
flights_unstack.columns = flights_unstack.columns.add_categories(['total'])
flights_unstack['total'] = flights_unstack.sum(axis = 1)
print (df)
January February March April May June July August September \
month
1949 112 118 132 129 121 135 148 148 136
1950 115 126 141 135 125 149 170 170 158
1951 145 150 178 163 172 178 199 199 184
1952 171 180 193 181 183 218 230 242 209
1953 196 196 236 235 229 243 264 272 237
1954 204 188 235 227 234 264 302 293 259
1955 242 233 267 269 270 315 364 347 312
1956 284 277 317 313 318 374 413 405 355
1957 315 301 356 348 355 422 465 467 404
1958 340 318 362 348 363 435 491 505 404
1959 360 342 406 396 420 472 548 559 463
1960 417 391 419 461 472 535 622 606 508
October November December total
month
1949 119 104 118 1520
1950 133 114 140 1676
1951 162 146 166 2042
1952 191 172 194 2364
1953 211 180 201 2700
1954 229 203 229 2867
1955 274 237 278 3408
1956 306 305 336 4003
1957 347 310 337 4427
1958 359 362 405 4692
1959 407 362 405 5140
1960 461 390 432 5714
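To try the fix without downloading the seaborn dataset, here is a minimal sketch on a hand-built frame. It assumes that the month column is categorical, as it is after sns.load_dataset('flights'); the three-month frame and the variable totals are made up for illustration, with values taken from the 1949 and 1950 rows above.

```python
import pandas as pd

# Hand-built stand-in for the seaborn data: first three months of 1949 and 1950.
flights = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": pd.Categorical(
        ["January", "February", "March"] * 2,
        categories=["January", "February", "March"],
    ),
    "passengers": [112, 118, 132, 115, 126, 141],
})

flights_unstack = flights.set_index(["year", "month"])["passengers"].unstack()

# Compute the totals first, so the new column cannot end up in the sum.
totals = flights_unstack.sum(axis=1)

# Register 'total' as a category only if the columns really are categorical.
if isinstance(flights_unstack.columns, pd.CategoricalIndex):
    flights_unstack.columns = flights_unstack.columns.add_categories(["total"])

flights_unstack["total"] = totals
print(flights_unstack)
```

The isinstance check is only there so the sketch also runs if the column labels turn out not to be categorical in a given pandas version.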
Setup:
import pandas as pd
temp=u"""month;January;February;March;April;May;June;July;August;September;October;November;December
1949;112;118;132;129;121;135;148;148;136;119;104;118
1950;115;126;141;135;125;149;170;170;158;133;114;140
1951;145;150;178;163;172;178;199;199;184;162;146;166
1952;171;180;193;181;183;218;230;242;209;191;172;194
1953;196;196;236;235;229;243;264;272;237;211;180;201
1954;204;188;235;227;234;264;302;293;259;229;203;229
1955;242;233;267;269;270;315;364;347;312;274;237;278
1956;284;277;317;313;318;374;413;405;355;306;305;336
1957;315;301;356;348;355;422;465;467;404;347;310;337
1958;340;318;362;348;363;435;491;505;404;359;362;405
1959;360;342;406;396;420;472;548;559;463;407;362;405
1960;417;391;419;461;472;535;622;606;508;461;390;432"""
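# after testing, replace 'StringIO(temp)' with 'filename.csv'
# (pd.compat.StringIO is not available in recent pandas versions; io.StringIO works instead)
from io import StringIO
df = pd.read_csv(StringIO(temp), sep=";", index_col=[0])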
print (df)
df.columns = pd.CategoricalIndex(df.columns)
df.columns = df.columns.add_categories(['total'])
df['total'] = df.sum(axis = 1)
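If the two-level column layout from the question should be kept, so that the new column appears as ('passengers', 'total'), one possible approach is to rebuild the column MultiIndex from plain tuples, which drops the categorical level. This is a sketch under that assumption and not part of the answer above; the small frame and the name wide are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for the seaborn data, with a categorical month column.
flights = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": pd.Categorical(
        ["January", "February", "March"] * 2,
        categories=["January", "February", "March"],
    ),
    "passengers": [112, 118, 132, 115, 126, 141],
})

# Unstacking the whole frame gives two-level columns: ('passengers', <month>).
wide = flights.set_index(["year", "month"]).unstack()

# Rebuild the columns from plain tuples so no level is categorical any more.
wide.columns = pd.MultiIndex.from_tuples(list(wide.columns), names=wide.columns.names)

# Now a new label can be added under 'passengers'.
wide["passengers", "total"] = wide.sum(axis=1)
print(wide)
```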
I know this is kind of late but I found the answer to your problem in the FAQs section of the course. Here's what it says:
"Q. What are the issues with Pandas categorical data?
A. Since version 0.6, seaborn.load_dataset converts certain columns to Pandas categorical data (see http://pandas.pydata.org/pandas-docs/stable/categorical.html). This creates a problem in the handling of the "flights" DataFrame used in "Introduction to Pandas/Using multilevel indices". To avoid the problem, you may load the dataset directly with Pandas:
flights = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv')"
I hope this helps.
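If you would rather keep using sns.load_dataset, an alternative to the FAQ's suggestion might be to turn the categorical column labels into a plain index after unstacking. The sketch below is my own assumption, not something from the course FAQ, and it uses a small hand-built frame in place of the real dataset.

```python
import pandas as pd

# Hand-built stand-in for the seaborn data, with a categorical month column.
flights = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": pd.Categorical(
        ["January", "February", "March"] * 2,
        categories=["January", "February", "March"],
    ),
    "passengers": [112, 118, 132, 115, 126, 141],
})

wide = flights.set_index(["year", "month"])["passengers"].unstack()

# Replace the categorical labels with a plain Index, keeping their order and name.
wide.columns = pd.Index(list(wide.columns), name=wide.columns.name)

# With ordinary labels, adding a new column needs no extra category.
wide["total"] = wide.sum(axis=1)
print(wide)
```

Converting the labels after unstacking keeps the calendar order of the months, whereas converting the month column to strings before unstacking would sort the columns alphabetically.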