I have a dataset in the following format:
county area pop_2006 pop_2007 pop_2008
01001 275 1037 1052 1102
01003 394 2399 2424 2438
01005 312 1638 1647 1660
And I need it in a format like this:
county year pop area
01001 2006 1037 275
01001 2007 1052 275
01001 2008 1102 275
01003 2006 2399 394
01003 2007 2424 394
...
I've tried every combination of pivot_table, stack, unstack, wide_to_long that I can think of, with no success yet. (clearly I'm mostly illiterate in Python/pandas, so please be gentle...).
You can use the following basic syntax to convert a pandas DataFrame from a wide format to a long format: df = pd. melt(df, id_vars='col1', value_vars=['col2', 'col3', ...]) In this scenario, col1 is the column we use as an identifier and col2, col3, etc.
The pyarrow library is able to construct a pandas. DataFrame faster than using pandas.
Return a copy of the array collapsed into one dimension. Whether to flatten in C (row-major), Fortran (column-major) order, or preserve the C/Fortran ordering from a . The default is 'C'.
You can use melt
for reshaping, then split
column variable
and drop
and sort_values
. I think you can cast column year
to int
by astype
and last change order of columns by subset
:
df1 = (pd.melt(df, id_vars=['county','area'], value_name='pop'))
df1[['tmp','year']] = df1.variable.str.split('_', expand=True)
df1 = df1.drop(['variable', 'tmp'],axis=1).sort_values(['county','year'])
df1['year'] = df1.year.astype(int)
df1 = df1[['county','year','pop','area']]
print (df1)
county year pop area
0 1001 2006 1037 275
3 1001 2007 1052 275
6 1001 2008 1102 275
1 1003 2006 2399 394
4 1003 2007 2424 394
7 1003 2008 2438 394
2 1005 2006 1638 312
5 1005 2007 1647 312
8 1005 2008 1660 312
print (df1.dtypes)
county int64
year int32
pop int64
area int64
dtype: object
Another solution with set_index
, stack
and reset_index
:
df2 = df.set_index(['county','area']).stack().reset_index(name='pop')
df2[['tmp','year']] = df2.level_2.str.split('_', expand=True)
df2 = df2.drop(['level_2', 'tmp'],axis=1)
df2['year'] = df2.year.astype(int)
df2 = df2[['county','year','pop','area']]
print (df2)
county year pop area
0 1001 2006 1037 275
1 1001 2007 1052 275
2 1001 2008 1102 275
3 1003 2006 2399 394
4 1003 2007 2424 394
5 1003 2008 2438 394
6 1005 2006 1638 312
7 1005 2007 1647 312
8 1005 2008 1660 312
print (df2.dtypes)
county int64
year int32
pop int64
area int64
dtype: object
As the question title suggests, we can use pd.wide_to_long
:
res = pd.wide_to_long(df, stubnames="pop", i=["county", "area"], j="year", sep="_")
to get
pop
county area year
1001 275 2006 1037
2007 1052
2008 1102
1003 394 2006 2399
2007 2424
2008 2438
1005 312 2006 1638
2007 1647
2008 1660
To exactly match the output format in the question, a reset_index
and reindex
(over columns) can be chained:
>>> res.reset_index().reindex(["county", "year", "pop", "area"], axis=1)
county year pop area
0 1001 2006 1037 275
1 1001 2007 1052 275
2 1001 2008 1102 275
3 1003 2006 2399 394
4 1003 2007 2424 394
5 1003 2008 2438 394
6 1005 2006 1638 312
7 1005 2007 1647 312
8 1005 2008 1660 312
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With