I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year
and quarter
. I'd like to create a variable called period
that makes Year = 2000
and quarter= q2
into 2000q2
.
Can anyone help with that?
By use + operator simply you can combine/merge two or multiple text/string columns in pandas DataFrame. Note that when you apply + operator on numeric columns it actually does addition instead of concatenation.
By use + operator simply you can concatenate two or multiple text/string columns in pandas DataFrame.
If both columns are strings, you can concatenate them directly:
df["period"] = df["Year"] + df["quarter"]
If one (or both) of the columns are not string typed, you should convert it (them) first,
df["period"] = df["Year"].astype(str) + df["quarter"]
If you need to join multiple string columns, you can use agg
:
df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)
Where "-" is the separator.
[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
or slightly slower but more compact:
df.Year.str.cat(df.quarter)
df['Year'].astype(str) + df['quarter']
UPDATE: Timing graph Pandas 0.23.4
Let's test it on 200K rows DF:
In [250]: df Out[250]: Year quarter 0 2014 q1 1 2015 q2 In [251]: df = pd.concat([df] * 10**5) In [252]: df.shape Out[252]: (200000, 2)
UPDATE: new timings using Pandas 0.19.0
Timing without CPU/GPU optimization (sorted from fastest to slowest):
In [107]: %timeit df['Year'].astype(str) + df['quarter'] 10 loops, best of 3: 131 ms per loop In [106]: %timeit df['Year'].map(str) + df['quarter'] 10 loops, best of 3: 161 ms per loop In [108]: %timeit df.Year.str.cat(df.quarter) 10 loops, best of 3: 189 ms per loop In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1) 1 loop, best of 3: 567 ms per loop In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1) 1 loop, best of 3: 584 ms per loop In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1) 1 loop, best of 3: 24.7 s per loop
Timing using CPU/GPU optimization:
In [113]: %timeit df['Year'].astype(str) + df['quarter'] 10 loops, best of 3: 53.3 ms per loop In [114]: %timeit df['Year'].map(str) + df['quarter'] 10 loops, best of 3: 65.5 ms per loop In [115]: %timeit df.Year.str.cat(df.quarter) 10 loops, best of 3: 79.9 ms per loop In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1) 1 loop, best of 3: 230 ms per loop In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1) 1 loop, best of 3: 230 ms per loop In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1) 1 loop, best of 3: 9.38 s per loop
Answer contribution by @anton-vbr
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With