I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that makes Year = 2000 and quarter= q2 into 2000q2.
Can anyone help with that?
By use + operator simply you can combine/merge two or multiple text/string columns in pandas DataFrame. Note that when you apply + operator on numeric columns it actually does addition instead of concatenation.
By use + operator simply you can concatenate two or multiple text/string columns in pandas DataFrame.
If both columns are strings, you can concatenate them directly:
df["period"] = df["Year"] + df["quarter"] If one (or both) of the columns are not string typed, you should convert it (them) first,
df["period"] = df["Year"].astype(str) + df["quarter"] If you need to join multiple string columns, you can use agg:
df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1) Where "-" is the separator.
[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])] or slightly slower but more compact:
df.Year.str.cat(df.quarter) df['Year'].astype(str) + df['quarter'] UPDATE: Timing graph Pandas 0.23.4

Let's test it on 200K rows DF:
In [250]: df Out[250]: Year quarter 0 2014 q1 1 2015 q2 In [251]: df = pd.concat([df] * 10**5) In [252]: df.shape Out[252]: (200000, 2) UPDATE: new timings using Pandas 0.19.0
Timing without CPU/GPU optimization (sorted from fastest to slowest):
In [107]: %timeit df['Year'].astype(str) + df['quarter'] 10 loops, best of 3: 131 ms per loop In [106]: %timeit df['Year'].map(str) + df['quarter'] 10 loops, best of 3: 161 ms per loop In [108]: %timeit df.Year.str.cat(df.quarter) 10 loops, best of 3: 189 ms per loop In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1) 1 loop, best of 3: 567 ms per loop In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1) 1 loop, best of 3: 584 ms per loop In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1) 1 loop, best of 3: 24.7 s per loop Timing using CPU/GPU optimization:
In [113]: %timeit df['Year'].astype(str) + df['quarter'] 10 loops, best of 3: 53.3 ms per loop In [114]: %timeit df['Year'].map(str) + df['quarter'] 10 loops, best of 3: 65.5 ms per loop In [115]: %timeit df.Year.str.cat(df.quarter) 10 loops, best of 3: 79.9 ms per loop In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1) 1 loop, best of 3: 230 ms per loop In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1) 1 loop, best of 3: 230 ms per loop In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1) 1 loop, best of 3: 9.38 s per loop Answer contribution by @anton-vbr
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With