
Combine two columns of text in pandas dataframe

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that turns Year = 2000 and quarter = q2 into 2000q2.

Can anyone help with that?

user2866103 asked Oct 15 '13 09:10


People also ask

How do I merge two columns of text in pandas?

You can combine two or more text (string) columns in a pandas DataFrame with the + operator. Note that applying + to numeric columns performs addition rather than concatenation.

How do I concatenate two columns in a data frame?

The + operator concatenates two or more text (string) columns in a pandas DataFrame.


2 Answers

If both columns are strings, you can concatenate them directly:

df["period"] = df["Year"] + df["quarter"] 

If one (or both) of the columns is not string typed, convert it (them) first:

df["period"] = df["Year"].astype(str) + df["quarter"] 

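Beware of NaNs when doing this! Concatenating a string with NaN yields NaN, and in many pandas versions astype(str) turns a missing value into the literal string "nan".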
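As a rough illustration of the conversion and the NaN caveat, here is a minimal sketch on a made-up two-row frame. The sample values and the period_filled column name are assumptions for illustration, and filling with an empty string is only one possible way to handle missing quarters.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: Year is numeric, quarter is text with one missing value.
df = pd.DataFrame({"Year": [2000, 2001], "quarter": ["q2", np.nan]})

# Year is not string typed, so convert it before concatenating.
df["period"] = df["Year"].astype(str) + df["quarter"]
# Row 0 is "2000q2"; row 1 is NaN because string + NaN propagates the missing value.

# One possible workaround (an assumption, not the only option):
# replace missing quarters with an empty string before concatenating.
df["period_filled"] = df["Year"].astype(str) + df["quarter"].fillna("")

print(df)
```

Whether an empty string, a placeholder such as "q?", or dropping the row is appropriate depends on what the period column is used for.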


If you need to join multiple string columns, you can use agg:

df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1) 

Where "-" is the separator.
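A small sketch of the agg approach with three columns. The month column and the sample values are hypothetical, added only to show joining more than two columns; all columns are assumed to hold strings already, since str.join fails on non-string values.

```python
import pandas as pd

# Hypothetical sample data; every column already holds strings.
df = pd.DataFrame(
    {
        "Year": ["2000", "2001"],
        "quarter": ["q2", "q3"],
        "month": ["04", "07"],  # illustrative extra column, not in the question
    }
)

cols = ["Year", "quarter", "month"]

# With axis=1, each row is passed to "-".join, which joins its values in column order.
df["period"] = df[cols].agg("-".join, axis=1)

print(df["period"].tolist())
```

If some columns are not string typed, casting the selection first with df[cols].astype(str) should let the same join work.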

silvado answered Oct 04 '22 23:10


Small data sets (< 150 rows)

[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])] 

or slightly slower but more compact (Year must already be string typed, because the .str accessor only works on string values):

df.Year.str.cat(df.quarter) 

Larger data sets (> 150 rows)

df['Year'].astype(str) + df['quarter'] 
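To check that the three approaches above produce the same values, here is a minimal sketch using the two sample rows from this answer. Year is assumed to be an integer column, so it is cast to str before calling str.cat; the variable names are illustrative.

```python
import pandas as pd

# Sample rows matching the answer's example frame; Year is assumed to be int.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# List comprehension over zipped columns (returns a plain Python list).
by_zip = ["".join(pair) for pair in zip(df["Year"].map(str), df["quarter"])]

# str.cat needs a string column on the left, hence the cast.
by_cat = df["Year"].astype(str).str.cat(df["quarter"])

# Vectorised addition of two string columns.
by_add = df["Year"].astype(str) + df["quarter"]

print(by_zip, by_cat.tolist(), by_add.tolist())
```

Note that the list comprehension yields a list rather than a Series, so it carries no index; assigning it to a column relies on positional order.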

UPDATE: Timing graph Pandas 0.23.4


Let's test it on a DataFrame with 200K rows:

In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop

Timing using CPU/GPU optimization:

In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop

Answer contribution by @anton-vbr
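For readers who want to rerun a comparison like this outside IPython, here is a rough sketch using the standard timeit module. It is not the original benchmark: the frame is shrunk to 20,000 rows to keep the run short, only three of the variants are included, Year is assumed to be an integer column, and the labels in candidates are made up. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Same two sample rows as above, repeated to 20,000 rows (an assumption:
# smaller than the 200K rows in the answer so the sketch finishes quickly).
base = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})
df = pd.concat([base] * 10**4, ignore_index=True)

# Hypothetical labels mapped to the expressions being compared.
candidates = {
    "astype(str) + add": lambda: df["Year"].astype(str) + df["quarter"],
    "map(str) + add": lambda: df["Year"].map(str) + df["quarter"],
    "astype(str).str.cat": lambda: df["Year"].astype(str).str.cat(df["quarter"]),
}

# Best of 3 single runs per candidate, in seconds.
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}

# Print from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Taking the minimum of several repeats mirrors the "best of 3" convention that %timeit reports in the listings above.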

MaxU - stop WAR against UA answered Oct 04 '22 23:10