Splitting a string in pandas and join it to the old data

Tags: python, pandas

What I am doing seems simple, but I cannot figure it out.

I have a dataframe with data such as:

City    State ZIP
Ames    IA    50011-3617
Ankeny  IA    50021

I want to split the ZIP codes on - and keep only the first part, in a new dataframe that holds the old data but with the shortened ZIP code. I tried the following:

data_short_zip = data
df = data['ZIP'].str.split('-').str[0]
data_short_zip.join(df)

This not only throws an error, but seems unpythonic. Is there a simple way to do this?

The output data would look like

City    State ZIP
Ames    IA    50011
Ankeny  IA    50021

Jstuff asked Jul 19 '16 14:07


2 Answers

You can use str.split to split on your delimiter and then str[0] on the result to return the first piece of each split:

In [122]:
df['ZIP'] = df['ZIP'].str.split('-').str[0]
df

Out[122]:
     City State    ZIP
0    Ames    IA  50011
1  Ankeny    IA  50021
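
As far as I can tell, the attempt in the question fails for two reasons: DataFrame.join raises a ValueError when both sides have a ZIP column and no suffix is given, and data_short_zip = data only creates a second name for the same object rather than a copy. If the original frame should stay untouched, as the question asks, one possible sketch is to build the new frame with assign. The frame below is rebuilt from the sample rows in the question; the variable names follow the question, not this answer.

```python
import pandas as pd

# Sample data rebuilt from the rows shown in the question.
data = pd.DataFrame({
    "City": ["Ames", "Ankeny"],
    "State": ["IA", "IA"],
    "ZIP": ["50011-3617", "50021"],
})

# assign returns a new DataFrame with ZIP replaced;
# the original `data` is left unchanged.
data_short_zip = data.assign(ZIP=data["ZIP"].str.split("-").str[0])

print(data_short_zip)
```

After this runs, data still holds the full ZIP+4 values and data_short_zip holds only the part before the dash.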

EdChum answered Sep 22 '22 23:09


Ultimately, you want to grab those first 5 characters and reassign them to data.ZIP. Here are some alternatives for grabbing the first 5, all of which return the same thing:

data.ZIP.str.extract(r'^(\d{5})', expand=False)
data.ZIP.str[:5]
data.ZIP.str.split('-').str[0]
data.ZIP.str.split('-').str.get(0)

Each returns:

0    50011
1    50021
Name: ZIP, dtype: object
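
To check the claim that the four expressions agree, here is a small sketch that runs them on the question's sample rows and compares the values. The dictionary keys are just labels I made up for illustration. Note that the agreement holds for this sample; data.ZIP.str[:5] assumes every value starts with a five-character ZIP, which may not be true of messier data.

```python
import pandas as pd

# Sample data rebuilt from the rows shown in the question.
data = pd.DataFrame({
    "City": ["Ames", "Ankeny"],
    "State": ["IA", "IA"],
    "ZIP": ["50011-3617", "50021"],
})

# The four alternatives from the answer, keyed by illustrative labels.
candidates = {
    "extract": data.ZIP.str.extract(r"^(\d{5})", expand=False),
    "slice": data.ZIP.str[:5],
    "split_index": data.ZIP.str.split("-").str[0],
    "split_get": data.ZIP.str.split("-").str.get(0),
}

# Compare plain Python values so dtype differences between
# pandas versions do not matter.
reference = candidates["slice"].tolist()
for label, result in candidates.items():
    assert result.tolist() == reference, label

print(reference)
```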

It's pretty clear to me ;-) data.ZIP.str[:5] is the winner.

Then just assign the result back to data.ZIP:

data.ZIP = data.ZIP.str[:5]


Timing

Over the small 2-row sample, this is how fast they are:

[image: timing results]

Over that 2-row sample concatenated 10,000 times (20,000 rows):

data = pd.concat([data for _ in range(10000)])

[image: timing results]

Concatenated another 100 times (2 million rows):

data = pd.concat([data for _ in range(100)])

[image: timing results]
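
The comparison can be rerun locally with the standard timeit module. This is a rough sketch, not the setup used for the original measurements: the name big, the labels, and the repeat counts are arbitrary choices of mine, and only the 20,000-row case is shown to keep it quick. Absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Sample data rebuilt from the rows shown in the question.
data = pd.DataFrame({
    "City": ["Ames", "Ankeny"],
    "State": ["IA", "IA"],
    "ZIP": ["50011-3617", "50021"],
})

# 2 rows repeated 10,000 times -> 20,000 rows, as in the answer.
big = pd.concat([data for _ in range(10000)])

# The four alternatives, wrapped so each can be timed on the same Series.
candidates = {
    "extract": lambda s: s.str.extract(r"^(\d{5})", expand=False),
    "slice": lambda s: s.str[:5],
    "split [0]": lambda s: s.str.split("-").str[0],
    "split get(0)": lambda s: s.str.split("-").str.get(0),
}

calls = 10  # calls per measurement (arbitrary)
timings = {}
for label, func in candidates.items():
    # Best of 3 measurements, converted to seconds per call.
    best = min(timeit.repeat(lambda: func(big.ZIP), number=calls, repeat=3))
    timings[label] = best / calls
    print(f"{label:>12}: {timings[label] * 1e3:.3f} ms per call")
```

Raising the range in pd.concat scales the test up toward the 2-million-row case, at the cost of a longer run.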

piRSquared answered Sep 25 '22 23:09