
Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries

I have a pandas dataframe with a column named 'City, State, Country'. I want to separate this column into three new columns: 'City', 'State' and 'Country'.

0                 HUN
1                 ESP
2                 GBR
3                 ESP
4                 FRA
5             ID, USA
6             GA, USA
7    Hoboken, NJ, USA
8             NJ, USA
9                 AUS

Splitting the column into three columns is trivial enough:

location_df = df['City, State, Country'].apply(lambda x: pd.Series(x.split(','))) 

However, this creates left-aligned data:

     0       1       2
0    HUN     NaN     NaN
1    ESP     NaN     NaN
2    GBR     NaN     NaN
3    ESP     NaN     NaN
4    FRA     NaN     NaN
5    ID      USA     NaN
6    GA      USA     NaN
7    Hoboken  NJ     USA
8    NJ      USA     NaN
9    AUS     NaN     NaN

How would one go about creating the new columns with the data right-aligned? Would I need to iterate through every row, count the number of commas and handle the contents individually?

jamesbev asked Apr 26 '14 22:04




2 Answers

I'd do something like the following:

# Split each entry on commas and reverse the parts so the country comes first.
foo = lambda x: pd.Series([i for i in reversed(x.split(','))])
rev = df['City, State, Country'].apply(foo)
print(rev)

       0    1        2
0   HUN  NaN      NaN
1   ESP  NaN      NaN
2   GBR  NaN      NaN
3   ESP  NaN      NaN
4   FRA  NaN      NaN
5   USA   ID      NaN
6   USA   GA      NaN
7   USA   NJ  Hoboken
8   USA   NJ      NaN
9   AUS  NaN      NaN

I think that gets you what you want, but if you also want to pretty things up and get a City, State, Country column order, you could add the following:

# Name the reversed columns, then reorder them as City, State, Country.
rev.rename(columns={0:'Country',1:'State',2:'City'},inplace=True)
rev = rev[['City','State','Country']]
print(rev)

      City State Country
0      NaN   NaN     HUN
1      NaN   NaN     ESP
2      NaN   NaN     GBR
3      NaN   NaN     ESP
4      NaN   NaN     FRA
5      NaN    ID     USA
6      NaN    GA     USA
7  Hoboken    NJ     USA
8      NaN    NJ     USA
9      NaN   NaN     AUS
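For anyone who wants to run this end to end, the approach above could be sketched as a self-contained script roughly like the one below. It is only one way to write it: the sample frame is rebuilt from the question's data, split_reversed is a hypothetical helper name, and the strip() call is an addition (not in the answer) that removes the space left after each comma.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({'City, State, Country': [
    'HUN', 'ESP', 'GBR', 'ESP', 'FRA',
    'ID, USA', 'GA, USA', 'Hoboken, NJ, USA', 'NJ, USA', 'AUS',
]})


def split_reversed(value):
    # Hypothetical helper: split on commas, trim whitespace (an addition),
    # and reverse the parts so the country is always at position 0.
    parts = [part.strip() for part in value.split(',')]
    return pd.Series(parts[::-1])


rev = df['City, State, Country'].apply(split_reversed)

# Name the reversed positions, then select them in the desired order.
rev = rev.rename(columns={0: 'Country', 1: 'State', 2: 'City'})
rev = rev[['City', 'State', 'Country']]
print(rev)
```

Selecting the columns by name after the rename means the result does not depend on the order in which pandas happens to create the numbered columns.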
Karl D. answered Sep 24 '22 04:09



Assuming the column is named target:

df[["City", "State", "Country"]] = df["target"].str.split(pat=",", expand=True) 
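Note that str.split with expand=True fills the new columns from the left, so on the question's data a one-part entry such as HUN lands in the first column rather than the last. One possible way to get right-aligned columns is to pad each split row on the left, as in the sketch below. It assumes that no entry has more than three comma-separated parts; right_align and its width parameter are hypothetical names, and the sample frame is rebuilt from the question's data.

```python
import pandas as pd

# Sample data rebuilt from the question, stored under the name "target".
df = pd.DataFrame({'target': [
    'HUN', 'ESP', 'GBR', 'ESP', 'FRA',
    'ID, USA', 'GA, USA', 'Hoboken, NJ, USA', 'NJ, USA', 'AUS',
]})

# For comparison: expand=True pads missing parts on the right (left-aligned).
left = df['target'].str.split(',', expand=True)


def right_align(value, width=3):
    # Hypothetical helper: split, trim whitespace, then pad on the LEFT with
    # None so the last part always sits in the last column.
    # Assumes at most `width` comma-separated parts per entry.
    parts = [part.strip() for part in value.split(',')]
    return [None] * (width - len(parts)) + parts


aligned = pd.DataFrame(
    [right_align(value) for value in df['target']],
    index=df.index,
    columns=['City', 'State', 'Country'],
)
df = df.join(aligned)
print(df)
```

Padding on the left keeps the columns in City, State, Country order from the start, so no reversing or renaming step is needed afterwards.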
Dolittle Wang answered Sep 23 '22 04:09
