Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas rearranging a data frame

Tags:

python

pandas

I have a data frame that is as follows:

Honda [edit]
Accord (4 models)
Civic  (4 models)
Pilot  (3 models)
Toyota [edit]
Prius  (4 models)
Highlander (3 models)
Ford [edit]
Explorer (2 models)

I am looking to reshape it such that I get a resulting 2 column data frame as follows:

 Honda     Accord
 Honda     Civic
 Honda     Pilot
 Toyota    Prius
 Toyota    Highlander

and so on. I tried str.split trying to split between edits, but was not successful. Any suggestions are most appreciated! Python newbie here...so apologies if this has been addressed before. Thanks!

So far I tried

     maker=car['T'].str.extract('(.*\[edit\])', expand=False).str.replace('\[edit\]',"")

This gives me the list of Makers: Honda, Toyota and Ford. However I am stuck at finding a way to extract the models between the makers to create the 2 col DF.

like image 310
FlyingPickle Avatar asked Jan 04 '17 06:01

FlyingPickle


2 Answers

The trick is to extract the car column first, then to get the maker.

import pandas as pd
import numpy as np

df['model'] = df['T'].apply(lambda x: x.split(
    '(')[0].strip() if x.count('(') > 0 else np.NaN)

df['maker'] = df['T'].apply(lambda x: x.split('[')[0].strip(
) if x.count('[') > 0 else np.NaN).fillna(method="ffill")

df = df.dropna().drop('T', axis=1).reindex(
    columns=['maker', 'model']).reset_index(drop=True)

The first line of the code extracts all the cars by using split and strip string operations if the entry contained '(', it assigns NaN otherwise, we use NaN so that we can delete those rows after finding the makers. At this stage the data frame df will be:

+----+-----------------------+------------+
|    | T                     | model      |
|----+-----------------------+------------|
|  0 | Honda [edit]          | nan        |
|  1 | Accord (4 models)     | Accord     |
|  2 | Civic  (4 models)     | Civic      |
|  3 | Pilot  (3 models)     | Pilot      |
|  4 | Toyota [edit]         | nan        |
|  5 | Prius  (4 models)     | Prius      |
|  6 | Highlander (3 models) | Highlander |
|  7 | Ford [edit]           | nan        |
|  8 | Explorer (2 models)   | Explorer   |
+----+-----------------------+------------+

The second line does the same but for '[' records, here the NaNs are used to fill forward the empty maker cells using fillna At this stage the data frame df will be:

+----+-----------------------+------------+---------+
|    | T                     | model      | maker   |
|----+-----------------------+------------+---------|
|  0 | Honda [edit]          | nan        | Honda   |
|  1 | Accord (4 models)     | Accord     | Honda   |
|  2 | Civic  (4 models)     | Civic      | Honda   |
|  3 | Pilot  (3 models)     | Pilot      | Honda   |
|  4 | Toyota [edit]         | nan        | Toyota  |
|  5 | Prius  (4 models)     | Prius      | Toyota  |
|  6 | Highlander (3 models) | Highlander | Toyota  |
|  7 | Ford [edit]           | nan        | Ford    |
|  8 | Explorer (2 models)   | Explorer   | Ford    |
+----+-----------------------+------------+---------+

The third line drops the extra records and rearrange the columns as well as reset the index

|    | maker   | model      |
|----+---------+------------|
|  0 | Honda   | Accord     |
|  1 | Honda   | Civic      |
|  2 | Honda   | Pilot      |
|  3 | Toyota  | Prius      |
|  4 | Toyota  | Highlander |
|  5 | Ford    | Explorer   |

EDIT:

A more "pandorable" version (I am fond of one liners)

df = df['T'].str.extractall('(.+)\[|(.+)\(').apply(
    lambda x: x.ffill() 
    if x.name==0 
    else x).dropna(subset=[1]).reset_index(
    drop=True).rename(columns={1:'Model',0:'Maker'})

the above works as follows extractall will return a DataFrame with two columns; column 0 corresponding to the group in the regex extracted using the first group'(.+)\[' i.e. the maker records ending with; and column 1, corresponding to the second group i.e. '(.+)\(', apply is used to iterate through the columns, the column named 0 will be modified to propagate the 'Maker' values forward via ffill and column 1 will be left as is. dropna is then used with subset 1 to remove all rows where the value in column 1 is NaN, reset_index is used to drop the mult-index extractall generates. finally the columns are renamed using rename and a correspondence dictionary

enter image description here

Another one liner (func ;))

 df['T'].apply(lambda line: [line.split('[')[0],None] if line.count('[') 
                       else [None,line.split('(')[0].strip()]
              ).apply(pd.Series
                      ).rename(
                            columns={0:'Maker',1:'Model'}
                        ).apply(
                         lambda col: col.ffill() if col.name == 'Maker' 
                         else col).dropna(
                                    subset=['Model']
                                    ).reset_index(drop=True)
like image 153
sgDysregulation Avatar answered Oct 02 '22 13:10

sgDysregulation


You can use extract with ffill. Then remove rows which contains [edit] by boolean indexing and mask by str.contains, then reset_index for create unique index and last remove original column col by drop:

df['model'] = df.col.str.extract('(.*)\[edit\]', expand=False).ffill()
df['type'] = df.col.str.extract('([A-Za-z]+)', expand=False)
df = df[~df.col.str.contains('\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print (df)
     model        type
0   Honda       Accord
1   Honda        Civic
2   Honda        Pilot
3  Toyota        Prius
4  Toyota   Highlander
5    Ford     Explorer

Another solution use extract and where for create new column by condition and last use boolean indexing again:

df['type'] = df.col.str.extract('([A-Za-z]+)', expand=False)
df['model'] = df['type'].where(df.col.str.contains('\[edit\]')).ffill()
df = df[df.type != df.model].reset_index(drop=True).drop('col', axis=1)
print (df)
         type   model
0      Accord   Honda
1       Civic   Honda
2       Pilot   Honda
3       Prius  Toyota
4  Highlander  Toyota
5    Explorer    Ford

EDIT:

If need type with spaces in text, use replace all values from ( to the end, also remove spaces by s\+:

print (df)
                             col
0                   Honda [edit]
1              Accord (4 models)
2              Civic  (4 models)
3              Pilot  (3 models)
4                  Toyota [edit]
5              Prius  (4 models)
6          Highlander (3 models)
7                    Ford [edit]
8  Ford Expedition XL (2 models)

df['model'] = df.col.str.extract('(.*)\[edit\]', expand=False).ffill()
df['type'] = df.col.str.replace(r'\s+\(.+$', '')
df = df[~df.col.str.contains('\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print (df)
     model                type
0   Honda               Accord
1   Honda                Civic
2   Honda                Pilot
3  Toyota                Prius
4  Toyota           Highlander
5    Ford   Ford Expedition XL
like image 26
jezrael Avatar answered Oct 02 '22 13:10

jezrael