I have a data frame that is as follows:
Honda [edit]
Accord (4 models)
Civic (4 models)
Pilot (3 models)
Toyota [edit]
Prius (4 models)
Highlander (3 models)
Ford [edit]
Explorer (2 models)
I am looking to reshape it such that I get a resulting 2 column data frame as follows:
Honda Accord
Honda Civic
Honda Pilot
Toyota Prius
Toyota Highlander
and so on. I tried str.split trying to split between edits, but was not successful. Any suggestions are most appreciated! Python newbie here...so apologies if this has been addressed before. Thanks!
So far I tried
maker=car['T'].str.extract('(.*\[edit\])', expand=False).str.replace('\[edit\]',"")
This gives me the list of Makers: Honda, Toyota and Ford. However I am stuck at finding a way to extract the models between the makers to create the 2 col DF.
The trick is to extract the car column first, then to get the maker.
import pandas as pd
import numpy as np
df['model'] = df['T'].apply(lambda x: x.split(
'(')[0].strip() if x.count('(') > 0 else np.NaN)
df['maker'] = df['T'].apply(lambda x: x.split('[')[0].strip(
) if x.count('[') > 0 else np.NaN).fillna(method="ffill")
df = df.dropna().drop('T', axis=1).reindex(
columns=['maker', 'model']).reset_index(drop=True)
The first line of the code extracts all the cars by using split and strip string operations if the entry contained '('
, it assigns NaN
otherwise, we use NaN
so that we can delete those rows after finding the makers.
At this stage the data frame df
will be:
+----+-----------------------+------------+
| | T | model |
|----+-----------------------+------------|
| 0 | Honda [edit] | nan |
| 1 | Accord (4 models) | Accord |
| 2 | Civic (4 models) | Civic |
| 3 | Pilot (3 models) | Pilot |
| 4 | Toyota [edit] | nan |
| 5 | Prius (4 models) | Prius |
| 6 | Highlander (3 models) | Highlander |
| 7 | Ford [edit] | nan |
| 8 | Explorer (2 models) | Explorer |
+----+-----------------------+------------+
The second line does the same but for '['
records, here the NaNs
are used to fill forward the empty maker cells using fillna
At this stage the data frame df
will be:
+----+-----------------------+------------+---------+
| | T | model | maker |
|----+-----------------------+------------+---------|
| 0 | Honda [edit] | nan | Honda |
| 1 | Accord (4 models) | Accord | Honda |
| 2 | Civic (4 models) | Civic | Honda |
| 3 | Pilot (3 models) | Pilot | Honda |
| 4 | Toyota [edit] | nan | Toyota |
| 5 | Prius (4 models) | Prius | Toyota |
| 6 | Highlander (3 models) | Highlander | Toyota |
| 7 | Ford [edit] | nan | Ford |
| 8 | Explorer (2 models) | Explorer | Ford |
+----+-----------------------+------------+---------+
The third line drops the extra records and rearrange the columns as well as reset the index
| | maker | model |
|----+---------+------------|
| 0 | Honda | Accord |
| 1 | Honda | Civic |
| 2 | Honda | Pilot |
| 3 | Toyota | Prius |
| 4 | Toyota | Highlander |
| 5 | Ford | Explorer |
EDIT:
A more "pandorable" version (I am fond of one liners)
df = df['T'].str.extractall('(.+)\[|(.+)\(').apply(
lambda x: x.ffill()
if x.name==0
else x).dropna(subset=[1]).reset_index(
drop=True).rename(columns={1:'Model',0:'Maker'})
the above works as follows
extractall
will return a DataFrame with two columns; column 0
corresponding to the group in the regex extracted using the first group'(.+)\['
i.e. the maker records ending with; and column 1
, corresponding to the second group i.e. '(.+)\('
, apply
is used to iterate through the columns, the column named 0
will be modified to propagate the 'Maker' values forward via ffill
and column 1
will be left as is. dropna
is then used with subset 1
to remove all rows where the value in column 1
is NaN
, reset_index
is used to drop the mult-index extractall
generates. finally the columns are renamed using rename
and a correspondence dictionary
Another one liner (func ;))
df['T'].apply(lambda line: [line.split('[')[0],None] if line.count('[')
else [None,line.split('(')[0].strip()]
).apply(pd.Series
).rename(
columns={0:'Maker',1:'Model'}
).apply(
lambda col: col.ffill() if col.name == 'Maker'
else col).dropna(
subset=['Model']
).reset_index(drop=True)
You can use extract
with ffill
. Then remove rows which contains [edit]
by boolean indexing
and mask by str.contains
, then reset_index
for create unique index
and last remove original column col
by drop
:
df['model'] = df.col.str.extract('(.*)\[edit\]', expand=False).ffill()
df['type'] = df.col.str.extract('([A-Za-z]+)', expand=False)
df = df[~df.col.str.contains('\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print (df)
model type
0 Honda Accord
1 Honda Civic
2 Honda Pilot
3 Toyota Prius
4 Toyota Highlander
5 Ford Explorer
Another solution use extract
and where
for create new column by condition and last use boolean indexing
again:
df['type'] = df.col.str.extract('([A-Za-z]+)', expand=False)
df['model'] = df['type'].where(df.col.str.contains('\[edit\]')).ffill()
df = df[df.type != df.model].reset_index(drop=True).drop('col', axis=1)
print (df)
type model
0 Accord Honda
1 Civic Honda
2 Pilot Honda
3 Prius Toyota
4 Highlander Toyota
5 Explorer Ford
EDIT:
If need type
with spaces
in text, use replace
all values from ( to the end, also remove spaces by s\+
:
print (df)
col
0 Honda [edit]
1 Accord (4 models)
2 Civic (4 models)
3 Pilot (3 models)
4 Toyota [edit]
5 Prius (4 models)
6 Highlander (3 models)
7 Ford [edit]
8 Ford Expedition XL (2 models)
df['model'] = df.col.str.extract('(.*)\[edit\]', expand=False).ffill()
df['type'] = df.col.str.replace(r'\s+\(.+$', '')
df = df[~df.col.str.contains('\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print (df)
model type
0 Honda Accord
1 Honda Civic
2 Honda Pilot
3 Toyota Prius
4 Toyota Highlander
5 Ford Ford Expedition XL
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With