I have a DataFrame that has a column that looks like:
Japan
valA
valB
Ghana
valC
valD
...
I want to extract the country names from this list and turn them into another column like so:
Japan valA
Japan valB
Ghana valC
Ghana valD
I am sure there's an answer for this already on SO, but I haven't been able to find the correct keywords to bring it up.
Right now, I am doing the following, but I then have to drop rows that initially contained the country names:
def get_country(row):
    if ...:  # decide if it's a country name
        return row[0]

df['country'] = df.apply(get_country, axis=1).ffill()
This seems like a fairly common use case when cleaning data. Is there a standard or better way of doing this?
I can get you started using map and ffill.
def is_country(x):
    # TODO - fill in the logic for this stub.
    return x in {'Japan', 'Ghana'}
df
A
0 Japan
1 valA
2 valB
3 Ghana
4 valC
5 valD
df.assign(B=df['A'].where(df['A'].map(is_country)).ffill()).query('A != B')
A B
1 valA Japan
2 valB Japan
4 valC Ghana
5 valD Ghana
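The steps in that one-liner can be spelled out as a minimal, self-contained sketch. The sample frame and the `known_countries` set are assumptions standing in for your real data and lookup logic; the mask-based filter is a swapped-in alternative to `query('A != B')` that gives the same rows here.

```python
import pandas as pd

# Hypothetical sample data mirroring the column in the question.
df = pd.DataFrame({'A': ['Japan', 'valA', 'valB', 'Ghana', 'valC', 'valD']})

# Assumption: a hard-coded set stands in for the real "is this a country?" check.
known_countries = {'Japan', 'Ghana'}

# True on the rows that hold a country name.
is_header = df['A'].isin(known_countries)

# Keep only the country names (NaN elsewhere), then carry each one forward.
country = df['A'].where(is_header).ffill()

# Drop the header rows and attach the forward-filled country as column B.
result = df.loc[~is_header].assign(B=country)
print(result)
```

Filtering with the boolean mask avoids comparing strings a second time, but `query('A != B')` works just as well for data shaped like this.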
You can use a package like pycountry (or something similar) to validate country names.
import pycountry
countries = {x.name for x in pycountry.countries} # Initialise a set.
def is_country(x):
    return x in countries
With this definition, though, you can simplify your code to:
df.assign(B=df['A'].where(df['A'].isin(countries)).ffill()).query('A != B')
And get rid of the is_country function entirely.
Using str.extract:
import numpy as np
new_df = df['col'].str.extract(r'(val.*)?(.*)').replace('', np.nan).rename(columns={1: 'Country', 0: 'Value'})
new_df['Country'] = new_df['Country'].ffill()
new_df.dropna(inplace = True)
Value Country
1 valA Japan
2 valB Japan
4 valC Ghana
5 valD Ghana
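A runnable variant of the same idea, as a sketch only. It assumes, as the sample data suggests, that every value row starts with the literal prefix `val` and that anything else is a country name; the named groups and the `mask` call are my substitutions for the `rename` and `replace` steps above.

```python
import pandas as pd

# Hypothetical sample data; 'col' is the column name used in this answer.
df = pd.DataFrame({'col': ['Japan', 'valA', 'valB', 'Ghana', 'valC', 'valD']})

# Assumption: value rows start with 'val'. The first group captures them;
# the second group captures whatever is left (the country name, or '').
parts = df['col'].str.extract(r'^(?P<Value>val.*)?(?P<Country>.*)$')

# Blank out the empty strings left on value rows, then forward-fill countries.
parts['Country'] = parts['Country'].mask(parts['Country'].eq('')).ffill()

# Rows that were country headers have no Value, so drop them.
result = parts.dropna(subset=['Value'])
print(result)
```

If your real values do not share a common prefix, the regex will not separate them, and the set-membership approach from the other answer is the safer choice.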