I have a column called Names, which looks like the sample below. I need to compare it to a column in a different pandas DataFrame, which has the last name and first name but not the initials. I am trying to split the initials out into a new column, using a space as the delimiter, but will probably need to split the whole string. I tried this:
transpose_enron['lastname'], transpose_enron['firstname'], transpose_enron['middle initial'] = zip(*transpose_enron['Names'].apply(lambda x: x.split(' ', 1)))
and it gives me this error:
"ValueError: need more than 1 value to unpack"
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
8 BELFER ROBERT
Any ideas on how to do this?
Use the vectorised str.split with expand=True; this will unpack the list into the new columns:
In [17]:
df[['lastname', 'firstname', 'middle initial']] = df['name'].str.split(expand=True)
df
Out[17]:
name lastname firstname middle initial
index
0 ALLEN PHILLIP K ALLEN PHILLIP K
1 BADUM JAMES P BADUM JAMES P
2 BANNANTINE JAMES M BANNANTINE JAMES M
8 BELFER ROBERT BELFER ROBERT None
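For anyone who wants to paste and run this, here is a minimal self-contained sketch of the same idea, assuming a recent pandas and Python 3. The frame is rebuilt from the sample rows in the question (including the gap in the index), and the column is called Names as in the question rather than name.

```python
import pandas as pd

# Sample rows from the question; note the non-contiguous index (0, 1, 2, 8).
df = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M", "BELFER ROBERT"]},
    index=[0, 1, 2, 8],
)

cols = ["lastname", "firstname", "middle initial"]

# With no separator given, str.split splits on runs of whitespace.
# expand=True returns one column per token; rows with fewer tokens
# get a missing value in the trailing columns.
# Assumption: no row has more than three tokens, otherwise the number
# of produced columns would not match the three target names.
df[cols] = df["Names"].str.split(expand=True)

print(df)
```

The result of str.split keeps the index of the original Series, so the row labelled 8 lines up correctly.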
You can use the DataFrame constructor and, if you need to delete the original column, drop:
print(df)
Names
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
3 BELFER ROBERT
df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([x.split() for x in df['Names'].tolist()])
# if you want to delete the original column
df = df.drop('Names', axis=1)
print(df)
lastname firstname middle initial
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
3 BELFER ROBERT None
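One caveat, offered as a hedged note rather than a correction: assigning a DataFrame to columns aligns on index labels, and a DataFrame built from a plain list gets a default 0..n-1 index. That matches the example here, but the frame in the question jumps from 2 to 8. A minimal sketch, assuming a recent pandas, that passes the original index to the constructor so the rows line up regardless of the labels:

```python
import pandas as pd

# Same sample data, but with the index from the question (0, 1, 2, 8).
df = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M", "BELFER ROBERT"]},
    index=[0, 1, 2, 8],
)

cols = ["lastname", "firstname", "middle initial"]

# Build the split columns with the same index as df, so the join below
# matches rows by label. Shorter rows are padded with a missing value.
# Assumption: the longest row has exactly three tokens.
split_df = pd.DataFrame(
    [name.split() for name in df["Names"].tolist()],
    index=df.index,
    columns=cols,
)

# Attach the new columns and drop the original one.
df = df.join(split_df).drop("Names", axis=1)
print(df)
```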
Timings: len(df) = 10000*4
df = pd.concat([df]*10000).reset_index(drop=True)
print(df.head())
def jez(df):
    df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([x.split() for x in df['Names'].tolist()])
    return df
def edc(df):
    df[['lastname', 'firstname', 'middle initial']] = df['Names'].str.split(expand=True)
    return df
print(jez(df).head())
print(edc(df).head())
My solution is faster than Edchum's if the DataFrame is larger:
In [51]: %timeit jez(df)
10 loops, best of 3: 30.1 ms per loop
In [52]: %timeit edc(df)
10 loops, best of 3: 78 ms per loop
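If you want to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module, assuming Python 3 and a recent pandas. The function names are made up for this example, and both functions return a new frame instead of modifying df, so repeated runs do not affect each other. It also checks that both approaches give the same values before timing them.

```python
import timeit

import pandas as pd

cols = ["lastname", "firstname", "middle initial"]
base = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M", "BELFER ROBERT"]}
)
# 4 rows repeated 10000 times, as in the timings above.
df = pd.concat([base] * 10000).reset_index(drop=True)


def split_with_constructor(frame):
    # Python-level split, then one DataFrame constructor call.
    return pd.DataFrame(
        [x.split() for x in frame["Names"].tolist()],
        index=frame.index,
        columns=cols,
    )


def split_with_str_accessor(frame):
    # Vectorised string accessor.
    out = frame["Names"].str.split(expand=True)
    out.columns = cols
    return out


# Sanity check: both approaches should produce the same values.
a = split_with_constructor(df).fillna("").to_numpy().tolist()
b = split_with_str_accessor(df).fillna("").to_numpy().tolist()
same = a == b
print("same result:", same)

# Average seconds per call over a few runs.
runs = 3
t_constructor = timeit.timeit(lambda: split_with_constructor(df), number=runs) / runs
t_accessor = timeit.timeit(lambda: split_with_str_accessor(df), number=runs) / runs
print("constructor:  %.4f s per call" % t_constructor)
print("str accessor: %.4f s per call" % t_accessor)
```

The absolute numbers, and possibly the ratio, will depend on the pandas version and the machine, so treat the timings above as one data point.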
EDIT, addressing the error from the comments:
The problem is with the data: some rows contain 3 separators instead of 2, so you need to split them into four columns and then delete the temporary column tmp:
print(df)
Names
0 ALLEN PHILLIP K
1 BADUM JAMES P tttt
2 BANNANTINE JAMES M
df[['lastname', 'firstname', 'middle initial', 'tmp']] = pd.DataFrame([x.split() for x in df['Names'].tolist()])
print(df)
Names lastname firstname middle initial tmp
0 ALLEN PHILLIP K ALLEN PHILLIP K None
1 BADUM JAMES P tttt BADUM JAMES P tttt
2 BANNANTINE JAMES M BANNANTINE JAMES M None
# if you want to delete the original and temporary columns
df = df.drop(['Names', 'tmp'], axis=1)
print(df)
lastname firstname middle initial
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
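If you do not know in advance how many tokens the longest row has, a possible variation (a sketch, assuming a recent pandas) is to split first and then keep only the first three token positions with reindex. Extra tokens such as tttt are dropped without a temporary column, and a position that no row fills would simply come out as a column of missing values.

```python
import pandas as pd

# Sample data from the edit above: one row has a fourth token.
df = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P tttt", "BANNANTINE JAMES M"]}
)

cols = ["lastname", "firstname", "middle initial"]

# expand=True labels the resulting columns 0, 1, 2, ... by token position.
parts = df["Names"].str.split(expand=True)

# Keep exactly positions 0, 1 and 2, whatever the widest row was.
parts = parts.reindex(columns=[0, 1, 2])
parts.columns = cols

# Replace the original column with the three new ones.
df = df.drop("Names", axis=1).join(parts)
print(df)
```

Whether silently dropping the extra tokens is acceptable depends on your data, so it may be worth inspecting the rows with more than three tokens first.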