
Split a column into 3 columns in pandas

I have a column called Names, which looks like the sample below. I need to compare it with a column in a different pandas DataFrame that has the last name and first name but not the initials. I am trying to split the initials out into a new column, using a space as the delimiter, but will probably need to split the whole string. I tried this:

transpose_enron['lastname'], transpose_enron['firstname'], transpose_enron['middle initial'] = zip(*transpose_enron['Names'].apply(lambda x: x.split(' ', 1)))

and it gives me this error

"ValueError: need more than 1 value to unpack"

0                    ALLEN PHILLIP K
1                      BADUM JAMES P
2                 BANNANTINE JAMES M
8                      BELFER ROBERT

Any ideas on how to do this?

Amit Singh Parihar asked Jan 30 '16 17:01


2 Answers

Use the vectorised str.split with expand=True; this unpacks the resulting list into the new columns:

In [17]:
df[['lastname', 'firstname', 'middle initial']] = df['name'].str.split(expand=True)
df

Out[17]:
                     name    lastname firstname middle initial
index                                                         
0         ALLEN PHILLIP K       ALLEN   PHILLIP              K
1           BADUM JAMES P       BADUM     JAMES              P
2      BANNANTINE JAMES M  BANNANTINE     JAMES              M
8           BELFER ROBERT      BELFER    ROBERT           None
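
For readers who want to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. The sample frame is rebuilt by hand from the data in the question, and the intermediate variable `parts` is only an illustrative name, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question; the index is deliberately non-contiguous (0, 1, 2, 8).
df = pd.DataFrame(
    {"name": ["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M", "BELFER ROBERT"]},
    index=[0, 1, 2, 8],
)

# With no separator given, str.split splits on whitespace.
# expand=True returns a DataFrame with one column per token instead of a Series of lists.
parts = df["name"].str.split(expand=True)

# Assign the token columns to named columns; rows with fewer tokens get a missing value.
df[["lastname", "firstname", "middle initial"]] = parts

print(df)
```

Because `parts` keeps the index of `df`, the assignment lines up row for row even when the index has gaps.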
EdChum answered Nov 09 '22 06:11


You can use the DataFrame constructor and, if you need to delete the original column, drop:

print(df)
                Names
0     ALLEN PHILLIP K
1       BADUM JAMES P
2  BANNANTINE JAMES M
3       BELFER ROBERT

df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])

# if you want to delete the original column
df = df.drop('Names', axis=1)
print(df)
     lastname firstname middle initial
0       ALLEN   PHILLIP              K
1       BADUM     JAMES              P
2  BANNANTINE     JAMES              M
3      BELFER    ROBERT           None
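
One caveat worth sketching: assigning a DataFrame aligns on the index, and the constructor above produces a fresh 0..n-1 index. The sample here has that index already, but the data in the question is indexed 0, 1, 2, 8. The variant below passes `index=df.index` to the constructor as a precaution; that argument and the name `tokens` are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

# Same names as in the question, with its non-contiguous index.
df = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M", "BELFER ROBERT"]},
    index=[0, 1, 2, 8],
)

# Build the token table in plain Python; shorter rows are padded with missing values.
tokens = pd.DataFrame(
    [name.split() for name in df["Names"].tolist()],
    index=df.index,  # assumption: reuse the original index so the assignment aligns
)

df[["lastname", "firstname", "middle initial"]] = tokens

print(df)
```

Without the `index=` argument, the row labelled 8 would find no matching label in the new frame and would end up with missing values.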

Timings: len(df) = 10000*4

df = pd.concat([df]*10000).reset_index(drop=True)

print(df.head())

def jez(df):
    df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
    return df

def edc(df):
    df[['lastname', 'firstname', 'middle initial']] = df['Names'].str.split(expand=True)
    return df

print(jez(df).head())
print(edc(df).head())

My solution is faster than EdChum's when the DataFrame is larger:

In [51]: %timeit jez(df)
10 loops, best of 3: 30.1 ms per loop

In [52]: %timeit edc(df)
10 loops, best of 3: 78 ms per loop
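
The comparison can be reproduced outside IPython with the standard timeit module. The sketch below is a rough harness rather than a rigorous benchmark: the function names are hypothetical, and the numbers will vary with the pandas version and the machine, so no expected timings are shown.

```python
import timeit

import pandas as pd

base = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M", "BELFER ROBERT"]}
)
# 4 rows repeated 10000 times gives 40000 rows, as in the timings above.
big = pd.concat([base] * 10000).reset_index(drop=True)


def split_with_constructor(frame):
    # Plain Python split per row, then one DataFrame construction.
    return pd.DataFrame([x.split() for x in frame["Names"].tolist()], index=frame.index)


def split_with_str_accessor(frame):
    # Vectorised string accessor.
    return frame["Names"].str.split(expand=True)


for func in (split_with_constructor, split_with_str_accessor):
    # Best of 3 single runs, reported in milliseconds.
    seconds = min(timeit.repeat(lambda: func(big), number=1, repeat=3))
    print(f"{func.__name__}: {seconds * 1000:.1f} ms")
```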

EDIT, following the error reported in a comment:

The problem is with the data: a row contains 3 separators instead of 2, so you need to split into four columns and then delete the temporary column tmp:

print(df)
                Names
0     ALLEN PHILLIP K
1  BADUM JAMES P tttt
2  BANNANTINE JAMES M

df[['lastname', 'firstname', 'middle initial', 'tmp']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
print(df)
                Names    lastname firstname middle initial   tmp
0     ALLEN PHILLIP K       ALLEN   PHILLIP              K  None
1  BADUM JAMES P tttt       BADUM     JAMES              P  tttt
2  BANNANTINE JAMES M  BANNANTINE     JAMES              M  None

# if you want to delete the original column
df = df.drop(['Names', 'tmp'], axis=1)
print(df)
     lastname firstname middle initial
0       ALLEN   PHILLIP              K
1       BADUM     JAMES              P
2  BANNANTINE     JAMES              M
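
If the number of extra tokens is not known in advance, a fixed tmp column may not be enough. One possible alternative, sketched here under the assumption that only the first three tokens matter, is to split with str.split and keep the first three columns with reindex; the variable `parts` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P tttt", "BANNANTINE JAMES M"]}
)

# One column per token; the row with four tokens makes this four columns wide.
parts = df["Names"].str.split(expand=True)

# Keep exactly columns 0, 1, 2. reindex drops any surplus columns and
# would add empty ones if every row had fewer than three tokens.
parts = parts.reindex(columns=list(range(3)))

df[["lastname", "firstname", "middle initial"]] = parts

print(df)
```

str.split also accepts an n argument that caps the number of splits; with n=2 the surplus text stays in the last column ("P tttt") rather than being dropped, so which variant fits depends on what the extra tokens mean.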
jezrael answered Nov 09 '22 06:11