I am splitting "full name" fields into "first name", "middle name" and "last name" fields, using data from an Excel file. I couldn't figure out how to do that in pandas, so I turned to openpyxl. I got the variables split as I desired. But since adding columns for the new fields is not easy in openpyxl, I thought I would pass the values to pandas.
I'm generating the dataframe that I need when I run the code, but once I send the df to ExcelWriter, only the last row is added to the Excel file. The data is in the right places, though.
Here's the code:
for cellObj in range(2, sheet.max_row+1):
    #print cellObj
    id = sheet['A' + str(cellObj)].value
    fullname = sheet['B' + str(cellObj)].value.strip()
    namelist = fullname.split(' ')
    for i in namelist:
        firstname = namelist[0]
        if len(namelist) == 2:
            lastname = namelist[1]
            middlename = ''
        elif len(namelist) == 3:
            middlename = namelist[1]
            lastname = namelist[2]
        elif len(namelist) == 4:
            middlename = namelist[1]
            lastname = namelist[2] + " " + namelist[3]
        if (namelist[1] == 'Del') | (namelist[1] == 'El') | (namelist[1] == 'Van'):
            middlename = ''
            lastname = namelist[1] + " " + namelist[2]
    df = pd.DataFrame({'personID':id,'lastName':lastname,'firstName':firstname,'middleName':middlename}, index=[id])
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer,'Sheet1', columns=['ID','lastName','firstName','middleName'])
writer.save()
Any ideas?
Thanks
A couple of things. First, your code is only ever going to get you one line, because you overwrite the values every time it passes an if test. For example,
if len(namelist) == 2:
    lastname = namelist[1]
This assigns a string to the variable lastname. You are not appending to a list, you are just assigning a string. Then when you make your dataframe,
df = pd.DataFrame({'personID':id,'lastName':lastname,...
you're using this value, so the dataframe will only ever hold that string. Make sense? If you must do this using openpyxl, try something like:
lastname = [] #create an empty list (once, before the loop)
if len(namelist) == 2:
    lastname.append(namelist[1]) #add the name to the list
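To spell out the list idea a bit more, here is a minimal sketch of the "collect in the loop, build the DataFrame once afterwards" pattern. The helper names collect_rows and write_rows are made up for illustration, and the split is deliberately simplified (everything between the first and last word counts as the middle name), so the prefix rules from the question are not handled here.

```python
def collect_rows(people):
    """Build one dict per person; `people` yields (id, full name) pairs."""
    rows = []  # created once, before the loop, so nothing gets overwritten
    for person_id, fullname in people:
        parts = fullname.strip().split()
        if not parts:
            continue  # skip blank names
        rows.append({
            "personID": person_id,
            "firstName": parts[0],
            # Simplification: everything between first and last is "middle".
            "middleName": " ".join(parts[1:-1]),
            "lastName": parts[-1] if len(parts) > 1 else "",
        })
    return rows


def write_rows(rows, path="output.xlsx"):
    """Create the DataFrame and the Excel file once, after the loop."""
    import pandas as pd  # imported here so collect_rows works without pandas

    df = pd.DataFrame(rows, columns=["personID", "lastName", "firstName", "middleName"])
    df.to_excel(path, sheet_name="Sheet1", index=False)
```

With openpyxl you would feed collect_rows the (id, name) pairs read from columns A and B, then call write_rows(rows) a single time at the end.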
However, I think your life will ultimately be much easier if you just figure out how to do this with pandas. It is in fact quite easy. Try something like this:
import pandas as pd
#read excel
df = pd.read_excel('myInputFilename.xlsx')
#write to excel
df.to_excel('MyOutputFile.xlsx')
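The read and write calls above don't show the split itself, so here is a rough sketch of how it might look in pandas. It assumes the names live in a column called fullName (a guess, adjust to your sheet); split_name and add_name_columns are hypothetical helpers, and the prefix list and the four-word rule simply mirror what is used elsewhere in this thread.

```python
PARTICLES = ("Del", "El", "Van", "Da")  # surname prefixes taken from this thread


def split_name(fullname):
    """Return (first, middle, last) for one full-name string."""
    if not isinstance(fullname, str):
        return ("", "", "")  # empty cells arrive as NaN/None
    parts = fullname.strip().split()
    if not parts:
        return ("", "", "")
    first = parts[0]
    last = parts[-1] if len(parts) > 1 else ""
    middle = ""
    if len(parts) >= 3:
        middle = parts[1]
    if len(parts) == 4:
        last = " ".join(parts[-2:])  # treat the last two words as the surname
    if middle in PARTICLES:
        last = middle + " " + last  # e.g. "Del Rio"
        middle = ""
    # Note: with five or more words, the words in between are dropped.
    return (first, middle, last)


def add_name_columns(df, source="fullName"):
    """Return a copy of df with firstName, middleName and lastName added."""
    parts = [split_name(value) for value in df[source]]
    out = df.copy()
    out["firstName"] = [p[0] for p in parts]
    out["middleName"] = [p[1] for p in parts]
    out["lastName"] = [p[2] for p in parts]
    return out
```

Usage would then be something like df = add_name_columns(pd.read_excel('myInputFilename.xlsx')) followed by df.to_excel('MyOutputFile.xlsx', index=False).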
FWIW, openpyxl 2.4 makes it pretty easy to convert all or part of an Excel sheet to a pandas DataFrame: ws.values is an iterator over all the values in the sheet. It also has a new ws.iter_cols() method that lets you work directly with columns.
It's currently (April 2016) available as an alpha version and can be installed using pip install -U --pre openpyxl
The code would then look a bit like this:
sheet["B1"] = "firstName"
sheet["C1"] = "middleName"
sheet["D1"] = "lastName"

for row in sheet.iter_rows(min_row=2, max_col=2):
    id_cell, name = row
    fullname = name.value.strip()
    namelist = fullname.split()
    firstname = namelist[0]
    lastname = namelist[-1]
    middlename = ""
    if len(namelist) >= 3:
        middlename = namelist[1]
    if len(namelist) == 4:
        lastname = " ".join(namelist[-2:])
    if middlename in ('Del', 'El', 'Van', 'Da'):
        lastname = " ".join([middlename, lastname])
        middlename = None
    name.value = firstname
    name.offset(column=1).value = middlename
    name.offset(column=2).value = lastname

wb.save("output.xlsx")
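If you would rather hand the sheet over to pandas, the ws.values iterator mentioned above yields one tuple per row, which is easy to reshape. A small sketch, assuming the first row holds the column headers; values_to_records is a made-up helper name, not part of openpyxl:

```python
def values_to_records(values):
    """Turn row tuples (header row first) into a list of dicts."""
    rows = iter(values)
    header = next(rows, None)
    if header is None:
        return []  # empty sheet
    return [dict(zip(header, row)) for row in rows]


# Hypothetical usage, assuming openpyxl >= 2.4 and pandas are installed:
#   import openpyxl
#   import pandas as pd
#   ws = openpyxl.load_workbook("myInputFilename.xlsx").active
#   df = pd.DataFrame(values_to_records(ws.values))
```

The helper works on any iterable of tuples, so it can be tried out without a workbook.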