
pass openpyxl data to pandas

I am splitting "full name" fields into "first name", "middle name" and "last name" fields, using data from an Excel file. I couldn't figure out how to do that in pandas, so I turned to openpyxl. I got the variables split as I wanted. But since adding columns for the new fields is not easy in openpyxl, I thought I would pass the values to pandas.

I'm generating the dataframe that I need when I run the code, but once I send the df to ExcelWriter, only the last row is added to the Excel file. The data is in the right places, though.

Here's the code:

for cellObj in range(2, sheet.max_row+1):
    #print cellObj
    id = sheet['A' + str(cellObj)].value
    fullname = sheet['B' + str(cellObj)].value.strip()
    namelist = fullname.split(' ')  
    for i in namelist:
        firstname = namelist[0]
        if len(namelist) == 2:
            lastname = namelist[1]
            middlename = ''
        elif len(namelist) == 3:
            middlename = namelist[1]
            lastname = namelist[2]
        elif len(namelist) == 4:
            middlename = namelist[1]
            lastname = namelist[2] + " " + namelist[3]
        if (namelist[1] == 'Del') | (namelist[1] == 'El') | (namelist[1] == 'Van'):
            middlename = ''
            lastname = namelist[1] + " " + namelist[2]
    df = pd.DataFrame({'personID':id,'lastName':lastname,'firstName':firstname,'middleName':middlename}, index=[id])

    writer = pd.ExcelWriter('output.xlsx')
    df.to_excel(writer,'Sheet1', columns=['ID','lastName','firstName','middleName'])
    writer.save()

Any ideas?

Thanks

mattrweaver asked Mar 12 '23 14:03


2 Answers

A couple of things. First, your code is only ever going to get you one row, because you overwrite the values every time the loop passes an if test. For example,

    if len(namelist) == 2:
        lastname = namelist[1]

This assigns a string to the variable lastname. You are not appending to a list, you are just assigning a string. Then, when you build your DataFrame with df = pd.DataFrame({'personID':id,'lastName':lastname,... you're using that single value, so the DataFrame only ever holds that one string. The DataFrame and output.xlsx are also recreated on every pass through the loop, so each row replaces the one before it and only the last row survives. Make sense? If you must do this using openpyxl, try something like:

lastname = [] #create an empty list
if len(namelist) == 2:
    lastname.append(namelist[1]) #add the name to the list
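
To make that a bit more concrete, here is a minimal sketch of the "collect first, write once" idea. The split_name and build_frame helpers and the sample data are made up for illustration (they are not from the question), and the splitting rules are a simplified version of the ones in the question: a 'Del', 'El' or 'Van' in second position is treated as part of the last name.

```python
import pandas as pd

PARTICLES = {"Del", "El", "Van"}  # same prefixes as in the question


def split_name(fullname):
    """Return (first, middle, last) for a space-separated full name."""
    parts = fullname.strip().split()
    first, rest = parts[0], parts[1:]
    if len(rest) <= 1 or rest[0] in PARTICLES:
        # no middle name, or the particle belongs to the last name
        return first, "", " ".join(rest)
    return first, rest[0], " ".join(rest[1:])


def build_frame(id_name_pairs):
    """Collect one dict per person, then build the DataFrame a single time."""
    rows = []
    for person_id, fullname in id_name_pairs:
        first, middle, last = split_name(fullname)
        rows.append({"personID": person_id, "lastName": last,
                     "firstName": first, "middleName": middle})
    return pd.DataFrame(
        rows, columns=["personID", "lastName", "firstName", "middleName"]
    )


# Hypothetical sample standing in for the (id, full name) cells read via openpyxl
sample = [(1, "John Smith"), (2, "Mary Ann Jones"), (3, "Ana Del Rio")]
df = build_frame(sample)

# Write once, after the loop has finished (left commented out here):
# df.to_excel("output.xlsx", sheet_name="Sheet1", index=False)
```

The important part is that the DataFrame is built and written after the loop, not inside it. If you prefer to keep pd.ExcelWriter, note that recent pandas versions expect it to be used as a context manager (with pd.ExcelWriter(...) as writer:) rather than calling writer.save().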

However, I think your life will ultimately be much easier if you just figure out how to do this with pandas. It is in fact quite easy. Try something like this:

import pandas as pd
#read excel
df = pd.read_excel('myInputFilename.xlsx')
#write to excel
df.to_excel('MyOutputFile.xlsx')
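
For the name splitting itself, something along these lines might work in pandas. This is only a sketch: the column name fullName, the split_name helper and the sample rows are assumptions of mine, and in practice the DataFrame would come from pd.read_excel as above rather than being typed in.

```python
import pandas as pd

PARTICLES = {"Del", "El", "Van"}  # prefixes taken from the question


def split_name(fullname):
    """Split a full name into (first, middle, last)."""
    parts = str(fullname).split()
    first = parts[0] if parts else ""
    rest = parts[1:]
    if len(rest) <= 1 or rest[0] in PARTICLES:
        # no middle name, or the particle is part of the last name
        return first, "", " ".join(rest)
    return first, rest[0], " ".join(rest[1:])


# Stand-in for: df = pd.read_excel('myInputFilename.xlsx')
df = pd.DataFrame({
    "personID": [1, 2, 3, 4],
    "fullName": ["John Smith", " Mary Ann Jones ", "Ana Del Rio",
                 "Juan Carlos Perez Lopez"],
})

# Apply the helper to every name and spread the tuples over three new columns
split = pd.DataFrame(
    df["fullName"].map(split_name).tolist(),
    columns=["firstName", "middleName", "lastName"],
    index=df.index,
)
result = df.join(split)
```

From there, result.to_excel('MyOutputFile.xlsx', index=False) would write every row in one go.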

Sam answered Mar 16 '23 02:03


FWIW openpyxl 2.4 makes it pretty easy to convert all or part of an Excel sheet to a pandas DataFrame: ws.values is an iterator over all the values in the sheet. It also has a new ws.iter_cols() method that lets you work directly with columns.

It's currently (April 2016) available as an alpha version and can be installed using pip install -U --pre openpyxl

The code would then look a bit like this:

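from openpyxl import load_workbook

wb = load_workbook("input.xlsx")  # assumed input file name; not given in the question
sheet = wb.active

# column B is reused for the first name; C and D are new
sheet["B1"] = "firstName"
sheet["C1"] = "middleName"
sheet["D1"] = "lastName"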

for row in sheet.iter_rows(min_row=2, max_col=2):
    id_cell, name = row

    fullname = name.value.strip()
    namelist = fullname.split()
    firstname = namelist[0]
    lastname = namelist[-1]
    middlename = ""
    if len(namelist) >= 3:
        middlename = namelist[1]
    if len(namelist) == 4:
        lastname = " ".join(namelist[-2:])
    if middlename in ('Del', 'El', 'Van', 'Da'):
        lastname = " ".join([middlename, lastname])
        middlename = None

    name.value = firstname
    name.offset(column=1).value = middlename
    name.offset(column=2).value = lastname

wb.save("output.xlsx")
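
And if you do want the data in pandas, the ws.values route mentioned at the top could look roughly like this. The workbook here is built in memory with made-up rows purely so the sketch is self-contained; with a real file you would get ws from load_workbook instead.

```python
from openpyxl import Workbook
import pandas as pd

# Hypothetical in-memory sheet standing in for the real Excel file
wb = Workbook()
ws = wb.active
ws.append(["personID", "fullName"])
ws.append([1, "John Smith"])
ws.append([2, "Ana Del Rio"])

rows = ws.values             # generator yielding one tuple of cell values per row
header = list(next(rows))    # the first row holds the column names
df = pd.DataFrame(list(rows), columns=header)
```

This gives you an ordinary DataFrame with one row per sheet row, which you can then transform and write back with to_excel.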

Charlie Clark answered Mar 16 '23 03:03