Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clean up excel data with python using pandas package

Tags:

I have read an xls file into Python with pandas using pd.read_excel

I am trying to cleanup my data but I'm way out of my league.

There is a blank line between every record. In the example pic it's excel row 4, 9 and 11.

There is a comments column, designated in the example (see link) as "col_F". Every record has at least one cell that contains text. The person that created this xls file split longer comments into multiple cells.

I would like to concatenate all of the data in col_F for a particular record into one cell.

I will also trim out blank records once I figure out how to properly concatenate col_F.

I am using Python version 3.5.0, numpy 1.12.0 and pandas 0.19.2

Here is what I have so far:

import numpy as np
import pandas as pd

data = pd.read_excel("C:/blah/blahblah/file.xls", header=0, nrows=10000)
df = pd.DataFrame(data)

I appreciate any suggestion or insight!!

Thanks!

How the raw data looks: enter image description here

Update: This is how my data looks when I load into Python raw data read into python

This is what my desired outcome would look like: what I would like dataframe to look like when finished

like image 280
Tommy Avatar asked Feb 14 '17 19:02

Tommy


1 Answers

Figured it out!!! Big shout out to the Norfolk Data Science Club for the helping hand.

Import pandas, sqlite and os packages

import pandas as pd
import sqlite3
import os

Specify filepaths and filenames

filepath = "C:/blah/blahblah/randomfolder"
filename = "raw_data.xlsx"
filePathFileName = filepath + "/" + filename
outputPathFileName = filepath + "/cleaned_" + filename
outputFileName = "cleaned_" + filename

Use pandas to read in excel file.

df = pd.read_excel(filePathFileName, header=0, nrows=14)

remove blank rows

df.dropna(how='all', inplace=True)

fill in the gaps in our data

df.ffill(inplace=True)

create a sqlite database and a connection to the sqlite database

con = sqlite3.connect(":memory:")
con.isolation_level = None
cur = con.cursor()

create a table for our data in sqlite

df.to_sql('example_data', con)

SQL query to aggregate our data

df_pdsql = pd.read_sql_query("SELECT col_A, col_B, col_C, col_D, col_E, GROUP_CONCAT(col_F, ' ') AS col_F FROM example_data GROUP BY col_A", con)

write our df to xlsx file

df_pdsql.to_excel(outputPathFileName, sheet_name='test', index=False)

let user know where file is located

print("Your new file is located in: " + outputPathFileName)

close connection to sqlite database

con.close()
like image 152
Tommy Avatar answered Sep 22 '22 10:09

Tommy