I have read an xls file into Python with pandas using pd.read_excel
I am trying to cleanup my data but I'm way out of my league.
There is a blank line between every record. In the example pic it's excel row 4, 9 and 11.
There is a comments column, designated in the example (see link) as "col_F". Every record has at least one cell that contains text. The person that created this xls file split longer comments into multiple cells.
I would like to concatenate all of the data in col_F for a particular record into one cell.
I will also trim out blank records once I figure out how to properly concatenate col_F.
I am using Python version 3.5.0, numpy 1.12.0 and pandas 0.19.2
Here is what I have so far:
import numpy as np
import pandas as pd
data = pd.read_excel("C:/blah/blahblah/file.xls", header=0, nrows=10000)
df = pd.DataFrame(data)
I appreciate any suggestion or insight!!
Thanks!
How the raw data looks:
Update:
This is how my data looks when I load into Python
This is what my desired outcome would look like:
Figured it out!!! Big shout out to the Norfolk Data Science Club for the helping hand.
Import pandas, sqlite and os packages
import pandas as pd
import sqlite3
import os
Specify filepaths and filenames
filepath = "C:/blah/blahblah/randomfolder"
filename = "raw_data.xlsx"
filePathFileName = filepath + "/" + filename
outputPathFileName = filepath + "/cleaned_" + filename
outputFileName = "cleaned_" + filename
Use pandas to read in excel file.
df = pd.read_excel(filePathFileName, header=0, nrows=14)
remove blank rows
df.dropna(how='all', inplace=True)
fill in the gaps in our data
df.ffill(inplace=True)
create a sqlite database and a connection to the sqlite database
con = sqlite3.connect(":memory:")
con.isolation_level = None
cur = con.cursor()
create a table for our data in sqlite
df.to_sql('example_data', con)
SQL query to aggregate our data
df_pdsql = pd.read_sql_query("SELECT col_A, col_B, col_C, col_D, col_E, GROUP_CONCAT(col_F, ' ') AS col_F FROM example_data GROUP BY col_A", con)
write our df to xlsx file
df_pdsql.to_excel(outputPathFileName, sheet_name='test', index=False)
let user know where file is located
print("Your new file is located in: " + outputPathFileName)
close connection to sqlite database
con.close()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With