I have a single .csv
file containing multiple tables.
Using Pandas, what would be the best strategy to get two DataFrame inventory
and HPBladeSystemRack
from this one file ?
The input .csv
looks like this:
Inventory
System Name IP Address System Status
dg-enc05 Normal
dg-enc05_vc_domain Unknown
dg-enc05-oa1 172.20.0.213 Normal
HP BladeSystem Rack
System Name Rack Name Enclosure Name
dg-enc05 BU40
dg-enc05-oa1 BU40 dg-enc05
dg-enc05-oa2 BU40 dg-enc05
The best I've come up with so far is to convert this .csv
file into Excel workbook (xlxs
), split the tables into sheets and use:
inventory = read_excel('path_to_file.csv', 'sheet1', skiprow=1)
HPBladeSystemRack = read_excel('path_to_file.csv', 'sheet2', skiprow=2)
However:
xlrd
module.There's technically nothing stopping you from having more than one table, using a format you decide on and implement and keep consistent. For instance, you could parse the file yourself and use a line with 5 hyphens to designate a separate table.
You can do this by reading each CSV file into DataFrame and appending or concatenating the DataFrames to create a single DataFrame with data from all files. Here, I will use read_csv() to read CSV files and concat() function to concatenate DataFrams together to create one big DataFrame.
If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.
Pandas doesn't seem to be ready to do this easily, so I ended up doing my own split_csv
function. It only requires table names and will output .csv
files named after each table.
import csv
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
def split_csv(csv_path, table_names):
tables_infos = detect_tables_from_csv(csv_path, table_names)
for table_info in tables_infos:
split_csv_by_indexes(csv_path, table_info)
def split_csv_by_indexes(csv_path, table_info):
title, start_index, end_index = table_info
print title, start_index, end_index
dir_ = dirname(csv_path)
output_path = join(dir_, title) + ".csv"
with open(output_path, 'w') as output_file, open(csv_path, 'rb') as input_file:
writer = csv.writer(output_file)
reader = csv.reader(input_file)
for i, line in enumerate(reader):
if i < start_index:
continue
if i > end_index:
break
writer.writerow(line)
def detect_tables_from_csv(csv_path, table_names):
output = []
with open(csv_path, 'rb') as csv_file:
reader = csv.reader(csv_file)
for idx, row in enumerate(reader):
for col in row:
match = [title for title in table_names if title in col]
if match:
match = match[0] # get the first matching element
try:
end_index = idx - 1
start_index
except NameError:
start_index = 0
else:
output.append((previous_match, start_index, end_index))
print "Found new table", col
start_index = idx
previous_match = match
match = False
end_index = idx # last 'end_index' set to EOF
output.append((previous_match, start_index, end_index))
return output
if __name__ == '__main__':
csv_path = 'switch_records.csv'
try:
split_csv(csv_path, table_names)
except IOError as e:
print "This file doesn't exist. Aborting."
print e
exit(1)
If you know the table names beforehand, then something like this:
df = pd.read_csv("jahmyst2.csv", header=None, names=range(3))
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
should work to produce a dictionary with keys as the table names and values as the subtables.
>>> list(tables)
['HP BladeSystem Rack', 'Inventory']
>>> for k,v in tables.items():
... print("table:", k)
... print(v)
... print()
...
table: HP BladeSystem Rack
0 1 2
6 System Name Rack Name Enclosure Name
7 dg-enc05 BU40 NaN
8 dg-enc05-oa1 BU40 dg-enc05
9 dg-enc05-oa2 BU40 dg-enc05
table: Inventory
0 1 2
1 System Name IP Address System Status
2 dg-enc05 NaN Normal
3 dg-enc05_vc_domain NaN Unknown
4 dg-enc05-oa1 172.20.0.213 Normal
Once you've got that, you can set the column names to the first rows, etc.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With