I am working with a csv file with 3 columns that looks like this:
timeStamp, value, label
15:22:57, 849, CPU pid=26298:percent
15:22:57, 461000, JMX MB
15:22:58, 28683, Disks I/O
15:22:58, 3369078, Memory pid=26298:unit=mb:resident
15:22:58, 0, JMX 31690:gc-time
15:22:58, 0, CPU pid=26298:percent
15:22:58, 503000, JMX MB
The label column contains distinct values (say a total of 5), which include spaces, colons and other special characters.
What I am trying to achieve is to plot time against each metric (either on the same plot or on separate ones). I can do this with matplotlib, but I first need to group the [timeStamp, value] pairs according to the 'label'.
I looked into csv.DictReader to get the labels and itertools.groupby to group by the 'label', but I am struggling to do this in a proper 'pythonic' way.
Any suggestion?
You don't need groupby (itertools.groupby only groups consecutive rows, so you would have to sort by label first); you want to use collections.defaultdict to collect series of [timestamp, value] pairs keyed by label:
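from collections import defaultdict
import csv

per_label = defaultdict(list)

# Python 3: open in text mode with newline='' (on Python 2, use mode 'rb' instead)
with open(inputfilename, newline='') as inputfile:
    reader = csv.reader(inputfile)
    next(reader, None)  # skip the header row
    for timestamp, value, label in reader:
        # strip the padding around each field and store the value as a float
        per_label[label.strip()].append([timestamp.strip(), float(value)])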
Now per_label is a dictionary with labels as keys, and a list of [timestamp, value] pairs as values; I've stripped off whitespace (your input sample has a lot of extra whitespace) and turned the value column into floats.
For your (limited) input sample that results in:
{'CPU pid=26298:percent': [['15:22:57', 849.0], ['15:22:58', 0.0]],
'Disks I/O': [['15:22:58', 28683.0]],
'JMX 31690:gc-time': [['15:22:58', 0.0]],
'JMX MB': [['15:22:57', 461000.0], ['15:22:58', 503000.0]],
'Memory pid=26298:unit=mb:resident': [['15:22:58', 3369078.0]]}
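To get from per_label to something matplotlib can draw, each list of pairs still has to be split into an x list and a y list. Here is a rough sketch of that step, not a definitive implementation. It assumes the timestamps are always HH:MM:SS; the helper names group_by_label, to_series and plot_series are made up for illustration, and the sample rows from the question are inlined so the snippet runs on its own.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Sample rows copied from the question; a real script would open the CSV file instead.
SAMPLE = """timeStamp, value, label
15:22:57, 849, CPU pid=26298:percent
15:22:57, 461000, JMX MB
15:22:58, 28683, Disks I/O
15:22:58, 3369078, Memory pid=26298:unit=mb:resident
15:22:58, 0, JMX 31690:gc-time
15:22:58, 0, CPU pid=26298:percent
15:22:58, 503000, JMX MB
"""


def group_by_label(lines):
    """Return {label: [[timestamp, value], ...]}, same shape as per_label above."""
    per_label = defaultdict(list)
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row
    for timestamp, value, label in reader:
        per_label[label.strip()].append([timestamp.strip(), float(value)])
    return per_label


def to_series(per_label):
    """Split each list of pairs into parallel (times, values) lists."""
    series = {}
    for label, pairs in per_label.items():
        # Assumption: timestamps are HH:MM:SS; strptime fills in a placeholder date.
        times = [datetime.strptime(ts, "%H:%M:%S") for ts, _ in pairs]
        values = [v for _, v in pairs]
        series[label] = (times, values)
    return series


def plot_series(series):
    """Draw one subplot per label (hypothetical helper; needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily so the grouping works without it

    fig, axes = plt.subplots(len(series), 1, sharex=True, squeeze=False)
    for ax, (label, (times, values)) in zip(axes[:, 0], sorted(series.items())):
        ax.plot(times, values, marker="o")
        ax.set_title(label)
    fig.autofmt_xdate()
    plt.show()


series = to_series(group_by_label(io.StringIO(SAMPLE)))
```

Calling plot_series(series) should then give one panel per metric. Separate panels seem the safer default here because the metrics differ by several orders of magnitude, but plotting everything on one axes with a legend works the same way.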
You can try pandas, which provides a nice structure for dealing with data.
Read the CSV into a DataFrame:
In [123]: import pandas as pd
In [124]: df = pd.read_csv('test.csv', skipinitialspace=True)
In [125]: df
Out[125]:
   timeStamp    value                              label
0   15:22:57      849              CPU pid=26298:percent
1   15:22:57   461000                             JMX MB
2   15:22:58    28683                          Disks I/O
3   15:22:58  3369078  Memory pid=26298:unit=mb:resident
4   15:22:58        0                  JMX 31690:gc-time
5   15:22:58        0              CPU pid=26298:percent
6   15:22:58   503000                             JMX MB
Group the DataFrame by label:
In [154]: g = df.groupby('label')
Now you can pull out the rows for any one label:
In [155]: g.get_group('JMX MB')
Out[155]:
   timeStamp   value   label
1   15:22:57  461000  JMX MB
6   15:22:58  503000  JMX MB
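If you want every label at once rather than one get_group call per label, you can iterate over the groupby object or pivot the frame so that each label becomes its own column. This is only a sketch under some assumptions: the sample rows are inlined instead of read from test.csv, the timestamps are assumed to be HH:MM:SS, and the names sizes and wide are mine.

```python
import io

import pandas as pd

# Sample rows copied from the question, inlined so the snippet is self-contained.
SAMPLE = """timeStamp, value, label
15:22:57, 849, CPU pid=26298:percent
15:22:57, 461000, JMX MB
15:22:58, 28683, Disks I/O
15:22:58, 3369078, Memory pid=26298:unit=mb:resident
15:22:58, 0, JMX 31690:gc-time
15:22:58, 0, CPU pid=26298:percent
15:22:58, 503000, JMX MB
"""

df = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True)

# Parse the time-of-day strings; pandas fills in a placeholder date.
df["timeStamp"] = pd.to_datetime(df["timeStamp"], format="%H:%M:%S")

# Iterating over a groupby yields (label, sub-DataFrame) pairs.
sizes = {}
for label, group in df.groupby("label"):
    sizes[label] = len(group)

# One column per label, indexed by time; labels with no reading at a given
# timestamp get NaN. Duplicate (time, label) pairs would be averaged.
wide = df.pivot_table(index="timeStamp", columns="label", values="value")
```

With matplotlib installed, wide.plot(subplots=True) should then draw one panel per label.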