Python - reading a csv and grouping data by a column

Tags: python, csv

I am working with a csv file with 3 columns that looks like this:

timeStamp, value, label
15:22:57, 849, CPU pid=26298:percent
15:22:57, 461000, JMX MB
15:22:58, 28683, Disks I/O
15:22:58, 3369078, Memory pid=26298:unit=mb:resident
15:22:58, 0, JMX 31690:gc-time
15:22:58, 0, CPU pid=26298:percent
15:22:58, 503000, JMX MB

The label column contains distinct values (say a total of 5), which include spaces, colons and other special characters.

What I am trying to achieve is to plot time against each metric (either on the same plot or on separate ones). I can do this with matplotlib, but I first need to group the [timeStamps, value] pairs according to the 'label'.

I looked into the csv.DictReader to get the labels and the itertools.groupby to group by the 'label', but I am struggling to do this in a proper 'pythonic' way.

Any suggestion?

323

asked Apr 25 '13 09:04

Argyrios Tzakas

2 Answers

You don't need groupby; you want to use collections.defaultdict to collect series of [timestamp, value] pairs keyed by label:

from collections import defaultdict
import csv

per_label = defaultdict(list)

with open(inputfilename, newline='') as inputfile:  # Python 3 text mode; on Python 2 open with 'rb' instead
    reader = csv.reader(inputfile)
    next(reader, None)  # skip the header row

    for timestamp, value, label in reader:
        per_label[label.strip()].append([timestamp.strip(), float(value)])

Now per_label is a dictionary with labels as keys, and a list of [timestamp, value] pairs as values; I've stripped off whitespace (your input sample has a lot of extra whitespace) and turned the value column into floats.

For your (limited) input sample that results in:

{'CPU pid=26298:percent': [['15:22:57', 849.0], ['15:22:58', 0.0]],
 'Disks I/O': [['15:22:58', 28683.0]],
 'JMX 31690:gc-time': [['15:22:58', 0.0]],
 'JMX MB': [['15:22:57', 461000.0], ['15:22:58', 503000.0]],
 'Memory pid=26298:unit=mb:resident': [['15:22:58', 3369078.0]]}
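Since the goal is plotting, each list of pairs can be split into parallel x and y sequences. The following is only a sketch: the helper name `to_series` is made up for illustration, the sample dictionary is copied from the output above, and it assumes the timestamps are always in `HH:MM:SS` form.

```python
from datetime import datetime

# Sample data in the same shape as per_label (copied from the output above).
per_label = {
    'CPU pid=26298:percent': [['15:22:57', 849.0], ['15:22:58', 0.0]],
    'JMX MB': [['15:22:57', 461000.0], ['15:22:58', 503000.0]],
}


def to_series(pairs):
    """Split [timestamp, value] pairs into parallel lists of times and values.

    Hypothetical helper; assumes timestamps look like '15:22:57'.
    """
    times = [datetime.strptime(ts, '%H:%M:%S') for ts, _ in pairs]
    values = [value for _, value in pairs]
    return times, values


# One (times, values) tuple per label, ready to hand to a plotting call.
series = {label: to_series(pairs) for label, pairs in per_label.items()}

for label, (times, values) in series.items():
    # With matplotlib this would typically be: plt.plot(times, values, label=label)
    print(label, [t.strftime('%H:%M:%S') for t in times], values)
```

Parsing the timestamps into `datetime` objects rather than leaving them as strings lets a plotting library space the points by actual time instead of treating them as categories.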

112

answered Nov 12 '22 02:11

Martijn Pieters

You can try pandas, which provides convenient structures for working with data.

Read the CSV into a DataFrame:

In [123]: import pandas as pd

In [124]: df = pd.read_csv('test.csv', skipinitialspace=True)

In [125]: df
Out[125]: 
  timeStamp    value                              label
0  15:22:57      849              CPU pid=26298:percent
1  15:22:57   461000                             JMX MB
2  15:22:58    28683                          Disks I/O 
3  15:22:58  3369078  Memory pid=26298:unit=mb:resident
4  15:22:58        0                  JMX 31690:gc-time
5  15:22:58        0              CPU pid=26298:percent
6  15:22:58   503000                             JMX MB

Group the DataFrame by label

In [154]: g = df.groupby('label')

Now you can pull out the rows for any label:

In [155]: g.get_group('JMX MB')
Out[155]:
  timeStamp   value   label
1  15:22:57  461000  JMX MB
6  15:22:58  503000  JMX MB
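To visit every label rather than fetching one group at a time, you can iterate over the grouped object, which yields `(label, sub-DataFrame)` pairs. A minimal sketch, assuming the sample data from the question is supplied inline through `io.StringIO` (a stand-in for `test.csv`), and using `per_label` as an illustrative name only:

```python
import io

import pandas as pd

# Inline copy of the question's sample data, standing in for 'test.csv'.
csv_text = """timeStamp, value, label
15:22:57, 849, CPU pid=26298:percent
15:22:57, 461000, JMX MB
15:22:58, 28683, Disks I/O
15:22:58, 3369078, Memory pid=26298:unit=mb:resident
15:22:58, 0, JMX 31690:gc-time
15:22:58, 0, CPU pid=26298:percent
15:22:58, 503000, JMX MB
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Iterating over a GroupBy yields (group key, sub-DataFrame) pairs.
per_label = {
    label: list(zip(group['timeStamp'].tolist(), group['value'].tolist()))
    for label, group in df.groupby('label')
}

for label, pairs in per_label.items():
    print(label, pairs)
```

Each sub-DataFrame keeps its original columns, so `group['timeStamp']` and `group['value']` can also be passed straight to a plotting call inside the loop.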

answered Nov 12 '22 00:11

waitingkuo
