I have a CSV file with headers at the top of columns of data as:
a,b,c
1,2,3
4,5,6
7,8,9
and I need to read it in a dict of lists:
desired_result = {'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]}
When reading this with DictReader, I am using a nested loop to append the items to the lists:
import csv

f = 'path_to_some_csv_file.csv'
dr = csv.DictReader(open(f))
dict_of_lists = next(dr)  # first data row (was dr.next(), which is Python 2 only)
for k in dict_of_lists.keys():
    dict_of_lists[k] = [dict_of_lists[k]]  # wrap each first-row value in a list
for line in dr:
    for k in dict_of_lists.keys():
        dict_of_lists[k].append(line[k])
The first loop wraps each value from the first row in a single-element list. The next one loops over every remaining line read in from the CSV file, from which DictReader creates a dict of key-values. The inner loop appends each value to the list matching the corresponding key, so I wind up with the desired dict of lists (though the values are still strings). I end up having to write this fairly often.
My question is: is there a more Pythonic way of doing this using built-in functions without the nested loop, or a better idiom, or an alternative way to store this data structure such that I can return an indexable list by querying with a key? If so, is there also a way to format the data being ingested by column up front?
To do so, we use csv.DictReader(). As the name suggests, it parses each row as a dictionary, using the header row to determine column names. If you don't have a header row, you may specify the fieldnames argument.
A straightforward way to convert a CSV file to Python dictionaries is to create a file object f using open("my_file.csv") and pass it to csv.DictReader(f). The return value is an iterable of dictionaries, one per row in the CSV file, each mapping the column headers from the first row to that row's values.
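As a small illustration of the description above, here is a sketch that reads the sample data from the question. The io.StringIO objects are stand-ins for a real file opened with open(), and the variable names are only illustrative.

```python
import csv
import io

# Stand-in for open("my_file.csv"); contents match the sample in the question.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")

# One dict per data row, keyed by the header row. Values are strings.
rows = list(csv.DictReader(sample))
print(rows[0])

# Without a header row, the column names can be supplied via fieldnames.
headerless = io.StringIO("1,2,3\n4,5,6\n")
named_rows = list(csv.DictReader(headerless, fieldnames=["a", "b", "c"]))
print(named_rows[1]["b"])
```

Note that DictReader does no type conversion, so any int or float conversion has to be applied afterwards.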
csv.reader() allows you to access CSV data using indexes and is ideal for simple CSV files. csv.DictReader(), on the other hand, is friendlier and easier to use, especially when a file has many columns, because values are looked up by header name instead of by position.
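For the dict-of-lists layout asked about in the question, the index-based csv.reader() can also be combined with zip() to transpose rows into columns without an explicit nested loop. This is a sketch under the assumption that every row has the same number of fields and that all values are integers; the names header, columns and by_column are illustrative.

```python
import csv
import io

# Stand-in for an open CSV file with the sample data from the question.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")

reader = csv.reader(sample)
header = next(reader)      # first row: ['a', 'b', 'c']
columns = zip(*reader)     # transpose the remaining rows into columns

# Pair each header with its column and convert the strings to int.
by_column = {name: [int(v) for v in col] for name, col in zip(header, columns)}
print(by_column)
```

Because zip() stops at the shortest row, a ragged file would silently lose trailing fields, so this approach suits well-formed input only.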
CSV, or "comma-separated values", is a common file format for data. The csv module helps you to elegantly process data stored within a CSV file. Also see the csv documentation. This guide uses the following example file, people.
Depending on what type of data you're storing and if you're ok with using numpy, a good way to do this can be with numpy.genfromtxt:
import numpy as np
data = np.genfromtxt('data.csv', delimiter=',', names=True)
What this will do is create a numpy Structured Array, which provides a nice interface for querying the data by header name (make sure to use names=True if you have a header row).
Example, given data.csv containing:
a,b,c
1,2,3
4,5,6
7,8,9
You can then access elements with:
>>> data['a'] # Column with header 'a'
array([ 1., 4., 7.])
>>> data[0] # First row
(1.0, 2.0, 3.0)
>>> data['c'][2] # Specific element
9.0
>>> data[['a', 'c']] # Two columns
array([(1.0, 3.0), (4.0, 6.0), (7.0, 9.0)],
dtype=[('a', '<f8'), ('c', '<f8')])
genfromtxt also provides a way, as you requested, to "format the data being ingested by column up front."
converters : variable, optional
The set of functions that convert the data of a column to a value. The converters can also be used to provide a default value for missing data: converters = {3: lambda s: float(s or 0)}.
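To make the converters option concrete, here is a minimal sketch that uses the same lambda on a copy of the sample data with one value deliberately left blank. The blank field, the io.StringIO stand-in for data.csv, and the choice of keying the converter by the column name "b" are assumptions made for illustration.

```python
import io

import numpy as np

# Sample data with the middle value of column 'b' missing.
text = io.StringIO("a,b,c\n1,2,3\n4,,6\n7,8,9\n")

data = np.genfromtxt(
    text,
    delimiter=",",
    names=True,
    # Convert column 'b' to float, falling back to 0 for an empty field.
    converters={"b": lambda s: float(s or 0)},
)

print(data["b"])
```

The converter is applied per column as the file is read, so the resulting structured array already holds the cleaned values.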
If you're willing to use a third-party library, then the merge_with function from Toolz makes this whole operation a one-liner:
from toolz import merge_with
dict_of_lists = merge_with(list, *csv.DictReader(open(f)))
Using only the stdlib, a defaultdict makes the code less repetitive:
from collections import defaultdict
import csv

f = 'test.csv'
dict_of_lists = defaultdict(list)  # missing keys start out as empty lists
for record in csv.DictReader(open(f)):
    for key, val in record.items():  # or iteritems in Python 2
        dict_of_lists[key].append(val)
If you need to do this often, factor it out into a function, e.g. transpose_csv.
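One possible shape for such a function is sketched below. The signature is hypothetical: the converters parameter (a mapping from column name to a callable) is an addition for illustration, meant to cover the "format by column up front" part of the question, and is not part of the csv module.

```python
import csv
import io
from collections import defaultdict


def transpose_csv(lines, converters=None):
    """Read CSV lines with a header row into a dict of column lists.

    `lines` is any iterable of CSV lines, such as an open file.
    `converters` is a hypothetical optional mapping of column name to a
    callable applied to each raw string value in that column.
    """
    converters = converters or {}
    columns = defaultdict(list)
    for record in csv.DictReader(lines):
        for key, val in record.items():
            # Columns without a converter keep their raw string values.
            convert = converters.get(key, lambda s: s)
            columns[key].append(convert(val))
    return dict(columns)


# Usage with the sample data from the question (StringIO stands in for a file).
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")
result = transpose_csv(sample, converters={"a": int, "b": int, "c": int})
print(result)
```

Returning a plain dict rather than the defaultdict avoids silently creating empty columns when a caller looks up a misspelled key.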