I have a csv file where a line of data might look like this:
10,"Apple, Banana",20,...
When I load the data in Python, the extra comma inside the quotes shifts all my column indices around, so my data is no longer a consistent structure. While I could probably write a complex algorithm that iterates through each row and fixes the issue, I was hoping there was an elegant way to just pass an extra parameter to loadtxt (or some other function) that will properly ignore commas inside quotes and treat the entire quoted string as one value.
Note: when I load the CSV file into Excel, Excel correctly recognizes the string as one value.
One way you could do it with a single NumPy function call would be to use np.fromregex, which allows you to use Python's regular expression syntax to parse the contents of your text file in any arbitrary way. For example:
np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', object)
gives you:
array([['10', 'Apple, Banana', '20'],
['30', 'Orange, Watermelon', '40']], dtype=object)
To unpack that regular expression a bit, (\d+) will match one or more digits, and "(.+)" will match one or more of any character inside double quotes. np.fromregex tries to match this expression within every line in your .csv file, and the parts inside the brackets (the capture groups) are taken as the individual elements in each row of the output array.
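To see the pattern in isolation, here is a minimal sketch using only the standard-library re module; the sample line is taken from the question:

```python
import re

# the same pattern fromregex uses: two integer fields around a quoted field
pattern = r'(\d+),"(.+)",(\d+)'

match = re.match(pattern, '10,"Apple, Banana",20')
print(match.groups())  # ('10', 'Apple, Banana', '20')
```

Each capture group becomes one element of the tuple, which is why the comma inside the quotes causes no trouble here.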
If you want a record array as your output, with different 'fields' for the three 'columns' in your .csv file, you could specify separate dtypes for each set of brackets in the regex:
np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', 'i8, S20, i8')
gives you:
array([(10, 'Apple, Banana', 20), (30, 'Orange, Watermelon', 40)],
dtype=[('f0', '<i8'), ('f1', 'S20'), ('f2', '<i8')])
This issue has been discussed before. There isn't a parameter in loadtxt (or genfromtxt) that does what you want; in other words, neither function is quote sensitive. The Python csv module has some sort of quote awareness. The pandas reader is also quote aware.
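As a quick illustration of that quote awareness, a sketch using only the standard-library csv module (the sample line is from the question):

```python
import csv
import io

line = '10,"Apple, Banana",20\n'
rows = list(csv.reader(io.StringIO(line)))
print(rows)  # [['10', 'Apple, Banana', '20']]
```

The default quotechar is '"', so the comma inside the quoted field does not split the value.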
But processing the lines before passing them to loadtxt is quite acceptable. All the function needs is an iterable, something that can feed it lines one at a time. That can be a file, a list of lines, or a generator.
A simple processor would just replace the commas within quotes with some other character, or replace the ones outside of quotes with a delimiter of your choice. It doesn't have to be fancy to do the job.
For example:
txt = """10,"Apple, Banana",20
30,"Pear, Orange",40
50,"Peach, Mango",60
"""
def foo(astr):
    # replace commas outside quotes with ';'
    # a bit crude and specialized
    x = astr.split('"')
    return ';'.join([i.strip(',') for i in x])
txt1 = [foo(astr) for astr in txt.splitlines()]
txtgen = (foo(astr) for astr in txt.splitlines()) # or as generator
# ['10;Apple, Banana;20', '30;Pear, Orange;40', '50;Peach, Mango;60']
np.genfromtxt(txtgen, delimiter=';', dtype=None)
produces:
array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
(50, 'Peach, Mango', 60)],
dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])
I hadn't paid attention to np.fromregex before. Compared to genfromtxt it is surprisingly simple. To use it with my sample txt I have to use a string buffer:
import io
s = io.StringIO(txt)
np.fromregex(s, r'(\d+),"(.+)",(\d+)', dtype='i4,S20,i4')
Its action distills down to:
pat = re.compile(r'(\d+),"(.+)",(\d+)')
dt = np.dtype('i4,S20,i4')
np.array(pat.findall(txt), dtype=dt)
It reads the whole file (f.read()
) and does a findall
which should produce a list like:
[('10', 'Apple, Banana', '20'),
('30', 'Pear, Orange', '40'),
('50', 'Peach, Mango', '60')]
A list of tuples is exactly what a structured array requires.
No fancy processing, error checks or filtering of comment lines. Just a pattern match followed by array construction.
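That last step can stand on its own: np.array accepts a list of tuples directly when given a structured dtype. A minimal sketch (the field values mirror the sample data above; U20 is used instead of S20 so the strings stay as Python 3 str):

```python
import numpy as np

# a list of tuples, one tuple per row, is what a structured array expects
rows = [(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40)]
arr = np.array(rows, dtype='i4,U20,i4')
print(arr['f1'][0])  # Apple, Banana
```

The auto-generated field names f0, f1, f2 can then index the 'columns' just as in the genfromtxt output shown earlier.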
Both my foo and fromregex assume a specific sequence of numbers and quoted strings. The csv.reader might be the simplest general-purpose quote-aware reader. The join is required because reader produces a list of lists, while genfromtxt wants an iterable of strings (it does its own 'split').
from csv import reader
import io
s = io.StringIO(txt)
np.genfromtxt((';'.join(x) for x in reader(s)), delimiter=';', dtype=None)
producing
array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
(50, 'Peach, Mango', 60)],
dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])
Or, following the fromregex example, the reader output could be turned into a list of tuples and given to np.array directly:
np.array([tuple(x) for x in reader(s)], dtype='i4,S20,i4')
I solved this with the code below.
import numpy

def transformCommas(line):
    # toggle the flag at every double quote; while inside quotes,
    # replace ',' with '.' so loadtxt can split on the remaining commas
    out = ''
    insideQuote = False
    for c in line:
        if c == '"':
            insideQuote = not insideQuote
        if insideQuote and c == ',':
            out += '.'
        else:
            out += c
    return out

f = open("data/raw_data_all.csv", "r")  # text mode, so lines are str not bytes
replaced = (transformCommas(line) for line in f)
rawData = numpy.loadtxt(replaced, delimiter=',', skiprows=0, dtype=str)
Data:
1366x768,18,"5,237",73.38%,"3,843",79.55%,1.75,00:01:26,4.09%,214,$0.00
1366x768,22,"5,088",76.04%,"3,869",78.46%,1.82,00:01:20,3.93%,200,$0.00
1366x768,17,"4,887",74.34%,"3,633",78.37%,1.81,00:01:19,3.25%,159,$0.00