Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

numpy.loadtxt: how to ignore comma delimiters that appear inside quotes?

Tags:

python

csv

numpy

I have a csv file where a line of data might look like this:

10,"Apple, Banana",20,...

When I load the data in Python, the extra comma inside the quotes shifts all my column indices around, so my data is no longer a consistent structure. While I could probably write a complex algorithm that iterates through each row and fixes the issue, I was hoping there was an elegant way to just pass an extra parameter to loadtxt (or some other function) that will properly ignore commas inside quotes and treat the entire quote as one value.

Note, when I load the CSV file into Excel, Excel correctly recognizes the string as one value.

like image 1000
cas5nq Avatar asked Feb 11 '15 00:02

cas5nq


People also ask

What does the Loadtxt () function do in Numpy?

loadtxt() function. The loadtxt() function is used to load data from a text file. Each row in the text file must have the same number of values.

How do I skip rows in NP Loadtxt?

Use help(np. loadtxt) . You'll find the skiprows parameter will allow you to skip the first N rows: In [1]: import numpy as np In [2]: help(np.

What is the default number to skip rows in Numpy Loadtxt?

strip() or 0) will convert empty fields to 0. Default: None. Skip the first skiprows lines, including comments; default: 0.


Video Answer


3 Answers

One way you could do it with a single numpy function call would be to use np.fromregex, which allows you to use Python's regular expression syntax to parse the contents of your text file in any arbitrary way. For example:

np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', np.object)

gives you:

array([['10', 'Apple, Banana', '20'],
       ['30', 'Orange, Watermelon', '40']], dtype=object)

To unpack that regular expression a bit, '(\d+)' will match one or more digits and '"(.+)"' will match one or more of any character inside double quotes. np.fromregex tries to match this expression within every line in your .csv file, and the parts that are inside the brackets are taken as the individual elements in each row of the output array.

If you want a record array as your output with different 'fields' for the three 'columns' in your .csv file, you could specify separate dtypes for each set of brackets in the regex:

np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', 'i8, S20, i8')

gives you:

array([(10, 'Apple, Banana', 20), (30, 'Orange, Watermelon', 40)], 
      dtype=[('f0', '<i8'), ('f1', 'S20'), ('f2', '<i8')])
like image 137
ali_m Avatar answered Oct 13 '22 20:10

ali_m


This issue has been discussed before. There isn't a parameter in loadtxt (or genfromtxt) that does what you want. In other words, it is not quote sensitive. The python csv module has some sort of quote awareness. The pandas reader is also quote aware.

But processing the lines before passing them to loadtxt is quite acceptable. All the function needs is an iterable - something that can feed it lines one at a time. So that can be a file, a list of lines, or generator.

A simple processor would just replace the commas within quotes with some other character. Or replace the ones outside of quotes with a delimiter of your choice. It doesn't have to be fancy to do the job.

Using numpy.genfromtxt to read a csv file with strings containing commas

For example:

txt = """10,"Apple, Banana",20
30,"Pear, Orange",40
50,"Peach, Mango",60
"""

def foo(astr):
    # replace , outside quotes with ;
    # a bit crude and specialized
    x = astr.split('"')
    return ';'.join([i.strip(',') for i in x]) 

txt1 = [foo(astr) for astr in txt.splitlines()]
txtgen = (foo(astr) for astr in txt.splitlines())  # or as generator
# ['10;Apple, Banana;20', '30;Pear, Orange;40', '50;Peach, Mango;60']
np.genfromtxt(txtgen, delimiter=';', dtype=None)

produces:

array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
       (50, 'Peach, Mango', 60)], 
      dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])

I hadn't paid attention to np.fromregex before. Compared to genfromtxt it is surprisingly simple. To use with my sample txt I have to use a string buffer:

s=StringIO.StringIO(txt)
np.fromregex(s, r'(\d+),"(.+)",(\d+)', dtype='i4,S20,i4')

It's action distills down to:

pat=re.compile(r'(\d+),"(.+)",(\d+)'); dt=np.dtype('i4,S20,i4')
np.array(pat.findall(txt),dtype=dt)

It reads the whole file (f.read()) and does a findall which should produce a list like:

[('10', 'Apple, Banana', '20'),
 ('30', 'Pear, Orange', '40'),
 ('50', 'Peach, Mango', '60')]

A list of tuples is exactly what a structured array requires.

No fancy processing, error checks or filtering of comment lines. Just a pattern match followed by array construction.


Both my foo and fromregex assume a specific sequence of numbers and quoted strings. The csv.reader might be the simplest general purpose quote reader. The join is required because reader produces a list of lists, while genfromtxt wants an iterable of strings (it does its own 'split').

from csv import reader
s=StringIO.StringIO(txt)
np.genfromtxt((';'.join(x) for x in reader(s)), delimiter=';', dtype=None)

producing

array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
       (50, 'Peach, Mango', 60)], 
      dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])

Or in following the fromregex example, the reader output could be turned into a list of tuples and given to np.array directly:

np.array([tuple(x) for x in reader(s)], dtype='i4,S20,i4')
like image 26
hpaulj Avatar answered Oct 13 '22 19:10

hpaulj


I solved this by using my code below.

def transformCommas(line):
    out = ''
    insideQuote = False
    for c in line:
        if c == '"':
            insideQuote = not insideQuote
        if insideQuote == True and c == ',':
            out += '.'
        else:
            out += c
    return out

f = open("data/raw_data_all.csv", "rb")
replaced = (transformCommas(line) for line in f)
rawData = numpy.loadtxt(replaced,delimiter=',', skiprows=0, dtype=str)

Data:

1366x768,18,"5,237",73.38%,"3,843",79.55%,1.75,00:01:26,4.09%,214,$0.00
1366x768,22,"5,088",76.04%,"3,869",78.46%,1.82,00:01:20,3.93%,200,$0.00
1366x768,17,"4,887",74.34%,"3,633",78.37%,1.81,00:01:19,3.25%,159,$0.00
like image 45
Gary Fibiger Avatar answered Oct 13 '22 19:10

Gary Fibiger