I have a csv file where a line of data might look like this:
10,"Apple, Banana",20,...
When I load the data in Python, the extra comma inside the quotes shifts all my column indices around, so my data is no longer a consistent structure. While I could probably write a complex algorithm that iterates through each row and fixes the issue, I was hoping there was an elegant way to just pass an extra parameter to loadtxt (or some other function) that will properly ignore commas inside quotes and treat the entire quoted string as one value.
Note: when I load the CSV file into Excel, Excel correctly recognizes the string as one value.
One way you could do it with a single NumPy function call would be to use np.fromregex, which allows you to use Python's regular expression syntax to parse the contents of your text file in any arbitrary way. For example:
np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', object)
gives you:
array([['10', 'Apple, Banana', '20'],
['30', 'Orange, Watermelon', '40']], dtype=object)
To unpack that regular expression a bit, (\d+) will match one or more digits, and "(.+)" will match one or more of any character inside double quotes. np.fromregex tries to match this expression within every line in your .csv file, and the parts inside the brackets (the capture groups) are taken as the individual elements in each row of the output array.
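To see the pattern in isolation, here is a minimal sketch using only the standard-library re module; the sample line is taken from the question:

```python
import re

# the same pattern fromregex uses: two integer fields around a quoted field
pattern = r'(\d+),"(.+)",(\d+)'

match = re.match(pattern, '10,"Apple, Banana",20')
print(match.groups())  # ('10', 'Apple, Banana', '20')
```

Each capture group becomes one element of the tuple, which is why the comma inside the quotes causes no trouble here.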
If you want a record array as your output, with different 'fields' for the three 'columns' in your .csv file, you could specify separate dtypes for each set of brackets in the regex:
np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', 'i8, S20, i8')
gives you:
array([(10, 'Apple, Banana', 20), (30, 'Orange, Watermelon', 40)],
dtype=[('f0', '<i8'), ('f1', 'S20'), ('f2', '<i8')])
This issue has been discussed before. There isn't a parameter in loadtxt (or genfromtxt) that does what you want; in other words, neither function is quote sensitive. The Python csv module has some sort of quote awareness. The pandas reader is also quote aware.
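As a quick illustration of that quote awareness, a sketch using only the standard-library csv module (the sample line is from the question):

```python
import csv
import io

line = '10,"Apple, Banana",20\n'
rows = list(csv.reader(io.StringIO(line)))
print(rows)  # [['10', 'Apple, Banana', '20']]
```

The default quotechar is '"', so the comma inside the quoted field does not split the value.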
But processing the lines before passing them to loadtxt is quite acceptable. All the function needs is an iterable, something that can feed it lines one at a time. That can be a file, a list of lines, or a generator.
A simple processor would just replace the commas within quotes with some other character, or replace the ones outside of quotes with a delimiter of your choice. It doesn't have to be fancy to do the job.
For example:
txt = """10,"Apple, Banana",20
30,"Pear, Orange",40
50,"Peach, Mango",60
"""
def foo(astr):
    # replace commas outside quotes with ';'
    # a bit crude and specialized
    x = astr.split('"')
    return ';'.join([i.strip(',') for i in x])
txt1 = [foo(astr) for astr in txt.splitlines()]
txtgen = (foo(astr) for astr in txt.splitlines()) # or as generator
# ['10;Apple, Banana;20', '30;Pear, Orange;40', '50;Peach, Mango;60']
np.genfromtxt(txtgen, delimiter=';', dtype=None)
produces:
array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
(50, 'Peach, Mango', 60)],
dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])
I hadn't paid attention to np.fromregex before. Compared to genfromtxt it is surprisingly simple. To use it with my sample txt I have to use a string buffer:
import io
s = io.StringIO(txt)
np.fromregex(s, r'(\d+),"(.+)",(\d+)', dtype='i4,S20,i4')
Its action distills down to:
pat = re.compile(r'(\d+),"(.+)",(\d+)')
dt = np.dtype('i4,S20,i4')
np.array(pat.findall(txt), dtype=dt)
It reads the whole file (f.read()
) and does a findall
which should produce a list like:
[('10', 'Apple, Banana', '20'),
('30', 'Pear, Orange', '40'),
('50', 'Peach, Mango', '60')]
A list of tuples is exactly what a structured array requires.
No fancy processing, error checks or filtering of comment lines. Just a pattern match followed by array construction.
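That last step can stand on its own: np.array accepts a list of tuples directly when given a structured dtype. A minimal sketch (the field values mirror the sample data above; U20 is used instead of S20 so the strings stay as Python 3 str):

```python
import numpy as np

# a list of tuples, one tuple per row, is what a structured array expects
rows = [(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40)]
arr = np.array(rows, dtype='i4,U20,i4')
print(arr['f1'][0])  # Apple, Banana
```

The auto-generated field names f0, f1, f2 can then index the 'columns' just as in the genfromtxt output shown earlier.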
Both my foo and fromregex assume a specific sequence of numbers and quoted strings. The csv.reader might be the simplest general-purpose quote-aware reader. The join is required because reader produces a list of lists, while genfromtxt wants an iterable of strings (it does its own 'split').
from csv import reader
import io
s = io.StringIO(txt)
np.genfromtxt((';'.join(x) for x in reader(s)), delimiter=';', dtype=None)
producing
array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
(50, 'Peach, Mango', 60)],
dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])
Or, following the fromregex example, the reader output could be turned into a list of tuples and given to np.array directly:
np.array([tuple(x) for x in reader(s)], dtype='i4,S20,i4')
I solved this with the code below.
import numpy

def transformCommas(line):
    # toggle the flag at every double quote; while inside quotes,
    # replace ',' with '.' so loadtxt can split on the remaining commas
    out = ''
    insideQuote = False
    for c in line:
        if c == '"':
            insideQuote = not insideQuote
        if insideQuote and c == ',':
            out += '.'
        else:
            out += c
    return out

f = open("data/raw_data_all.csv", "r")  # text mode, so lines are str not bytes
replaced = (transformCommas(line) for line in f)
rawData = numpy.loadtxt(replaced, delimiter=',', skiprows=0, dtype=str)
Data:
1366x768,18,"5,237",73.38%,"3,843",79.55%,1.75,00:01:26,4.09%,214,$0.00
1366x768,22,"5,088",76.04%,"3,869",78.46%,1.82,00:01:20,3.93%,200,$0.00
1366x768,17,"4,887",74.34%,"3,633",78.37%,1.81,00:01:19,3.25%,159,$0.00