
Using numpy.genfromtxt to read a csv file with strings containing commas

Tags:

python

pandas

csv

numpy

genfromtxt

I am trying to read in a csv file with numpy.genfromtxt but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':

2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0

the code

np.genfromtxt('t.csv', delimiter=',')

produces the error:

ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)

The data structure I am looking for is:

array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']], 
      dtype='|S13')

Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?


asked Jul 29 '13 20:07

CraigO

4 Answers

The problem is the additional comma: np.genfromtxt splits on every delimiter and does not treat quoted fields specially.

One simple solution is to read the file with csv.reader() from Python's csv module into a list and then convert that list to a numpy array if you like.
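A minimal sketch of that csv.reader route, using an in-memory copy of the question's two rows in place of t.csv (the SAMPLE string and the variable names are illustrative, not from the original post). It passes skipinitialspace=True on the assumption that, as in the sample, a space follows each comma; without it the csv module does not see the quote as the start of a quoted field.

```python
import csv
import io

import numpy as np

# In-memory stand-in for the question's t.csv
SAMPLE = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# For a real file, use: open('t.csv', newline='')
with io.StringIO(SAMPLE) as fp:
    # csv.reader honours the quotes, so the comma in "Lexington, KY" is kept
    rows = list(csv.reader(fp, skipinitialspace=True))

# A list of equal-length string lists becomes a 2-D array of strings
arr = np.array(rows)
print(arr)
```

Every element is still a string here; a single column can be converted afterwards, for example with arr[:, 2].astype(float).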

If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.

That would go something like this:

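import csv
import numpy as np

# csv.reader resolves the quoting; each parsed row is re-joined with tabs.
# skipinitialspace=True is needed when a space follows each comma, as in t.csv.
with open('myfile.csv', newline='') as f:
    rows = ("\t".join(row) for row in csv.reader(f, skipinitialspace=True))
    data = np.genfromtxt(rows, delimiter="\t")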

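This essentially replaces, on the fly, only the delimiter commas with tabs; commas inside quoted fields are left alone. Note that with the default float dtype any text column comes back as nan, so pass dtype=str (or dtype=None) to np.genfromtxt if you want to keep the strings.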

answered Oct 21 '22 19:10

Bitwise

You can use pandas (which is becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python) for this. Its read_csv can handle this. From the docs:

quotechar : string

The character used to denote the start and end of a quoted item. Quoted items 
can include the delimiter and it will be ignored.

The default value is ". An example:

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: s="""year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""

In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0

The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma-delimiter.

Apart from its powerful CSV reader, I also strongly advise using pandas for the heterogeneous data you have (the example output you give in numpy is all strings, although you could use structured arrays).
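For comparison, here is a rough sketch of the structured-array alternative mentioned above, built with the standard csv module rather than pandas. The field names, the string width (U20) and the in-memory SAMPLE are assumptions made for illustration; they are not part of the original question.

```python
import csv
import io

import numpy as np

# In-memory stand-in for the question's t.csv
SAMPLE = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# One typed field per column; names and widths are illustrative choices
record_dtype = np.dtype([("year", "i4"), ("city", "U20"), ("value", "f8")])

reader = csv.reader(io.StringIO(SAMPLE), skipinitialspace=True)

# Structured arrays are built from a list of tuples, one tuple per record
records = np.array(
    [(int(year), city, float(value)) for year, city, value in reader],
    dtype=record_dtype,
)

print(records["city"])   # access a whole column by field name
print(records["value"].mean())
```

This keeps the year and value columns numeric while the city column stays text, which a plain 2-D string array cannot do.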


answered Oct 21 '22 20:10

joris

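If you are using numpy, you probably want to end up with a numpy.ndarray. This will give you one:

import pandas
data = pandas.read_csv('file.csv').to_numpy()  # as_matrix() was removed in pandas 1.0

Pandas will handle the "Lexington, KY" case correctly, although when a space follows each comma (as in the question's data) you also need to pass skipinitialspace=True.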
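As a hedged sketch of how this might look for the question's exact data: the sample has no header row and has a space after each comma, so the call below passes header=None and skipinitialspace=True, and dtype=str to get the all-string result the question asked for. The SAMPLE string is an illustrative stand-in for the file.

```python
import io

import pandas as pd

# In-memory stand-in for the question's t.csv (no header row)
SAMPLE = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

df = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,            # the first line is data, not column names
    skipinitialspace=True,  # ignore the space after each comma
    dtype=str,              # keep every column as text
)

# Convert the DataFrame to a 2-D numpy array
arr = df.to_numpy()
print(arr)
```

Dropping dtype=str lets pandas infer numeric types per column, in which case to_numpy() returns an object array holding a mix of numbers and strings.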

answered Oct 21 '22 20:10

Michael Yurin

You can write a helper function that combines the power of the standard csv module and NumPy's recfromcsv. For instance, the csv module gives good control over dialects, quotes, escape characters, etc., which you can add to the example below.

The example recfromcsv_mod function below reads in a complicated CSV file much as Microsoft Excel would interpret it, including commas within quoted fields. Internally, the function has a generator function that rewrites each row with tab delimiters.

import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        with open(fname, newline='') as fp:
            dialect = csv.Sniffer().sniff(fp.read(1024))
            fp.seek(0)
            for row in csv.reader(fp, dialect):
                yield "\t".join(row)
    return np.recfromcsv(
        rewrite_csv_as_tab(fname), delimiter="\t", encoding=None, **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod("t.csv", case_sensitive=True)
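As far as I know, np.recfromcsv is no longer exposed in the main namespace as of NumPy 2.0, and the migration guide points to np.genfromtxt instead. The sketch below is one possible equivalent built on np.genfromtxt; the helper name genfromcsv_quoted, its defaults, and the in-memory SAMPLE with a header row are all assumptions for illustration. It skips the dialect sniffing and simply passes skipinitialspace=True.

```python
import csv
import io

import numpy as np

def genfromcsv_quoted(fp, **kwargs):
    """Hypothetical helper: read quoted CSV from an open text file
    into a structured array via np.genfromtxt."""
    reader = csv.reader(fp, skipinitialspace=True)
    # Re-join each parsed row with tabs so embedded commas are harmless
    lines = ("\t".join(row) for row in reader)
    kwargs.setdefault("dtype", None)     # infer a type per column
    kwargs.setdefault("names", True)     # first row holds the field names
    kwargs.setdefault("encoding", None)  # return str rather than bytes
    return np.genfromtxt(lines, delimiter="\t", **kwargs)

# In-memory stand-in for a CSV file with a header row
SAMPLE = (
    'year, city, value\n'
    '2012, "Louisville KY", 3.5\n'
    '2011, "Lexington, KY", 4.0\n'
)

x = genfromcsv_quoted(io.StringIO(SAMPLE))
print(x["city"])
```

For a file on disk, open it with open(fname, newline='') and pass the file object in. Fields that themselves contain tab characters would break this approach, since the tab is reused as the delimiter.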