I am trying to read in a csv file with numpy.genfromtxt
but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0
the code
np.genfromtxt('t.csv', delimiter=',')
produces the error:
ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)
The data structure I am looking for is:
array([['2012', 'Louisville KY', '3.5'],
['2011', 'Lexington, KY', '4.0']],
dtype='|S13')
Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?
To read CSV data into a record array, NumPy provides the genfromtxt() function; you set its delimiter parameter to a comma. genfromtxt() is used quite frequently to load data from text files in Python.
The delimiter argument: quite often, a single character marks the separation between columns. For example, comma-separated files (CSV) use a comma (,) or a semicolon (;) as the delimiter:
>>> data = u"1, 2, 3\n4, 5, 6"
>>> np.genfromtxt(StringIO(data), delimiter=",")
array([[1., 2., 3.],
       [4., 5., 6.]])
The only mandatory argument of genfromtxt is the source of the data. It can be a string, a list of strings, a generator, or an open file-like object with a read method, for example a file or io.StringIO object.
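For example, here is a minimal sketch (the numbers are just an illustration) showing that a generator of row strings is accepted as a source just like a file-like object:
from io import StringIO
import numpy as np

data = "1, 2, 3\n4, 5, 6"
np.genfromtxt(StringIO(data), delimiter=",")                         # file-like source
np.genfromtxt((line for line in data.splitlines()), delimiter=",")   # generator source
# both calls return array([[1., 2., 3.],
#                          [4., 5., 6.]])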
The problem is the extra comma inside the quoted field; np.genfromtxt does not deal with that.
One simple solution is to read the file with csv.reader() from Python's csv module into a list and then dump it into a numpy array if you like.
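A minimal sketch of that approach, using the t.csv from the question (skipinitialspace=True is needed because a space precedes each opening quote):
import csv
import numpy as np

with open('t.csv', newline='') as f:
    rows = list(csv.reader(f, skipinitialspace=True))   # quotes stripped, inner commas kept
data = np.array(rows)
# data -> array([['2012', 'Louisville KY', '3.5'],
#                ['2011', 'Lexington, KY', '4.0']], dtype='<U13')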
If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.
That would go something like this:
import csv
import numpy as np
np.genfromtxt(("\t".join(i) for i in csv.reader(open('myfile.csv'))), delimiter="\t")
This essentially replaces, on the fly, only the delimiting commas with tabs, leaving the commas inside quoted fields alone.
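Note that with the default dtype the string columns would come back as nan; to get the all-string array shown in the question, a sketch along these lines (adding dtype=str and skipinitialspace=True to the one-liner above) should do it:
import csv
import numpy as np

with open('t.csv', newline='') as f:
    reader = csv.reader(f, skipinitialspace=True)   # handles the space before each opening quote
    data = np.genfromtxt(("\t".join(row) for row in reader),
                         delimiter="\t", dtype=str)
# data -> array([['2012', 'Louisville KY', '3.5'],
#                ['2011', 'Lexington, KY', '4.0']], dtype='<U13')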
You can use pandas (the de facto standard library for working with dataframes, i.e. heterogeneous data, in scientific Python) for this. Its read_csv can handle this. From the docs:
quotechar : string
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
The default value is ". An example:
In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: s="""year, city, value
...: 2012, "Louisville KY", 3.5
...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
year city value
0 2012 Louisville KY 3.5
1 2011 Lexington, KY 4.0
The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma delimiter.
Apart from a powerful csv reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output in numpy you give is all strings, although you could use structured arrays).
If you are using numpy you probably want to work with a numpy.ndarray. This will give you one (DataFrame.as_matrix() has since been removed from pandas; DataFrame.to_numpy() is its replacement):
import pandas
data = pandas.read_csv('file.csv').to_numpy()
Pandas will handle the "Lexington, KY" case correctly.
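For the question's t.csv specifically (no header row, and an all-string array wanted), a sketch along these lines should work; header=None and the astype(str) cast are assumptions based on the desired output shown in the question:
import pandas as pd

df = pd.read_csv('t.csv', header=None, quotechar='"', skipinitialspace=True)
arr = df.to_numpy().astype(str)   # object array -> all-string array
# arr -> array([['2012', 'Louisville KY', '3.5'],
#               ['2011', 'Lexington, KY', '4.0']], dtype='<U13')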
Make a better function that combines the power of the standard csv module and Numpy's recfromcsv. For instance, the csv module has good control and customization of dialects, quotes, escape characters, etc., which you can add to the example below.
The example recfromcsv_mod function below reads in a complicated CSV file similar to what Microsoft Excel produces, which may contain commas within quoted fields. Internally, the function uses a generator that rewrites each row with tab delimiters.
import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        # Sniff the dialect from the first 1 KB, then re-emit each row tab-delimited
        with open(fname, newline='') as fp:
            dialect = csv.Sniffer().sniff(fp.read(1024))
            fp.seek(0)
            for row in csv.reader(fp, dialect):
                yield "\t".join(row)
    return np.recfromcsv(
        rewrite_csv_as_tab(fname), delimiter="\t", encoding=None, **kwargs)
# Use it to read a CSV file into a record array
x = recfromcsv_mod("t.csv", case_sensitive=True)