
Skip a specified number of columns with numpy.genfromtxt()

Tags: python, numpy

I have a large table (numbers in text format) that I would like to load with numpy.genfromtxt(). I would like to ignore the first n columns, say 5. I do not know the size of the table (number of rows or columns) in advance.

I saw that genfromtxt() has a skip_header option that skips a specified number of header rows, but there seems to be no equivalent option for columns. There is a usecols option, but it requires me to list the columns I want to keep rather than those I want to discard, and I cannot build that list without knowing the total number of columns in advance.

Obviously I could just load the whole thing and then throw away the first n columns, but this is not elegant and is wasteful in terms of memory.

I could also peek into the file, find the number of columns, and then construct the usecols argument, but that is rather messy.

Any ideas on how to solve this elegantly? Is there some hidden/undocumented argument that I can use?

Bitwise asked Nov 09 '12 15:11




2 Answers

For older versions of numpy, peeking at the first line to discover the number of columns is not that hard:

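import numpy as np

# fname is the path to the whitespace-delimited data file
with open(fname, 'r') as f:
    # Count the columns on the first line, then rewind before parsing
    num_cols = len(f.readline().split())
    f.seek(0)
    # Keep columns 5 onward, i.e. skip the first 5
    data = np.genfromtxt(f, usecols=range(5, num_cols))
print(data)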
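For anyone who wants to try the peeking approach without a file on disk, here is a minimal self-contained sketch. The helper name load_skipping_columns and the sample data are made up for illustration, and it assumes a whitespace-delimited table with no header row and the same number of columns on every line.

```python
import io
import numpy as np

def load_skipping_columns(f, n_skip):
    """Load a whitespace-delimited table from an open, seekable text file,
    dropping the first n_skip columns (hypothetical helper)."""
    # Peek at the first line to learn how many columns there are
    num_cols = len(f.readline().split())
    # Rewind so genfromtxt sees the whole file, including the first line
    f.seek(0)
    return np.genfromtxt(f, usecols=range(n_skip, num_cols))

# Illustrative stand-in for a real file: 2 rows, 7 columns
sample = io.StringIO("1 2 3 4 5 6 7\n8 9 10 11 12 13 14\n")
data = load_skipping_columns(sample, 5)
print(data)
```

With usecols set, only the selected columns end up in the returned array, so the discarded columns are never stored in the result.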
unutbu answered Oct 14 '22 18:10


In newer versions of NumPy, np.genfromtxt accepts an iterable of lines, so you can wrap the file you're reading in a generator that yields each line with its first N columns removed. If your numbers are whitespace-separated, that looks something like:

np.genfromtxt(" ".join(ln.split()[N:]) for ln in f)
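A fuller sketch of the same idea, runnable as-is, might look like the following. The generator name drop_leading_columns and the sample data are invented for the example, and it assumes whitespace-separated values and a NumPy version recent enough to accept a generator of strings.

```python
import io
import numpy as np

def drop_leading_columns(lines, n_skip):
    """Yield each line with its first n_skip whitespace-separated
    fields removed (hypothetical helper)."""
    for ln in lines:
        yield " ".join(ln.split()[n_skip:])

# Illustrative stand-in for an open file: 2 rows, 7 columns
sample = io.StringIO("1 2 3 4 5 6 7\n8 9 10 11 12 13 14\n")
data = np.genfromtxt(drop_leading_columns(sample, 5))
print(data)
```

Because the lines are trimmed one at a time as genfromtxt consumes them, there is no need to know the total number of columns or to rewind the file.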
Fred Foo answered Oct 14 '22 19:10