
How to take column-slices of dataframe in pandas

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.

Currently, I do the following:

data = pandas.read_csv('mydata.csv') 

which gives something like:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde')) 

I'd like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.

It is not possible to write something like

observations = data[:'c']
features = data['c':]

I'm not sure what the best method is. Do I need a pd.Panel?

By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other hand, data['a':] is not permitted but data[0:] is. Is there a practical reason for this? This is really confusing if the columns are labeled with integers, given that data[0] != data[0:1].

cpa asked May 19 '12 14:05




1 Answer

2017 Answer - pandas 0.20: .ix is deprecated. Use .loc

See the deprecation in the docs

.loc uses label-based indexing to select both rows and columns. The labels are the values of the index or of the columns. Unlike ordinary Python slicing, slicing with .loc includes the last element.

Let's assume we have a DataFrame with the following columns:
foo, bar, quz, ant, cat, sat, dat.

# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat
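To make the inclusive end point concrete, here is a minimal sketch. The frame is hypothetical: only the column names come from the answer, the values are placeholders, and the .iloc line is included purely as a contrast with position-based slicing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the column names are taken from the answer.
columns = ['foo', 'bar', 'quz', 'ant', 'cat', 'sat', 'dat']
df = pd.DataFrame(np.arange(35).reshape(5, 7), columns=columns)

# Label-based slice: the stop label 'sat' is part of the result.
by_label = df.loc[:, 'foo':'sat']
print(list(by_label.columns))

# Position-based slice for comparison: the stop position 5 ('sat') is excluded.
by_position = df.iloc[:, 0:5]
print(list(by_position.columns))
```

The label slice returns six columns, while the positional slice with the matching stop position returns five.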

.loc accepts the same slice notation that Python lists do, for both rows and columns. The slice notation is start:stop:step.

# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' to 'bar' in reverse order
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar

# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz', None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo', 'bar', 'dat']]
# foo bar dat
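The slices above can be checked end to end with a small runnable sketch. The one-row frame and the helper `selected` are assumptions made for illustration; only the column names and the .loc expressions come from the answer.

```python
import pandas as pd

# Placeholder one-row frame; only the column names come from the answer.
columns = ['foo', 'bar', 'quz', 'ant', 'cat', 'sat', 'dat']
df = pd.DataFrame([list(range(7))], columns=columns)


def selected(frame):
    """Return the column labels of a selection as a plain list (hypothetical helper)."""
    return list(frame.columns)


every_second = selected(df.loc[:, 'foo':'cat':2])       # step of 2
wrong_order = selected(df.loc[:, 'sat':'bar'])          # start lies after stop
reversed_cols = selected(df.loc[:, 'sat':'bar':-1])     # negative step walks backwards
via_slice = selected(df.loc[:, slice('quz', None, 2)])  # explicit slice object
via_list = selected(df.loc[:, ['foo', 'bar', 'dat']])   # explicit list of labels

for name, cols in [('every_second', every_second), ('wrong_order', wrong_order),
                   ('reversed_cols', reversed_cols), ('via_slice', via_slice),
                   ('via_list', via_list)]:
    print(name, cols)
```

A slice whose start label comes after its stop label selects nothing unless the step is negative, which is why `wrong_order` is empty.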

You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z

# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
#    foo ant
# w
# x
# y
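As a sketch of how this fits together, the block below runs the row-and-column slice on a placeholder frame and then applies the same .loc slicing to the DataFrame from the question. The values are made up; the names `observations` and `features` are taken from the question.

```python
import numpy as np
import pandas as pd

# Row labels v..z and the column names follow the answer; the values are placeholders.
columns = ['foo', 'bar', 'quz', 'ant', 'cat', 'sat', 'dat']
df = pd.DataFrame(np.arange(35).reshape(5, 7), index=list('vwxyz'), columns=columns)

# Rows 'w' through 'y' (inclusive), every 3rd column from 'foo' to 'ant'.
sub = df.loc['w':'y', 'foo':'ant':3]
print(sub)

# The frame from the question, split into two column groups.
data = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
observations = data.loc[:, :'b']  # columns a and b
features = data.loc[:, 'c':]      # columns c, d and e
print(observations.shape, features.shape)
```

Because .loc slices include the stop label, `:'b'` and `'c':` split the columns without overlap and without dropping any.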
Ted Petrou answered Sep 22 '22 02:09