I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.

Currently, I do the following:
<pre class="prettyprint"><code>data = pandas.read_csv('mydata.csv')
</code></pre>
which gives something like:
<pre class="prettyprint"><code>data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
</code></pre>
I'd like to slice this dataframe into two dataframes: one containing the columns <code>a</code> and <code>b</code> and one containing the columns <code>c</code>, <code>d</code> and <code>e</code>.

It is not possible to write something like
<pre class="prettyprint"><code>observations = data[:'c']
features = data['c':]
</code></pre>
I'm not sure what the best method is. Do I need a <code>pd.Panel</code>?

By the way, I find dataframe indexing pretty inconsistent: <code>data['a']</code> is permitted, but <code>data[0]</code> is not. On the other hand, <code>data['a':]</code> is not permitted but <code>data[0:]</code> is. Is there a practical reason for this? This is really confusing if columns are indexed by integers, given that <code>data[0] != data[0:1]</code>.

<h3>2017 Answer - pandas 0.20: .ix is deprecated. Use .loc</h3>
See the deprecation in the docs.

<code>.loc</code> uses label-based indexing to select both rows and columns. The labels are the values of the index or the columns. Unlike ordinary Python slicing, slicing with <code>.loc</code> includes the last element.
<blockquote> Let's assume we have a DataFrame with the following columns: <code>foo</code>, <code>bar</code>, <code>quz</code>, <code>ant</code>, <code>cat</code>, <code>sat</code>, <code>dat</code>. </blockquote>
<pre class="prettyprint"><code># selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat
</code></pre>
<code>.loc</code> accepts the same <code>start:stop:step</code> slice notation that Python lists do, for both rows and columns.
<pre class="prettyprint"><code># slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' to 'bar' in reverse
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar

# slice notation is syntactic sugar for the built-in slice object
# slice from 'quz' to the end by 2 with slice()
df.loc[:, slice('quz', None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo', 'bar', 'dat']]
# foo bar dat
</code></pre>
You can slice by rows and columns at the same time. For instance, if you have 5 rows with labels <code>v</code>, <code>w</code>, <code>x</code>, <code>y</code>, <code>z</code>:
<pre class="prettyprint"><code># slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
#    foo ant
# w
# x
# y
</code></pre>
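As a minimal sketch, the label-based slicing described above could be applied to the frame from the question roughly as follows. The seeded random data, the integer-filled `df`, and the names `observations_by_pos` and `features_by_pos` are assumptions made for illustration. `.iloc` is not part of the answer above; it is included only as the positional counterpart to `.loc`, to show that the two disagree on whether the end of a slice is included.

```python
import numpy as np
import pandas as pd

# Stand-in for the CSV in the question: 10 rows, columns 'a'..'e' (assumed data).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((10, 5)), columns=list("abcde"))

# Label-based split: .loc slices include both endpoints.
observations = data.loc[:, :"b"]   # columns a and b
features = data.loc[:, "c":]       # columns c, d and e

# Position-based split: .iloc slices exclude the stop position.
observations_by_pos = data.iloc[:, :2]
features_by_pos = data.iloc[:, 2:]

print(list(observations.columns))   # ['a', 'b']
print(list(features.columns))       # ['c', 'd', 'e']
print(observations.equals(observations_by_pos))  # True

# The frame assumed in the answer: 7 named columns, rows labelled v..z.
df = pd.DataFrame(
    np.arange(35).reshape(5, 7),
    index=list("vwxyz"),
    columns=["foo", "bar", "quz", "ant", "cat", "sat", "dat"],
)

# Row and column label slices combined; both ranges are inclusive.
subset = df.loc["w":"y", "foo":"ant":3]
print(list(subset.index))    # ['w', 'x', 'y']
print(list(subset.columns))  # ['foo', 'ant']
```

Slicing by label keeps working if columns are later inserted before `a` or after `e`, whereas the positional version depends on the column order staying fixed.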

How to take column-slices of dataframe in pandas

Tags: python, slice, pandas, dataframe, numpy

415

asked May 19 '12 14:05

cpa

1 Answer

118

answered Sep 22 '22 02:09

Ted Petrou
