Why do I get Pandas data frame with only one column vs Series?

Tags: python, pandas, dataframe, series

To my chagrin, I've gotten a single-column DataFrame back a couple of times (examples below), whereas in most other cases a one-column result is just a Series. Is there any rhyme or reason as to why a one-column DataFrame would be returned?

Examples:

1) When indexing columns by a boolean mask that has only one True value:

import pandas as pd

df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])
mask = [False, True, False]
# .ix was removed in pandas 1.0; .loc accepts the same boolean mask
type(df.loc[:, mask])

2) When setting an index on a DataFrame that has only two columns to begin with:

df = pd.DataFrame([list('ab'), list('de'), list('fg')], columns=['foo', 'bar'])
type(df.set_index('foo'))

I feel like if I'm expecting a DF with only one column, I can deal with it by just calling

pd.Series(df.values.ravel(), index=df.index)

asked Sep 18 '14 19:09

paulsef11

1 Answer

In general, a one-column DataFrame will be returned when the operation could return a multicolumn DataFrame. For instance, when you use a boolean column index, a multicolumn DataFrame would have to be returned if there was more than one True value, so a DataFrame will always be returned, even if it has only one column. Likewise when setting an index, if your DataFrame had more than two columns, the result would still have to be a DataFrame after removing one for the index, so it will still be a DataFrame even if it has only one column left.
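As a minimal sketch of the two cases just described (written with `.loc`, since `.ix` is no longer available in current pandas; the variable names are only illustrative), the return type stays a DataFrame no matter how many columns end up selected or left over:

```python
import pandas as pd

df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])

# Boolean column mask: the result is a DataFrame whether one or two values are True.
one_true = df.loc[:, [False, True, False]]
two_true = df.loc[:, [True, True, False]]
print(type(one_true).__name__, one_true.shape)
print(type(two_true).__name__, two_true.shape)

# set_index: the result is a DataFrame whether one or two columns remain.
from_two_cols = df[['foo', 'bar']].set_index('foo')
from_three_cols = df.set_index('foo')
print(type(from_two_cols).__name__, from_two_cols.shape)
print(type(from_three_cols).__name__, from_three_cols.shape)
```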

In contrast, if you do something like df.loc[:, 'col'] (df.ix[:, 'col'] in older pandas versions), it returns a Series, because as long as the column labels are unique, passing a single column name can never select more than one column.
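A short sketch of that contrast, assuming unique column labels: a single label yields a Series, while a one-element list of labels, which could in principle hold more labels, yields a DataFrame.

```python
import pandas as pd

df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])

# A single (scalar) label can only name one column, so a Series comes back.
by_label = df.loc[:, 'bar']
print(type(by_label).__name__)

# A list of labels could name several columns, so a DataFrame comes back,
# even when the list happens to contain just one label.
by_list = df.loc[:, ['bar']]
print(type(by_list).__name__)
```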

The idea is that doing an operation should not sometimes return a DataFrame and sometimes a Series based on features specific to the operands (i.e., how many columns they happen to have, how many values are True in your boolean mask). When you do df.set_index('col'), it's simpler if you know that you will always get a DataFrame, without having to worry about how many columns the original happened to have.

Note that there is also the DataFrame method .squeeze() for turning a one-column DataFrame into a Series.
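For illustration, here is how `.squeeze()` might be used on the boolean-mask result from the question. One caveat worth knowing: called without arguments, `.squeeze()` collapses every length-one axis, so a one-row, one-column DataFrame becomes a scalar; passing `axis=1` restricts it to the column axis. The variable names below are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])
one_col = df.loc[:, [False, True, False]]  # one-column DataFrame

# Squeeze only the column axis, so the result is a Series named after the column.
s = one_col.squeeze(axis=1)
print(type(s).__name__, s.name)

# With a single row as well, a bare squeeze() goes all the way down to a scalar.
single_cell = pd.DataFrame({'bar': ['b']})
scalar = single_cell.squeeze()
still_series = single_cell.squeeze(axis=1)
print(type(scalar).__name__, type(still_series).__name__)
```

If the column count is already known to be one, `one_col.iloc[:, 0]` gives the same Series without relying on `squeeze`.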

answered Nov 03 '22 01:11

BrenBarn
