I've run into single-column DataFrames a couple of times, much to my chagrin (examples below), whereas in most other cases a one-column DataFrame would just be a Series. Is there any rhyme or reason as to why a one-column DataFrame would be returned?
Examples:
1) when indexing columns by a boolean mask where the mask only has one true value:
import pandas as pd
df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])
mask = [False, True, False]
type(df.loc[:, mask])  # .ix was removed in pandas 1.0; .loc accepts the same boolean mask
2) when setting an index on DataFrame that only has two columns to begin with:
df = pd.DataFrame([list('ab'), list('de'), list('fg')], columns=['foo', 'bar'])
type(df.set_index('foo'))
I feel like if I'm expecting a DF with only one column, I can deal with it by just calling
pd.Series(df.values.ravel(), index=df.index)  # .values is an attribute, not a method
But in most other cases a one-column data frame would just be a Series. Is there any rhyme or reason as to why a one column DF would be returned?
A Series holds a single one-dimensional sequence of values together with an index, whereas a DataFrame is made up of one or more Series. In other words, a DataFrame is a collection of Series that share an index and can be analysed together.
So a Series is the data structure for a single column of a DataFrame: selecting one column by name gives you a Series, and a DataFrame behaves much like a dictionary of Series that share the same index. (Internally, pandas groups columns into typed blocks rather than literally storing one Series object per column.)
You can create a DataFrame from multiple Series objects by adding each Series as a column. The pd.concat() function merges multiple Series into a DataFrame. It takes several parameters; for this scenario we pass the list of Series to combine and axis=1, so that the Series are joined as columns instead of being stacked as rows.
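As a rough sketch of the concat() approach described above: the Series names and values here are made up for illustration, and the only pandas calls used are pd.Series, pd.concat, and column selection by name.

```python
import pandas as pd

# Two hypothetical Series that share the same default integer index.
foo = pd.Series(list('adf'), name='foo')
bar = pd.Series(list('beg'), name='bar')

# axis=1 places the Series side by side; each Series name becomes a column label.
df = pd.concat([foo, bar], axis=1)

# Selecting a single column by name hands back a Series again.
col = df['foo']
```

With axis=0 (the default) the same call would instead stack the two Series end to end into one longer Series.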
In general, a one-column DataFrame will be returned when the operation could return a multicolumn DataFrame. For instance, when you use a boolean column index, a multicolumn DataFrame would have to be returned if there was more than one True value, so a DataFrame will always be returned, even if it has only one column. Likewise when setting an index, if your DataFrame had more than two columns, the result would still have to be a DataFrame after removing one for the index, so it will still be a DataFrame even if it has only one column left.
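A small sketch of that rule, reusing the question's example data. It uses .loc rather than .ix on the assumption that you are on a recent pandas version; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])

# A boolean mask may select any number of columns, so the result is a
# DataFrame whether one or two values happen to be True.
one_true = df.loc[:, [False, True, False]]
two_true = df.loc[:, [True, True, False]]

# set_index removes one column; the remainder is still a DataFrame,
# even when only a single column is left.
two_cols = pd.DataFrame([list('ab'), list('de')], columns=['foo', 'bar'])
indexed = two_cols.set_index('foo')
```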
In contrast, if you do something like df.loc[:, 'col'] (df.ix[:, 'col'] in older pandas), it returns a Series, because passing a single column name can never select more than one column (assuming the column names are unique).
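To make the contrast concrete, here is a minimal sketch comparing a single label with a one-element list of labels. It assumes unique column names and uses .loc in place of the older .ix.

```python
import pandas as pd

df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])

# A scalar label can only ever name one column, so a Series comes back.
as_series = df.loc[:, 'bar']

# A list of labels could name several columns, so a DataFrame comes back,
# even though this particular list has just one entry.
as_frame = df.loc[:, ['bar']]
```

The same distinction applies to df['bar'] versus df[['bar']].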
The idea is that doing an operation should not sometimes return a DataFrame and sometimes a Series based on features specific to the operands (i.e., how many columns they happen to have, how many values are True in your boolean mask). When you do df.set_index('col'), it's simpler if you know that you will always get a DataFrame, without having to worry about how many columns the original happened to have.
Note that there is also the DataFrame method .squeeze() for turning a one-column DataFrame into a Series.
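For instance, applied to the set_index example from the question, a sketch might look like this. Passing axis=1 is a choice made here so that only the column axis is squeezed; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([list('ab'), list('de'), list('fg')], columns=['foo', 'bar'])

# After set_index only the 'bar' column remains, still wrapped in a DataFrame.
one_col = df.set_index('foo')

# Squeezing along the column axis collapses the single column into a Series
# that keeps the 'foo' index and takes the column label as its name.
s = one_col.squeeze(axis=1)
```

Calling .squeeze() with no axis on a frame that also has only one row would collapse it all the way to a scalar, which is why restricting it to the column axis can be the safer habit.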