Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The nature of pandas DataFrame

Tags:

python

pandas

As a followup to my question on mixed types in a column:

Can I think of a DataFrame as a list of columns or is it a list of rows?

In the former case, it means that (optimally) each column has to be homogeneous (type-wise) and different columns can be of different types. The latter case, suggests that each row is type-wise homogeneous.

For the documentation:

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

This implies that a DataFrame is a list of columns.

Does it mean that appending a row to a DataFrame is more expensive than appending a column?

like image 596
Dror Avatar asked Dec 09 '14 08:12

Dror


People also ask

What are the characteristics of DataFrame?

Data Frames are data displayed in a format as a table. Data Frames can have different types of data inside it. While the first column can be character , the second and third can be numeric or logical . However, each column should have the same type of data.

What is the nature of the type of data used in pandas series?

All Pandas data structures are value mutable (can be changed) and except Series all are size mutable. Series is size immutable. Note − DataFrame is widely used and one of the most important data structures.

How would you describe Panda DataFrame?

Pandas DataFrame describe() MethodThe describe() method returns description of the data in the DataFrame. If the DataFrame contains numerical data, the description contains these information for each column: count - The number of not-empty values. mean - The average (mean) value.


1 Answers

You are fully correct that a DataFrame can be seen as a list of columns, or even more a (ordered) dictionary of columns (see explanation here).

Indeed, each column has to be homogeneous of type, and different columns can be of different types. But by using the object dtype you can still hold different types of objects in one column (although not recommended apart for eg strings).
To illustrate, if you ask the data types of a DataFrame, you get the dtype for each column:

In [2]: df = pd.DataFrame({'int_col':[0,1,2], 'float_col':[0.0,1.1,2.5], 'bool_col':[True, False, True]})

In [3]: df.dtypes
Out[3]:
bool_col        bool
float_col    float64
int_col        int64
dtype: object

Internally, the values are stored as blocks of the same type. Each column, or collection of columns of the same type is stored in a separate array.

And this indeed implies that appending a row is more expensive. In general, appending multiple single rows is not a good idea: better to eg preallocate an empty dataframe to fill, or put the new rows/columns in a list and concat them all at once.
See the note at the end of the concat/append docs (just before the first subsection "Set logic on the other axes").

like image 55
joris Avatar answered Oct 26 '22 19:10

joris