As a followup to my question on mixed types in a column:
Can I think of a DataFrame
as a list of columns or is it a list of rows?
In the former case, it means that (optimally) each column has to be homogeneous (type-wise) and different columns can be of different types. The latter case, suggests that each row is type-wise homogeneous.
For the documentation:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
This implies that a DataFrame is a list of columns.
Does it mean that appending a row to a DataFrame
is more expensive than appending a column?
Data Frames are data displayed in a format as a table. Data Frames can have different types of data inside it. While the first column can be character , the second and third can be numeric or logical . However, each column should have the same type of data.
All Pandas data structures are value mutable (can be changed) and except Series all are size mutable. Series is size immutable. Note − DataFrame is widely used and one of the most important data structures.
Pandas DataFrame describe() MethodThe describe() method returns description of the data in the DataFrame. If the DataFrame contains numerical data, the description contains these information for each column: count - The number of not-empty values. mean - The average (mean) value.
You are fully correct that a DataFrame can be seen as a list of columns, or even more a (ordered) dictionary of columns (see explanation here).
Indeed, each column has to be homogeneous of type, and different columns can be of different types. But by using the object
dtype you can still hold different types of objects in one column (although not recommended apart for eg strings).
To illustrate, if you ask the data types of a DataFrame, you get the dtype for each column:
In [2]: df = pd.DataFrame({'int_col':[0,1,2], 'float_col':[0.0,1.1,2.5], 'bool_col':[True, False, True]})
In [3]: df.dtypes
Out[3]:
bool_col bool
float_col float64
int_col int64
dtype: object
Internally, the values are stored as blocks of the same type. Each column, or collection of columns of the same type is stored in a separate array.
And this indeed implies that appending a row is more expensive. In general, appending multiple single rows is not a good idea: better to eg preallocate an empty dataframe to fill, or put the new rows/columns in a list and concat them all at once.
See the note at the end of the concat/append docs (just before the first subsection "Set logic on the other axes").
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With