
Add a new column into an existing Polars dataframe

I want to add a column new_column to an existing dataframe df. I know this looks like a duplicate of

Add new column to polars DataFrame

but the answer to that question, like the answers to many similar questions, doesn't really add a column to an existing dataframe. Instead, it creates a new dataframe that contains the new column. I think this can be fixed like this:

df = df.with_columns(
    new_column = pl.lit('some_text')
)

However, rewriting the whole dataframe just to add a few columns seems a bit of a waste to me. Is this the right approach?

DeltaIV asked Sep 03 '25 17:09


2 Answers

Your question suggests that you think that when you do

df = df.with_columns(
    new_column = pl.lit('some_text')
)

you're copying everything over to some new df, which would be really inefficient.

You're right that that would be really inefficient, but that isn't what happens. A DataFrame is just a way to organize pointers to the actual data. The hierarchy is that you have, at the top, DataFrames. Within a DataFrame are Series, which are how columns are represented. Even at the Series level, it's still just pointers, not data. A Series is made up of one or more chunks, each of which is an array that follows the Apache Arrow memory model.

When you "make a new df" all you're doing is organizing pointers, not data. The data doesn't move or copy.

By contrast, consider pandas's inplace parameter. The name certainly makes it seem like you're modifying things in place and not making copies, but according to a comment on the pandas issue tracker:

inplace does not generally do anything inplace but makes a copy and reassigns the pointer

https://github.com/pandas-dev/pandas/issues/16529#issuecomment-323890422

The crux of the issue is that in pandas most things you do make a copy (or several). In Polars, that isn't the case, so even when you assign a new df, that new df is just an outer layer that points to the data. The data doesn't move, nor is it copied, unless you specifically execute an operation that does so.

That said, there are methods that will insert columns without requiring you to use the df = df... syntax, but under the hood they don't do anything different from the preferred assignment syntax.

Dean MacGregor answered Sep 05 '25 10:09


From the Polars API, you can use df.insert_column if you have the index at which to add your new column and the data for it: https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.insert_column.html

This adds the column in place:

df.insert_column(0, data)

Jeremy answered Sep 05 '25 09:09