Is it possible to select a potentially non-existent column from a polars dataframe without exceptions (return a column with default values or null/None)?
The behavior I really want can be shown in the example as follows:
import polars as pl
df1 = pl.DataFrame({"id": [1, 2, 3], "bar": ["sugar", "ham", "spam"]})
df2 = pl.DataFrame({"id": [4, 5, 6], "other": ["a", "b", "b"]})
df1.write_csv("df1.csv")
df2.write_csv("df2.csv")
df = pl.scan_csv("df*.csv").select(["id", "bar"])
res = df.collect()
Running the code above raises an error because df2.csv does not contain the column "bar". The result I want is for res to hold only the contents of df1.csv; in other words, df2.csv should be skipped because it has no "bar" column.
As already mentioned in the comments, this functionality doesn't exist in polars, but we can write a function that fulfills your needs:
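import glob
import polars as pl

def scan_csv_with_columns(file: str, needed_colnames: list[str]) -> pl.LazyFrame:
    file_collector = []
    # sorted() keeps the file order deterministic across platforms.
    for filename in sorted(glob.glob(file)):
        df_scan = pl.scan_csv(filename)
        # Keep the file only if it contains every needed column.
        if set(needed_colnames).issubset(df_scan.columns):
            file_collector.append(df_scan.select(needed_colnames))
    # Stack the matching files on top of each other.
    return pl.concat(file_collector, how="vertical")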
file = "df*.csv"
needed_colnames = ["id", "bar"]
df = scan_csv_with_columns(file, needed_colnames)
df.collect()
shape: (3, 2)
┌─────┬───────┐
│ id  ┆ bar   │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ sugar │
│ 2   ┆ ham   │
│ 3   ┆ spam  │
└─────┴───────┘
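You can do that with pl.selectors.matches and a regex pattern. Columns whose names match are selected, and names in the pattern that do not exist in the frame (col4 below) are ignored instead of raising an error: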
import polars as pl

df = pl.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})
# "col4" does not exist, so only col1 and col3 are returned.
print(
    df.select(
        pl.selectors.matches("^col1$|^col3$|^col4$")
    )
)
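Writing the anchored pattern by hand gets error-prone once there are many column names, so it can be generated instead. The helper below is hypothetical (exact_name_pattern is not part of polars) and uses only Python's re module. It assumes the escapes produced by re.escape are also accepted by the regex engine polars uses, which should hold for ordinary column names but is worth checking for unusual ones.

```python
import re

def exact_name_pattern(names):
    """Build a regex that matches any one of `names` exactly."""
    # re.escape keeps characters such as "." or "(" from acting as regex syntax.
    alternatives = "|".join(re.escape(name) for name in names)
    # The non-capturing group makes both anchors apply to every alternative.
    return f"^(?:{alternatives})$"

pattern = exact_name_pattern(["col1", "col3", "col4"])
print(pattern)  # ^(?:col1|col3|col4)$
```

The resulting string can then be passed to pl.selectors.matches in place of the handwritten pattern.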