I was trying to search whether there would be a way to change the dtypes for the strings with numbers easily. For example, the problem I face is as follows:
df = pl.DataFrame({"foo":
["100CT pen", "pencils 250CT", "what 125CT soever", "this is a thing"]}
)
I could extract and create a new column named {"bar": ["100", "250", "125", ""]}. But then I couldn't find a handy function that converts this column to Int64 or float dtypes so that the result is [100, 250, 125, null].
Also, vice versa. Sometimes it would be useful to have a handy function that converts the column of [100, 250, 125, 0] to ["100", "250", "125", "0"]. Is it something that already exists?
The easiest way to accomplish this is with the cast expression.
To cast from a string to an integer (or float):
import polars as pl
df = pl.DataFrame({"bar": ["100", "250", "125", ""]})
df.with_columns(pl.col('bar').cast(pl.Int64, strict=False).alias('bar_int'))
shape: (4, 2)
┌─────┬─────────┐
│ bar ┆ bar_int │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════════╡
│ 100 ┆ 100 │
│ 250 ┆ 250 │
│ 125 ┆ 125 │
│ ┆ null │
└─────┴─────────┘
A handy list of available datatypes is here. These are all aliased under polars, so you can refer to them easily (e.g., pl.UInt64).
For the data you describe, I recommend using strict=False to avoid having one mangled number among millions of records result in an exception that halts everything.
The same process can be used to convert numbers to strings - in this case, the utf8 datatype.
Let me modify your dataset slightly:
df = pl.DataFrame({"bar": [100.5, 250.25, 1250000, None]})
df.with_columns(pl.col("bar").cast(pl.String, strict=False).alias("bar_string"))
shape: (4, 2)
┌────────┬────────────┐
│ bar ┆ bar_string │
│ --- ┆ --- │
│ f64 ┆ str │
╞════════╪════════════╡
│ 100.5 ┆ 100.5 │
│ 250.25 ┆ 250.25 │
│ 1.25e6 ┆ 1250000.0 │
│ null ┆ null │
└────────┴────────────┘
If you need more control over the formatting, you can use the map_elements method and Python's new f-string formatting.
df.with_columns(
pl.col("bar").map_elements(lambda x: f"This is ${x:,.2f}!").alias("bar_fstring")
)
shape: (4, 2)
┌────────┬────────────────────────┐
│ bar ┆ bar_fstring │
│ --- ┆ --- │
│ f64 ┆ str │
╞════════╪════════════════════════╡
│ 100.5 ┆ This is $100.50! │
│ 250.25 ┆ This is $250.25! │
│ 1.25e6 ┆ This is $1,250,000.00! │
│ null ┆ null │
└────────┴────────────────────────┘
I found this web page to be a handy reference for those unfamiliar with f-string formatting.
As an addition to @cbilot 's answer.
You don't need to use slow python lambda functions to use special string formatting of expressions. Polars has a format function for this purpose:
df = pl.DataFrame({"bar": ["100", "250", "125", ""]})
df.with_columns(
pl.format("This is {}!", pl.col("bar"))
)
shape: (4, 2)
┌─────┬──────────────┐
│ bar ┆ literal │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════════╡
│ 100 ┆ This is 100! │
│ 250 ┆ This is 250! │
│ 125 ┆ This is 125! │
│ ┆ This is ! │
└─────┴──────────────┘
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With