Truncate a string with pyspark

Question

I am working with PySpark on Databricks and am looking for a way to truncate a string, much like the Excel RIGHT function does. For example, in the ID column of a DataFrame, I would like to change 8841673_3 into 8841673.

Does anybody know how I should proceed?

Alper t. Turker · Accepted Answer

Regular expressions with regexp_extract:

from pyspark.sql.functions import regexp_extract

df = spark.createDataFrame([("8841673_3", )], ("id", ))

# Raw string so that \d reaches the regex engine without an invalid-escape warning in Python
df.select(regexp_extract("id", r"^(\d+)_.*", 1)).show()
# +--------------------------------+
# |regexp_extract(id, ^(\d+)_.*, 1)|
# +--------------------------------+
# |                         8841673|
# +--------------------------------+
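
To see what the pattern captures without a Spark session, here is a minimal plain-Python sketch. It uses the standard re module as a stand-in for Spark's regex engine, which is an assumption that holds for this simple pattern, and extract_prefix is a hypothetical helper, not part of PySpark. It mimics the documented behaviour of regexp_extract returning an empty string when the pattern does not match.

```python
import re

# Same pattern as in the answer; the raw string keeps the backslash intact.
PATTERN = r"^(\d+)_.*"


def extract_prefix(value: str) -> str:
    """Hypothetical helper mimicking regexp_extract(value, PATTERN, 1)."""
    match = re.match(PATTERN, value)
    # regexp_extract yields an empty string when the pattern does not match.
    return match.group(1) if match else ""


print(extract_prefix("8841673_3"))        # 8841673
print(repr(extract_prefix("8841673")))    # '' (no underscore, so no match)
print(repr(extract_prefix("AB12_3")))     # '' (does not start with digits)
```

So an ID that has no underscore, or that does not start with digits, would come back empty with this pattern rather than unchanged.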

regexp_replace:

from pyspark.sql.functions import regexp_replace

df.select(regexp_replace("id", "_.*$", "")).show()
# +--------------------------+
# |regexp_replace(id, _.*$, )|
# +--------------------------+
# |                   8841673|
# +--------------------------+

or just split:

from pyspark.sql.functions import split

df.select(split("id", "_")[0]).show()
# +---------------+
# |split(id, _)[0]|
# +---------------+
# |        8841673|
# +---------------+
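
The replace and split variants can be compared on edge cases with a small plain-Python sketch. This again uses the standard re module and str.split as stand-ins for the Spark functions (an assumption, reasonable here because "_" has no special meaning as a regex), and both helper names are hypothetical.

```python
import re


def strip_suffix(value: str) -> str:
    """Hypothetical stand-in for regexp_replace(value, "_.*$", "")."""
    return re.sub(r"_.*$", "", value)


def first_part(value: str) -> str:
    """Hypothetical stand-in for split(value, "_")[0]."""
    return value.split("_")[0]


for raw in ["8841673_3", "8841673", "12_34_56"]:
    # Both approaches keep everything before the first underscore.
    print(raw, "->", strip_suffix(raw), first_part(raw))
# 8841673_3 -> 8841673 8841673
# 8841673 -> 8841673 8841673
# 12_34_56 -> 12 12
```

Both variants leave an ID without an underscore untouched and cut at the first underscore when there are several.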

Tags: python, apache-spark, apache-spark-sql, pyspark

Frederi ROSE
