I have a pyspark dataframe with a column FullPath.
How can I use the function os.path.splitext(FullPath) to extract the extension of each entry in the FullPath column and put them in a new column?
Thanks.
You can use pyspark.sql.functions.regexp_extract() to extract the file extension:
import pyspark.sql.functions as f
data = [
('/tmp/filename.tar.gz',)
]
df = sqlCtx.createDataFrame(data, ["FullPath"])
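# Raw string so that "\." reaches the regex engine without Python treating it as an escape sequence
df.withColumn("extension", f.regexp_extract("FullPath", r"\.[0-9a-z]+$", 0)).show()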
#+--------------------+---------+
#|            FullPath|extension|
#+--------------------+---------+
#|/tmp/filename.tar.gz|      .gz|
#+--------------------+---------+
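However, if you want to use os.path.splitext(), you need to wrap it in a udf, which will be slower than the regexp_extract() approach above:
import os
from pyspark.sql.types import StringType
# os.path.splitext returns (root, ext); index 1 is the extension. Nulls are passed through.
splitext_udf = f.udf(lambda path: os.path.splitext(path)[1] if path is not None else None, StringType())
df.withColumn("extension", splitext_udf("FullPath")).show()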
#+--------------------+---------+
#|            FullPath|extension|
#+--------------------+---------+
#|/tmp/filename.tar.gz|      .gz|
#+--------------------+---------+
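The two approaches agree on the sample path, but they are not equivalent for every input. The plain-Python sketch below (no Spark needed) compares them on a few made-up paths. The helper names ext_by_regex and ext_by_splitext are hypothetical, and ext_by_regex only approximates regexp_extract with group index 0, assuming an empty string is returned when the pattern does not match.

```python
import os
import re

# Same pattern as in the regexp_extract call: a dot followed by
# lower-case letters or digits at the end of the string.
EXT_PATTERN = re.compile(r"\.[0-9a-z]+$")


def ext_by_regex(path):
    """Approximate regexp_extract(col, pattern, 0): whole match, or '' if none."""
    match = EXT_PATTERN.search(path)
    return match.group(0) if match else ""


def ext_by_splitext(path):
    """What the udf computes for a non-null path."""
    return os.path.splitext(path)[1]


# Illustrative paths only; they are not taken from the question.
sample_paths = [
    "/tmp/filename.tar.gz",  # both give ".gz"
    "/tmp/REPORT.CSV",       # pattern is lower-case only, so the regex finds nothing
    "/tmp/.bashrc",          # splitext treats a leading dot as part of the name
    "/tmp/noext",            # no extension either way
]

for p in sample_paths:
    print(repr(p), repr(ext_by_regex(p)), repr(ext_by_splitext(p)))
```

If upper-case extensions matter, one option is to widen the character class to [0-9A-Za-z] in the pattern. Whether dotfiles such as .bashrc should count as having an extension depends on your data.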