Extract file extension from Pyspark Dataframe column

I have a pyspark dataframe with a column FullPath.

How can I use the function os.path.splitext(FullPath) to extract the extension of each entry in the FullPath column and put them in a new column?

Thanks.

Ahmad asked Jan 01 '26 00:01

1 Answer

You can use pyspark.sql.functions.regexp_extract() to extract the file extension:

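import pyspark.sql.functions as f

data = [
    ('/tmp/filename.tar.gz',)
]

# sqlCtx is an existing SQLContext; a SparkSession's createDataFrame works the same way
df = sqlCtx.createDataFrame(data, ["FullPath"])
# Group 0 is the whole match: a dot followed by lowercase letters/digits at the end of the string.
# The raw string (r"...") keeps Python from treating "\." as an escape sequence.
df.withColumn("extension", f.regexp_extract("FullPath", r"\.[0-9a-z]+$", 0)).show()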
#+--------------------+---------+
#|            FullPath|extension|
#+--------------------+---------+
#|/tmp/filename.tar.gz|      .gz|
#+--------------------+---------+
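
To get a feel for what that pattern does and doesn't match, here is a small sketch that uses Python's re module as a stand-in for Spark's regex engine. That substitution is an assumption that holds for this simple pattern, and regex_extension is just an illustrative helper name, not part of PySpark. It returns an empty string when nothing matches, which is what regexp_extract does:

```python
import re

# Same pattern as in the answer; Python's re stands in for Spark's regex engine here.
PATTERN = re.compile(r"\.[0-9a-z]+$")

def regex_extension(path):
    """Mimic regexp_extract(..., 0): the matched text, or '' when nothing matches."""
    match = PATTERN.search(path)
    return match.group(0) if match else ""

for p in ["/tmp/filename.tar.gz", "/tmp/REPORT.PDF", "/tmp/noext"]:
    print(repr(p), "->", repr(regex_extension(p)))
```

Note that an uppercase extension such as .PDF is not matched by this character class; widening it to [0-9A-Za-z] would be one way to handle that.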

However, if you want to use os.path.splitext(), you need a udf (which will be slower than the alternative above):

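import os
from pyspark.sql.types import StringType

# Wrap os.path.splitext in a udf; the last element of the (root, ext) tuple is the extension
splitext_udf = f.udf(lambda FullPath: os.path.splitext(FullPath)[-1], StringType())
df.withColumn("extension", splitext_udf("FullPath")).show()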
#+--------------------+---------+
#|            FullPath|extension|
#+--------------------+---------+
#|/tmp/filename.tar.gz|      .gz|
#+--------------------+---------+
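
One caveat with the udf route: a null in FullPath reaches the Python function as None, and os.path.splitext(None) raises a TypeError. A minimal sketch of a null-safe helper, written as plain Python so it can be tried without Spark (safe_extension is a hypothetical name, not something from the original answer):

```python
import os

def safe_extension(full_path):
    """Return the extension via os.path.splitext, or None when the path is missing."""
    if full_path is None:
        return None
    # splitext returns (root, ext); ext includes the leading dot, or is '' if there is none
    return os.path.splitext(full_path)[1]

for p in ["/tmp/filename.tar.gz", "/tmp/REPORT.PDF", "/tmp/.bashrc", None]:
    print(repr(p), "->", repr(safe_extension(p)))
```

This function could then be passed to f.udf(safe_extension, StringType()) in place of the lambda. Unlike the lowercase-only regex, os.path.splitext keeps uppercase extensions, and it treats a leading dot in a name like .bashrc as part of the file name rather than as an extension.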
pault answered Jan 03 '26 19:01
