Extract file extension from Pyspark Dataframe column

Question

I have a pyspark dataframe with a column FullPath.

How can I use the function os.path.splitext(FullPath) to extract the extension of each entry in the FullPath column and put them in a new column?

Thanks.

pault · Accepted Answer

You can use pyspark.sql.functions.regexp_extract() to extract the file extension:

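import pyspark.sql.functions as f

data = [
    ('/tmp/filename.tar.gz',)
]

# sqlCtx is assumed to be an existing SQLContext
df = sqlCtx.createDataFrame(data, ["FullPath"])
# Group 0 is the whole match: a dot plus lowercase letters/digits at the end of the string
df.withColumn("extension", f.regexp_extract("FullPath", r"\.[0-9a-z]+$", 0)).show()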
#+--------------------+---------+
#|            FullPath|extension|
#+--------------------+---------+
#|/tmp/filename.tar.gz|      .gz|
#+--------------------+---------+
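
To get a feel for which paths this pattern matches, here is a rough sketch in plain Python. It assumes Python's re module behaves like Spark's Java regex engine for this simple pattern, and regex_extension is a hypothetical helper name, not part of PySpark. It returns an empty string when nothing matches, which is also what regexp_extract() does.

```python
import re

# Same pattern as in the answer; Python's re stands in for Spark's Java regex (assumption).
EXT_PATTERN = re.compile(r"\.[0-9a-z]+$")

def regex_extension(path):
    """Return the matched extension, or "" when the pattern does not match."""
    match = EXT_PATTERN.search(path)
    return match.group(0) if match else ""

# A few illustrative paths (made up for this sketch).
for path in ["/tmp/filename.tar.gz", "/tmp/README", "/tmp/photo.JPG", "/tmp/.bashrc"]:
    print(repr(path), "->", repr(regex_extension(path)))
```

Note that the character class only covers lowercase letters and digits, so an uppercase extension such as .JPG is not matched, while a dotfile such as .bashrc is returned whole.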

However, if you want to use os.path.splitext(), you need to wrap it in a udf, which will be slower than the regexp_extract() approach because every row has to pass through Python:

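import os
from pyspark.sql.types import StringType

# Wrap os.path.splitext in a udf; [-1] picks the extension from the (root, ext) tuple
splitext_udf = f.udf(lambda FullPath: os.path.splitext(FullPath)[-1], StringType())
df.withColumn("extension", splitext_udf("FullPath")).show()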
#+--------------------+---------+
#|            FullPath|extension|
#+--------------------+---------+
#|/tmp/filename.tar.gz|      .gz|
#+--------------------+---------+
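
If the column can contain nulls, the lambda in the udf would raise a TypeError, because Spark hands null values to Python as None. A minimal sketch of a null-safe function that could be wrapped in the udf instead, shown in plain Python so it runs without Spark (splitext_extension is a hypothetical name):

```python
import os

def splitext_extension(path):
    """Return the extension per os.path.splitext, or None for a missing path."""
    if path is None:
        return None
    # splitext returns (root, ext); ext includes the leading dot or is "".
    return os.path.splitext(path)[-1]

# Illustrative inputs (made up for this sketch).
for path in ["/tmp/filename.tar.gz", "/tmp/README", "/tmp/photo.JPG", "/tmp/.bashrc", None]:
    print(repr(path), "->", repr(splitext_extension(path)))
```

Unlike the lowercase-only regex, os.path.splitext() keeps uppercase extensions such as .JPG and treats a leading-dot name such as .bashrc as having no extension, so the two approaches can disagree on such paths.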

Tags: python, dataframe, pyspark

Ahmad
