
Spark - how to get filename with parent folder from dataframe column

I am using PySpark. I added a column that holds the full path of the source file for each row:

from pyspark.sql.functions import input_file_name
data = data.withColumn("sourcefile",input_file_name())

I want to retrieve only the filename with its parent folder from this column. Please help.

Example:

Inputfilename = "adl://dotdot.com/ingest/marketing/abc.json"

The output I am looking for is:

marketing/abc.json

Note: I can do this with a plain string operation, but the file path is a column in a dataframe, so I need a way to do it on the column.

Hemant Chandurkar asked Jan 28 '23 02:01


1 Answer

If you want to keep the value in a dataframe column, you can use regexp_extract from pyspark.sql.functions. Apply it to the column that holds the path and pass a regular expression whose first capture group matches the part you want, here the last two path segments:

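from pyspark.sql.functions import input_file_name, regexp_extract

data = data.withColumn("sourcefile", input_file_name())

# Capture the last two path segments: parent folder + "/" + filename
regex_str = r"/([^/]+/[^/]+)$"
data = data.withColumn("sourcefile", regexp_extract("sourcefile", regex_str, 1))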
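
To check the pattern without starting a Spark session, here is a minimal sketch that runs an equivalent expression (written without the redundant escapes) through Python's re module on the example path from the question. The helper name parent_and_filename is hypothetical and used only for illustration. Spark evaluates regexp_extract with Java regular expressions rather than Python's, but this pattern uses only basic constructs that should behave the same in both. Returning an empty string on no match mirrors what regexp_extract is documented to do.

```python
import re

# Last two path segments: parent folder, a slash, then the filename.
REGEX_STR = r"/([^/]+/[^/]+)$"


def parent_and_filename(path: str) -> str:
    """Return 'parent/filename', or '' when the pattern does not match."""
    match = re.search(REGEX_STR, path)
    return match.group(1) if match else ""


if __name__ == "__main__":
    print(parent_and_filename("adl://dotdot.com/ingest/marketing/abc.json"))
```

A path with no parent folder segment, such as a bare "abc.json", produces an empty string rather than an error, so it may be worth filtering or handling those rows separately.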
Marcial Gonzalez answered Feb 02 '23 09:02