
Escape New line character in Spark CSV read

I'm working with Spark 2.2.1 and, using the Python code below, I can escape special characters like @. Now I want to escape newline (\n) and carriage return (\r) characters as well. I replaced the @ with \n, but it didn't work. Any suggestions, please?

Working:

spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="@")

Not Working:

spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="\n")
asked Feb 15 '18 by data_addict
2 Answers

If your goal is to read a CSV whose textual content contains embedded newlines, the way to go is Spark's multiline option.

Here is the Scala version I posted recently:

val df = spark.read
.option("wholeFile", true)
.option("multiline",true)
.option("header", true)
.option("inferSchema", "true")
.option("dateFormat", "yyyy-MM-dd")
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
.csv("test.csv")

The Python syntax will be slightly different but should work just as well.
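A rough PySpark equivalent of the Scala snippet above might look like this (a sketch, not tested against a real cluster; "test.csv" is a placeholder path, and the option names are the same in both APIs):

```python
# Hypothetical PySpark translation of the Scala multiline read above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-csv").getOrCreate()

df = (spark.read
      .option("wholeFile", True)
      .option("multiline", True)   # keep quoted fields with embedded \n as one record
      .option("header", True)
      .option("inferSchema", True)
      .option("dateFormat", "yyyy-MM-dd")
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
      .csv("test.csv"))
```

Note that with multiline enabled, a newline inside a quoted field is kept as part of the value rather than starting a new record, which is usually what you want for free-text columns.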

answered Oct 10 '22 by parisni


You can achieve this using pandas.

Sample Code:

import pandas as pd

pandas_df = pd.read_csv("file.csv")
# regex replace: strip embedded carriage returns and newlines from string values
# (note: the pattern must be r'\r' / r'\n' to match the control characters;
# r'\\r' would match a literal backslash followed by "r")
pandas_df = pandas_df.replace({r'\r': ''}, regex=True)
pandas_df = pandas_df.replace({r'\n': ''}, regex=True)

You can replace any special character with the above code snippet.
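As a quick self-contained check (using an in-memory CSV instead of a real file), the replace calls strip CR/LF characters embedded inside string values:

```python
import io
import pandas as pd

# In-memory CSV with a quoted field containing an embedded newline.
csv_text = 'id,comment\n1,"hello\nworld"\n2,"plain"\n'
pandas_df = pd.read_csv(io.StringIO(csv_text))

# Strip carriage returns and newlines inside string values;
# numeric columns are left untouched by the regex replace.
pandas_df = pandas_df.replace({r'\r': ''}, regex=True)
pandas_df = pandas_df.replace({r'\n': ''}, regex=True)

print(pandas_df.loc[0, 'comment'])  # helloworld
```

pandas applies the regex as a substring substitution, so only the newline is removed and the rest of the value is preserved.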

Later on, you can convert pandas_df back to a Spark DataFrame as needed.

spark_df = sqlContext.createDataFrame(pandas_df)
answered Oct 10 '22 by data_addict