I'm working with Spark 2.2.1. With the Python code below I can use @ as the escape character. I also want to escape special characters such as newline (\n) and carriage return (\r). I replaced the @ with \n, but it didn't work. Any suggestions, please?
Working:
spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="@")
Not Working:
spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="\n")
If your goal is to read a CSV whose text fields contain embedded newlines, the way to go is Spark's multiLine option.
I recently posted some Scala code for this:
val df = spark.read
  .option("wholeFile", true) // pre-release name of the option; harmless if unrecognized
  .option("multiLine", true) // a quoted field may span several lines
  .option("header", true)
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("test.csv")
The Python syntax will be slightly different but should work well.
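For reference, a minimal PySpark sketch of the same reader. It assumes Spark 2.2 or later, where multiLine is accepted as a keyword argument of spark.read.csv, and an already created SparkSession named spark. The helper name read_multiline_csv is hypothetical and only there to keep the example self-contained.

```python
def read_multiline_csv(spark, path):
    """Read a CSV whose quoted fields may contain line breaks.

    `spark` is an existing SparkSession; this helper name is hypothetical.
    """
    return spark.read.csv(
        path,
        header=True,
        inferSchema=True,
        multiLine=True,  # keep line breaks inside quoted fields in one record
        dateFormat="yyyy-MM-dd",
        timestampFormat="yyyy-MM-dd HH:mm:ss",
    )
```

It would be called as spark_df = read_multiline_csv(spark, "test.csv"). Note that multiLine only helps when the fields containing line breaks are enclosed in quotes.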
You can achieve this using pandas.
Sample Code:
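import pandas as pd

pandas_df = pd.read_csv("file.csv")
# r'\r' and r'\n' match real carriage-return and newline characters;
# r'\\r' would only match a literal backslash followed by "r".
pandas_df = pandas_df.replace({r'\r': '', r'\n': ''}, regex=True)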
You can replace any special character with the above code snippet.
Later on you can convert the pandas_df to spark_df as needed.
spark_df = sqlContext.createDataFrame(pandas_df)
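As a small illustration of the pandas approach, here is a sketch that assumes the goal is to remove actual carriage-return and newline characters from quoted fields. The sample data and column names are made up for the example, and each run of line breaks is replaced with a single space rather than an empty string so that words do not run together.

```python
import io

import pandas as pd

# Hypothetical sample: the quoted "note" field contains a CR/LF pair.
raw = 'id,note\n1,"first line\r\nsecond line"\n2,plain\n'

pandas_df = pd.read_csv(io.StringIO(raw))

# With regex=True the pattern is applied to string values only;
# numeric columns such as "id" are left untouched.
pandas_df = pandas_df.replace(r"[\r\n]+", " ", regex=True)

print(pandas_df)
```

Keep in mind that this loads the whole file into memory on the driver, so it is only practical for files that fit on a single machine.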