 

Custom delimiter CSV reader in Spark

I would like to read a file with the following structure using Apache Spark.

628344092\t20070220\t200702\t2007\t2007.1370 

The delimiter is \t. How can I implement this while using spark.read.csv()?

The CSV is far too big for pandas, which takes ages to read the file. Is there something that works similar to

pandas.read_csv(file, sep = '\t') 

Thanks a lot!

asked Sep 21 '17 by inneb

People also ask

How do I read a CSV file with delimiter in Spark?

Use spark.read.option("delimiter", "\t").csv(file), or pass sep instead of delimiter.

How do you specify delimiter in PySpark?

To make Spark treat "||" as the delimiter, specify sep as "||" explicitly in option() when reading the file. Spark assumes "," as the default delimiter.


1 Answer

Use spark.read.option("delimiter", "\t").csv(file) or sep instead of delimiter.

If the delimiter is literally \t (backslash followed by "t"), not the tab special character, escape the backslash: spark.read.option("delimiter", "\\t").csv(file)

answered Sep 30 '22 by T. Gawęda