 

Custom delimiter CSV reader in Spark

I would like to read a file with the following structure using Apache Spark.

628344092\t20070220\t200702\t2007\t2007.1370 

The delimiter is \t. How can I implement this while using spark.read.csv()?

The CSV is far too big for pandas, which takes ages to read the file. Is there something that works similar to

pandas.read_csv(file, sep = '\t') 

Thanks a lot!

asked Sep 21 '17 by inneb

People also ask

How do I read a CSV file with delimiter in Spark?

Use spark.read.option("delimiter", "\t").csv(file), or pass sep instead of delimiter.

How do you specify delimiter in PySpark?

To make Spark treat "||" as the delimiter, specify sep as "||" explicitly in option() when reading the file. Spark assumes "," as the default delimiter.


1 Answer

Use spark.read.option("delimiter", "\t").csv(file) or sep instead of delimiter.

If the delimiter is literally \t (backslash followed by "t"), not the tab special character, escape the backslash: spark.read.option("delimiter", "\\t").csv(file)

answered Sep 30 '22 by T. Gawęda