I'm running into a problem importing multiple small CSV files with over 250,000 float64 columns into Apache Spark 2.0 running on a Google Dataproc cluster. There are a handful of string columns as well, but I'm only really interested in one of them as the class label.
When I run the following in pyspark:
csvdata = spark.read.csv("gs://[bucket]/csv/*.csv", header=True,mode="DROPMALFORMED")
I get the following error:
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o53.csv.
: com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 20480
Hint: Number of columns processed may have exceeded limit of 20480 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
This question points to defining a class for the DataFrame to use, but is it possible to define such a large class without having to write out 210,000 entries by hand?
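If an explicit schema turns out to be necessary, it doesn't have to be written out by hand. A minimal sketch of building one programmatically, assuming the label column is a string and the remaining columns are float64 (the names "label" and "f_0", "f_1", ... are hypothetical placeholders, as is the column count):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Assumed layout: one string label column followed by n_features float64 columns
n_features = 250000
fields = [StructField("label", StringType(), True)]
fields += [StructField("f_%d" % i, DoubleType(), True) for i in range(n_features)]
schema = StructType(fields)

csvdata = spark.read.csv("gs://[bucket]/csv/*.csv", header=True, schema=schema, mode="DROPMALFORMED")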
To read multiple CSV files in Spark, you can also use the textFile() method on the SparkContext object, passing all file names comma-separated. The example below reads text01.csv and text02.csv into a single RDD.
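A minimal sketch of that approach (the bucket path and file names are just placeholders taken from the description above); note that textFile() returns raw text lines, so any splitting and type conversion is up to you:

csv_rdd = sc.textFile("gs://[bucket]/csv/text01.csv,gs://[bucket]/csv/text02.csv")
# Each element is one raw line; split on the delimiter yourself
rows = csv_rdd.map(lambda line: line.split(","))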
Use the maxColumns option:

spark.read.option("maxColumns", n).csv(...)

where n is at least the number of columns in your widest file (the parser's default limit is 20480, as the error message indicates).
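For the read in the question, that would look something like the sketch below; 250001 is just an assumed value, anything comfortably above the real column count works:

csvdata = (spark.read
    .option("maxColumns", 250001)   # assumed value; must cover the widest row
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("gs://[bucket]/csv/*.csv"))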