I'm running into a problem importing multiple small CSV files with over 250,000 float64 columns into Apache Spark 2.0 running on a Google Dataproc cluster. There are a handful of string columns as well, but I'm only really interested in one of them as the class label.
When I run the following in pyspark:
csvdata = spark.read.csv("gs://[bucket]/csv/*.csv", header=True,mode="DROPMALFORMED")
I get the following error:
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o53.csv.
: com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 20480
Hint: Number of columns processed may have exceeded limit of 20480 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
This question points to defining a class for the DataFrame to use, but is it possible to define such a large class without having to write out 210,000 entries by hand?
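If an explicit schema turns out to be necessary, it doesn't have to be written out by hand. A minimal sketch of building one programmatically, assuming the label column is a string and the remaining columns are float64 (the names "label" and "f_0", "f_1", ... are hypothetical placeholders, as is the column count):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Assumed layout: one string label column followed by n_features float64 columns
n_features = 250000
fields = [StructField("label", StringType(), True)]
fields += [StructField("f_%d" % i, DoubleType(), True) for i in range(n_features)]
schema = StructType(fields)

csvdata = spark.read.csv("gs://[bucket]/csv/*.csv", header=True, schema=schema, mode="DROPMALFORMED")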
To read multiple CSV files in Spark, you can also use the textFile() method on the SparkContext object, passing all file names comma-separated. The example below reads text01.csv and text02.csv into a single RDD.
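A minimal sketch of that approach (the bucket path and file names are just placeholders taken from the description above); note that textFile() returns raw text lines, so any splitting and type conversion is up to you:

csv_rdd = sc.textFile("gs://[bucket]/csv/text01.csv,gs://[bucket]/csv/text02.csv")
# Each element is one raw line; split on the delimiter yourself
rows = csv_rdd.map(lambda line: line.split(","))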
Use the maxColumns option:

spark.read.option("maxColumns", n).csv(...)

where n is at least the number of columns in your widest file (the parser's default limit is 20480, as the error message indicates).
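For the read in the question, that would look something like the sketch below; 250001 is just an assumed value, anything comfortably above the real column count works:

csvdata = (spark.read
    .option("maxColumns", 250001)   # assumed value; must cover the widest row
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("gs://[bucket]/csv/*.csv"))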