
How to import csv files with massive column count into Apache Spark 2.0

I'm running into a problem importing multiple small CSV files, each with over 250,000 float64 columns, into Apache Spark 2.0 running as a Google Dataproc cluster. There are a handful of string columns as well, but I'm only really interested in one of them as the class label.

When I run the following in pyspark:

csvdata = spark.read.csv("gs://[bucket]/csv/*.csv", header=True, mode="DROPMALFORMED")

I get the following error:

File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o53.csv. : com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 20480 Hint: Number of columns processed may have exceeded limit of 20480 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse Parser Configuration: CsvParserSettings:

  1. Where/how do I set the maximum number of columns for the parser so I can use the data for machine learning?
  2. Is there a better way to ingest the data for use with Spark MLlib?

This question points to defining a class for the dataframe to use, but would it be possible to define such a large class without having to create 210,000 entries by hand?
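
For what it's worth, a schema of that width doesn't have to be typed out; it can be built in a loop. Below is a minimal sketch, assuming hypothetical column names (a string "label" column followed by float columns f0, f1, ...; adjust to the real header) and using the maxColumns option from the answer below:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical layout: one string label column followed by n_features float
# columns named f0, f1, ... -- adjust names and order to the real CSV header.
n_features = 250000
fields = [StructField("label", StringType(), True)]
fields += [StructField("f%d" % i, DoubleType(), True) for i in range(n_features)]
schema = StructType(fields)

csvdata = (spark.read
           .option("maxColumns", n_features + 10)  # must be >= the real column count
           .csv("gs://[bucket]/csv/*.csv",
                header=True,
                mode="DROPMALFORMED",
                schema=schema))

With an explicit schema the float columns come back as doubles instead of strings, and Spark skips the extra pass over the files that schema inference would need.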

asked Aug 27 '16 by mobcdi



1 Answer

Use the maxColumns option:

spark.read.option("maxColumns", n).csv(...)

where n is at least the number of columns in your input.
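
Applied to the read in the question, this might look like the following (300000 is an arbitrary value; it just has to be at least the actual column count):

csvdata = (spark.read
           .option("maxColumns", 300000)  # raise the parser's 20480 default
           .csv("gs://[bucket]/csv/*.csv", header=True, mode="DROPMALFORMED"))

As for the second question, a common way into Spark MLlib is to collapse the float columns into a single vector column with VectorAssembler. A sketch, assuming a hypothetical "label" column and that every other column is numeric:

from pyspark.ml.feature import VectorAssembler

# Assemble all non-label columns into one "features" vector column.
# Assumes those columns are already numeric (e.g. via an explicit schema).
float_cols = [c for c in csvdata.columns if c != "label"]
assembler = VectorAssembler(inputCols=float_cols, outputCol="features")
training = assembler.transform(csvdata).select("label", "features")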

answered Sep 25 '22 by user6022341