How to read only n rows of large CSV file on HDFS using spark-csv package?

I have a big distributed file on HDFS, and each time I use sqlContext with the spark-csv package, it first loads the entire file, which takes quite some time.

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path")

Since I sometimes just want to run a quick check, all I need is a few (any n) rows of the file.

df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").take(n)
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").head(n)

But all of these run only after the file load is done. Can't I restrict the number of rows while reading the file itself? I am looking for a spark-csv equivalent of the nrows parameter in pandas, like:

pd_df = pandas.read_csv("file_path", nrows=20)

Or it may be that Spark does not actually load the file in that first step; but in that case, why does my file load step take so long?

I want

df.count()

to return only n rather than the total number of rows. Is that possible?

asked May 31 '17 06:05

2 Answers

You can use limit(n).

sqlContext.read.format('com.databricks.spark.csv') \
          .options(header='true', inferschema='true').load("file_path").limit(20)

This limits the resulting DataFrame to 20 rows.

answered Sep 24 '22 01:09

My understanding is that reading just a few lines is not supported by the spark-csv module directly. As a workaround, you could read the file as a text file, take as many lines as you want, and save them to some temporary location. With the lines saved, you could use spark-csv to read them, including the inferSchema option (which you may want to use, given that you are in exploration mode).

val numberOfLines = ...
spark.
  read.
  text("myfile.csv").
  limit(numberOfLines).
  write.
  text(s"myfile-$numberOfLines.csv")
val justFewLines = spark.
  read.
  option("inferSchema", true). // <-- you are in exploration mode, aren't you?
  csv(s"myfile-$numberOfLines.csv")

answered Sep 24 '22 01:09

Jacek Laskowski

Tags: apache-spark, apache-spark-sql, spark-csv, pyspark, hdfs

Asked by: Abhishek

Answers by: eliasah, Jacek Laskowski