Does Spark preserve record order when reading in ordered files?

Tags:

apache-spark

I'm using Spark to read in records (in this case in csv files) and process them. The files are already in some order, but this order isn't reflected by any column (think of it as a time series, but without any timestamp column -- each row is just in a relative order within the file). I'd like to use this ordering information in my Spark processing, to do things like comparing a row with the previous row. I can't explicitly order the records, since there is no ordering column.

Does Spark maintain the order of records it reads in from a file? Or, is there any way to access the file-order of records from Spark?

asked Aug 22 '17 by Jason Evans


2 Answers

Yes, when reading from a file, Spark maintains the order of records within each partition. But as soon as a shuffle occurs, that order is lost. So to preserve the ordering, you either need to write your job so that no shuffle occurs, or you attach a sequence number to each record as it is read and use those sequence numbers during processing.

In a distributed framework like Spark, where data is split across the cluster for fast processing, shuffling is almost inevitable. So the best solution is to assign a sequential number to each row as it is read and use that number for ordering.

answered Oct 04 '22 by Ramesh Maharjan


Order is not preserved when the data is shuffled. You can, however, enumerate the rows before doing your computations. If you are using an RDD, there's a function called zipWithIndex (RDD[T] => RDD[(T, Long)]) that does exactly what you are looking for.

answered Oct 04 '22 by Miguel