I'm using Spark to read in records (in this case in csv files) and process them. The files are already in some order, but this order isn't reflected by any column (think of it as a time series, but without any timestamp column -- each row is just in a relative order within the file). I'd like to use this ordering information in my Spark processing, to do things like comparing a row with the previous row. I can't explicitly order the records, since there is no ordering column.
Does Spark maintain the order of records it reads in from a file? Or, is there any way to access the file-order of records from Spark?
Thanks for the guidance, but when Spark reads a file, it surely stores the data it has read somewhere. So where does it store this data? And if it doesn't store it, what actually happens when the file is read?
Alternatively, the Spark DataFrame/Dataset class also provides an orderBy() function to sort on one or more columns. By default, it orders in ascending order. This returns the same output as the previous section.
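A minimal sketch of orderBy() in Scala, assuming a CSV file with a hypothetical column named "value":

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("orderByExample").getOrCreate()

// Hypothetical CSV with a header row and a "value" column.
val df = spark.read.option("header", "true").csv("data.csv")

val asc  = df.orderBy("value")            // ascending by default
val desc = df.orderBy(df("value").desc)   // explicit descending sort
asc.show()
```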
Sorting in Spark is a multiphase process which requires shuffling:
- the input RDD is sampled, and this sample is used to compute the boundaries for each output partition (sample followed by collect)
- the input RDD is partitioned using a RangePartitioner with the boundaries computed in the first step (partitionBy)
- each output partition is then sorted locally
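A small sketch of a sort that goes through these phases; the sortBy call below samples the RDD, range-partitions it (a shuffle), and then sorts each partition locally:

```scala
// Assumes the SparkSession from the earlier sketch.
val rdd = spark.sparkContext.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))

// Triggers the sample / range-partition / local-sort phases described above.
val sortedRdd = rdd.sortBy(identity)
sortedRdd.collect().foreach(println)
```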
Saving the text files: Spark provides a function called saveAsTextFile(), which takes a path and writes the content of the RDD to files under that path. The path is treated as a directory, and multiple output files will be produced in that directory (one per partition).
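A minimal sketch (the output path is hypothetical):

```scala
// saveAsTextFile treats "output/records" as a directory and writes one part file per partition.
val lines = spark.sparkContext.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
lines.saveAsTextFile("output/records")   // produces output/records/part-00000, part-00001, ...
```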
Yes, when reading from a file, Spark maintains the order of records. But when shuffling occurs, the order is not preserved. So to preserve the order, you either need to write your program so that no shuffling occurs, or you assign sequence numbers to the records and use those sequence numbers while processing.
In a distributed framework like Spark, where data is divided across a cluster for fast processing, shuffling of data is bound to occur. So the best solution is to assign a sequential number to each row and use that number for ordering, as in the sketch below.
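A hedged sketch of that approach in Scala: attach an index with zipWithIndex before any shuffle, then use it as an explicit ordering column. The input path and the "value" column are hypothetical:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = spark.read.option("header", "true").csv("data.csv")

// Attach a sequential index reflecting the read order, before any shuffle can reorder rows.
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val indexedSchema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexed = spark.createDataFrame(indexedRdd, indexedSchema)

// The index is now an ordinary column, so it survives shuffles and can order a window,
// e.g. to compare each row with the previous one. Note that a window without partitionBy
// pulls all rows into a single partition, which is fine for modest data sizes.
val w = Window.orderBy("row_idx")
val withPrev = indexed.withColumn("prev_value", lag("value", 1).over(w))
withPrev.show()
```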
Order is not preserved when the data is shuffled. You can, however, enumerate the rows before doing your computations. If you are using an RDD, there is a function called zipWithIndex (RDD[T] => RDD[(T, Long)]) that does exactly what you are searching for.
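For example, a minimal RDD-level sketch (the file path is hypothetical):

```scala
// zipWithIndex pairs each record with its position, which follows the file read order.
val lines = spark.sparkContext.textFile("data.csv")
val indexed = lines.zipWithIndex()        // RDD[(String, Long)]
indexed.take(5).foreach { case (line, idx) => println(s"$idx: $line") }
```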