I'm using Spark to read in records (in this case in csv files) and process them. The files are already in some order, but this order isn't reflected by any column (think of it as a time series, but without any timestamp column -- each row is just in a relative order within the file). I'd like to use this ordering information in my Spark processing, to do things like comparing a row with the previous row. I can't explicitly order the records, since there is no ordering column.
Does Spark maintain the order of records it reads in from a file? Or, is there any way to access the file-order of records from Spark?
Thanks for the guidance, but when Spark reads a file, it surely stores the data it has read somewhere. So where does it store this data? And if it doesn't store it, what actually happens when the file is read?
Alternatively, the Spark DataFrame/Dataset class also provides an orderBy() function to sort on one or more columns. By default, it orders in ascending order. This returns the same output as the previous section.
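A minimal sketch of orderBy() in Scala, assuming a CSV file with a hypothetical column named "value":

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("orderByExample").getOrCreate()

// Hypothetical CSV with a header row and a "value" column.
val df = spark.read.option("header", "true").csv("data.csv")

val asc  = df.orderBy("value")            // ascending by default
val desc = df.orderBy(df("value").desc)   // explicit descending sort
asc.show()
```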
Sorting in Spark is a multiphase process which requires shuffling:
- the input RDD is sampled, and this sample is used to compute the boundaries for each output partition (sample followed by collect)
- the input RDD is partitioned using a RangePartitioner with the boundaries computed in the first step (partitionBy)
- each output partition is then sorted locally
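A small sketch of a sort that goes through these phases; the sortBy call below samples the RDD, range-partitions it (a shuffle), and then sorts each partition locally:

```scala
// Assumes the SparkSession from the earlier sketch.
val rdd = spark.sparkContext.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))

// Triggers the sample / range-partition / local-sort phases described above.
val sortedRdd = rdd.sortBy(identity)
sortedRdd.collect().foreach(println)
```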
Saving the text files: Spark provides a function called saveAsTextFile(), which takes a path and writes the content of the RDD to files under that path. The path is treated as a directory, and multiple output files will be produced in that directory (one per partition).
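A minimal sketch (the output path is hypothetical):

```scala
// saveAsTextFile treats "output/records" as a directory and writes one part file per partition.
val lines = spark.sparkContext.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
lines.saveAsTextFile("output/records")   // produces output/records/part-00000, part-00001, ...
```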
Yes, when reading from a file, Spark maintains the order of records. But when shuffling occurs, the order is not preserved. So to preserve the order, you either need to write your program so that no shuffling occurs, or you assign sequence numbers to the records and use those sequence numbers while processing.
In a distributed framework like Spark, where data is divided across a cluster for fast processing, shuffling of data is bound to occur. So the best solution is to assign a sequential number to each row and use that number for ordering, as in the sketch below.
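A hedged sketch of that approach in Scala: attach an index with zipWithIndex before any shuffle, then use it as an explicit ordering column. The input path and the "value" column are hypothetical:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = spark.read.option("header", "true").csv("data.csv")

// Attach a sequential index reflecting the read order, before any shuffle can reorder rows.
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val indexedSchema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexed = spark.createDataFrame(indexedRdd, indexedSchema)

// The index is now an ordinary column, so it survives shuffles and can order a window,
// e.g. to compare each row with the previous one. Note that a window without partitionBy
// pulls all rows into a single partition, which is fine for modest data sizes.
val w = Window.orderBy("row_idx")
val withPrev = indexed.withColumn("prev_value", lag("value", 1).over(w))
withPrev.show()
```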
Order is not preserved when the data is shuffled. You can, however, enumerate the rows before doing your computations. If you are using an RDD, there is a function called zipWithIndex (RDD[T] => RDD[(T, Long)]) that does exactly what you are searching for.
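For example, a minimal RDD-level sketch (the file path is hypothetical):

```scala
// zipWithIndex pairs each record with its position, which follows the file read order.
val lines = spark.sparkContext.textFile("data.csv")
val indexed = lines.zipWithIndex()        // RDD[(String, Long)]
indexed.take(5).foreach { case (line, idx) => println(s"$idx: $line") }
```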