
Why does Spark SQL consider the support of indexes unimportant?

Quoting the Spark DataFrames, Datasets and SQL manual:

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.

Being new to Spark, I'm a bit baffled by this for two reasons:

  1. Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?

  2. Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, which is not really big data. Scanning a table with billions of records can make simple queries of the "SELECT x WHERE y=z" type take forever instead of returning immediately.

I understand that indexes have disadvantages, like slower INSERT/UPDATE, space requirements, and so on. But in my use case, I first process and load a large batch of data into Spark SQL and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
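
To make the pattern concrete, here is roughly what the workflow looks like (paths and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("explore").getOrCreate()
import spark.implicits._

// One-off distributed processing and load:
val data = spark.read.parquet("/data/large_batch") // illustrative path
data.cache() // keep the result around for interactive exploration

// Interactive exploration -- without an index, every query like this
// is a full scan of the cached data:
data.filter($"y" === "z").show()
```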

I'm wondering, then, why the Spark SQL team considers indexes unimportant to the degree that they are off its road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?

asked Apr 29 '16 by hillel


1 Answer

Indexing input data

  • The fundamental reason why indexing over external data sources is not in Spark's scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using, it cannot reliably monitor changes and, as a consequence, cannot maintain indices.
  • If a data source supports indexing, Spark can utilize it indirectly through mechanisms like predicate pushdown (see the sketch below).
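
As a minimal sketch of that indirect route, assume a JDBC source backed by a PostgreSQL table records with an index on column y (the connection details are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pushdown").getOrCreate()
import spark.implicits._

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb") // hypothetical
  .option("dbtable", "records")
  .load()

// The equality filter is pushed down to the database, which can answer
// it with its own index instead of Spark scanning the whole table:
val hits = df.filter($"y" === "z")
hits.explain() // the physical plan lists it under PushedFilters
```

Whether a given filter actually reaches the source depends on the connector and the expression; explain() shows what was pushed.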

Indexing distributed data structures

  • Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
  • A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing and maintaining indices (see the sketch after this list). This is a common pattern used by different in-memory columnar systems.
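
For illustration, a minimal sketch of that layout-based approach, assuming most queries filter on a column y (names and paths are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("layout").getOrCreate()
import spark.implicits._

val events = Seq(("a", 1), ("z", 2), ("z", 3)).toDF("y", "x")

// Write partitioned by the hot filter column; Parquet adds columnar
// storage and compression on top:
events.write.partitionBy("y").parquet("/data/events_by_y") // illustrative path

// At read time Spark prunes partitions, so only the y=z directory
// is ever touched -- no index required:
spark.read.parquet("/data/events_by_y").filter($"y" === "z").show()
```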

That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides a Data Skipping Index on its platform.

Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.

Of course, this raises the question: if you require efficient random access, why not use a system designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.

answered Oct 20 '22 by zero323