Currently Spark has two implementations for Row:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
What is the need to have both of them? Do they represent the same encoded entities, with one used internally (internal APIs) and the other used with the external APIs?
Yes, you are correct that both Row and InternalRow serve a similar purpose in representing a row of data. Still, they are designed for different use cases and environments within Spark:

- Row: externally facing and immutable; used by developers when interacting with DataFrames and Datasets.
- InternalRow: internally facing, mutable, and performance-optimized; used by Spark's engine for executing and optimizing queries.
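For illustration, here is a minimal sketch of the public Row API (the column values here are made up for the example):

```scala
import org.apache.spark.sql.Row

// Construct a Row from arbitrary values (public, immutable API)
val row = Row("Alice", 30)

// Typed accessors read fields by position
val name = row.getString(0) // "Alice"
val age  = row.getInt(1)    // 30

// Row is immutable: there are no setters, so a field cannot be
// modified in place; you build a new Row instead
val updated = Row(name, age + 1)
```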
Why Does Spark Have Both Row and InternalRow?

Row:

- Row is part of the public API, meaning it is intended for external use by developers when interacting with DataFrames and Datasets.
- Row is immutable, making it safe to use in parallel processing without concerns about accidentally modifying data.

InternalRow:

- InternalRow is designed for internal use by Spark's Catalyst optimizer and query execution engine. It is optimized for performance, particularly for memory usage and access speed.
- Unlike Row, InternalRow is mutable, allowing Spark to modify row data in place during execution. This mutability is critical for internal operations, such as query optimization and evaluation, where data must be adjusted on the fly.

External API (Row):

- Row provides a user-friendly interface to structured data, making it straightforward to work with.
- Row is used in user-facing operations like querying, displaying, and transforming data.

Internal API (InternalRow):

- InternalRow is designed for internal processes where speed and memory efficiency are critical.
- InternalRow is used in operations like logical and physical query plan execution, where the overhead of a more user-friendly interface would be detrimental to performance.

Do They Represent the Same Encoded Entities?

Row and InternalRow ultimately represent the same concept: a row of data in a structured format (like a DataFrame). However, they are optimized for different purposes:
- Row is optimized for usability and safety in the context of public APIs.
- InternalRow is optimized for performance and flexibility within Spark's internal execution engine.
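As a sketch of the internal side (these are Catalyst-internal APIs that can change between Spark versions), an InternalRow can be mutated in place and stores strings in Spark's UTF8String format rather than as java.lang.String:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.unsafe.types.UTF8String

// Allocate a generic InternalRow with two fields
val irow: InternalRow = new GenericInternalRow(2)

// Strings must be stored as UTF8String, the internal representation
irow.update(0, UTF8String.fromString("Alice"))
irow.setInt(1, 30)

// Mutable: the engine can overwrite a field in place during execution
irow.setInt(1, 31)

irow.getInt(1) // 31
```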
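To see that the two types really encode the same data, Spark's encoders can convert between them. This is a hedged sketch: the exact entry point varies by version (RowEncoder(schema) works roughly in Spark 3.0-3.4; later versions expose the equivalent via ExpressionEncoder(schema)):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

// An encoder bridges the external (Row) and internal (InternalRow)
// representations; resolveAndBind is needed on the deserializer side
val encoder      = RowEncoder(schema)
val toInternal   = encoder.createSerializer()
val fromInternal = encoder.resolveAndBind().createDeserializer()

val internal = toInternal(Row("Alice", 30)) // InternalRow
val external = fromInternal(internal)       // back to a public Row
```

This round trip is what Spark itself performs at the boundary between user code and the execution engine, e.g. in Dataset.collect() and createDataFrame().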