How would you go about employing and/or implementing a case class equivalent in PySpark?
case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays.
Scala case classes are just regular classes which are immutable by default and decomposable through pattern matching. It uses equal method to compare instance structurally. It does not use new keyword to instantiate object. All the parameters listed in the case class are public and immutable by default.
Case Classes in PythonCaseClass is not actually a class, but a function that returns a class. The function takes in any number of named parameters and their types (Scala is strongly typed after-all). The function raises a ValueError if your parameter value is not a type.
A class can extend another class, whereas a case class can not extend another case class (because it would not be possible to correctly implement their equality).
As mentioned by Alex Hall a real equivalent of named product type, is a namedtuple
.
Unlike Row
, suggested in the other answer, it has a number of useful properties:
Has well defined shape and can be reliably used for structural pattern matching:
>>> from collections import namedtuple
>>>
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42
In contrast Rows
are not reliable when used with keyword arguments:
>>> from pyspark.sql import Row
>>>
>>> foobar = Row(foo=42, bar=-42)
>>> foo, bar = foobar
>>> foo
-42
>>> bar
42
although if defined with positional arguments:
>>> FooBar = Row("foo", "bar")
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42
the order is preserved.
Define proper types
>>> from functools import singledispatch
>>>
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> type(FooBar)
<class 'type'>
>>> isinstance(FooBar(42, -42), FooBar)
True
and can be used whenever type handling is required, especially with single:
>>> Circle = namedtuple("Circle", ["x", "y", "r"])
>>> Rectangle = namedtuple("Rectangle", ["x1", "y1", "x2", "y2"])
>>>
>>> @singledispatch
... def area(x):
... raise NotImplementedError
...
...
>>> @area.register(Rectangle)
... def _(x):
... return abs(x.x1 - x.x2) * abs(x.y1 - x.y2)
...
...
>>> @area.register(Circle)
... def _(x):
... return math.pi * x.r ** 2
...
...
>>>
>>> area(Rectangle(0, 0, 4, 4))
16
>>> >>> area(Circle(0, 0, 4))
50.26548245743669
and multiple dispatch:
>>> from multipledispatch import dispatch
>>> from numbers import Rational
>>>
>>> @dispatch(Rectangle, Rational)
... def scale(x, y):
... return Rectangle(x.x1, x.y1, x.x2 * y, x.y2 * y)
...
...
>>> @dispatch(Circle, Rational)
... def scale(x, y):
... return Circle(x.x, x.y, x.r * y)
...
...
>>> scale(Rectangle(0, 0, 4, 4), 2)
Rectangle(x1=0, y1=0, x2=8, y2=8)
>>> scale(Circle(0, 0, 11), 2)
Circle(x=0, y=0, r=22)
and combined with the first property, there can be used in wide ranges of pattern matching scenarios. namedtuples
also support standard inheritance and type hints.
Rows
don't:
>>> FooBar = Row("foo", "bar")
>>> type(FooBar)
<class 'pyspark.sql.types.Row'>
>>> isinstance(FooBar(42, -42), FooBar) # Expected failure
Traceback (most recent call last):
...
TypeError: isinstance() arg 2 must be a type or tuple of types
>>> BarFoo = Row("bar", "foo")
>>> isinstance(FooBar(42, -42), type(BarFoo))
True
>>> isinstance(BarFoo(42, -42), type(FooBar))
True
Provide highly optimized representation. Unlike Row
objects, tuple don't use __dict__
and carry field names with each instance. As a result there are can be order of magnitude faster to initialize:
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> %timeit FooBar(42, -42)
587 ns ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
compared to different Row
constructors:
>>> %timeit Row(foo=42, bar=-42)
3.91 µs ± 7.67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> FooBar = Row("foo", "bar")
>>> %timeit FooBar(42, -42)
2 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and are significantly more memory efficient (very important property when working with large scale data):
>>> import sys
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> sys.getsizeof(FooBar(42, -42))
64
compared to equivalent Row
>>> sys.getsizeof(Row(foo=42, bar=-42))
72
Finally attribute access is order of magnitude faster with namedtuple
:
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> foobar = FooBar(42, -42)
>>> %timeit foobar.foo
102 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
compared to equivalent operation on Row
object:
>>> foobar = Row(foo=42, bar=-42)
>>> %timeit foobar.foo
2.58 µs ± 26.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Last but not least namedtuples
are properly supported in Spark SQL
>>> Record = namedtuple("Record", ["id", "name", "value"])
>>> spark.createDataFrame([Record(1, "foo", 42)])
DataFrame[id: bigint, name: string, value: bigint]
Summary:
It should be clear that Row
is a very poor substitute for an actual product type, and should be avoided unless enforced by Spark API.
It should be also clear that pyspark.sql.Row
is not intended to be a replacement of a case class when you consider that, it is direct equivalent of org.apache.spark.sql.Row
- type which is pretty far from an actual product, and behaves like Seq[Any]
(depending on a subclass, with names added). Both Python and Scala implementations were introduced as an useful, albeit awkward interface between external code and internal Spark SQL representation.
See also:
It would be a shame not to mention awesome MacroPy developed by Li Haoyi and its port (MacroPy3) by Alberto Berti:
>>> import macropy.console
0=[]=====> MacroPy Enabled <=====[]=0
>>> from macropy.case_classes import macros, case
>>> @case
... class FooBar(foo, bar): pass
...
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42
which comes with a rich set of other features including, but not limited to, advanced pattern matching and neat lambda expression syntax.
Python dataclasses
(Python 3.7+).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With