Convert singular values into lists when parsing Pydantic fields

I have an application that needs to parse some configuration. These structures often contain fields that can be either a string, or an array of strings, e.g. in YAML:

fruit: apple
vegetable:
  - tomato
  - cucumber

However, internally I'd like to have fruit=['apple'] and vegetable=['tomato', 'cucumber'] for uniformity.

I'd like to use Pydantic to do the parsing. How do I declare these fields with as little repetition as possible?

Ideal solution:

  • Retains the type of fruit and vegetable as list[str] for both typechecking (mypy) and at runtime.
  • Has at most one line of code per field, even when used in multiple classes.
  • Generalizes to arbitrary types, not just str.

I have considered:

  • fruit: Union[str, list[str]] - I will have to check the type of fruit everywhere it's used, which defeats the purpose of parsing in the first place.
  • @validator("fruit", pre=True) that converts a non-list to a list - this will have to be repeated for every field in every class, bloating the definition by 3 extra lines per field (sketched below).
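
For reference, this is roughly what that per-field validator approach looks like (a hypothetical sketch assuming Pydantic v1; the class and field names are only examples):

from pydantic import BaseModel, validator

class Groceries(BaseModel):
    fruit: list[str]
    vegetable: list[str]

    # Repeated for every such field in every class:
    @validator("fruit", pre=True)
    def fruit_to_list(cls, v):
        return v if isinstance(v, list) else [v]

    @validator("vegetable", pre=True)
    def vegetable_to_list(cls, v):
        return v if isinstance(v, list) else [v]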
Asked by Koterpillar

1 Answer

Suggested approach

Writing a custom field validator is indeed the way to go IMHO. As a general rule, it is a good idea to define the model in terms of the schema you want at the end of the parsing process, not in terms of what you might get.

We can apply a few tricks to reduce code repetition to a minimum.

Firstly, we can define the validator as "catch-all" by registering it for the field wildcard "*".

Next, we leverage the ability to receive the respective ModelField instance as an optional argument in our validator method. It carries the (crucial) information about the shape of the field.

Finally, we rely as much as possible on the built-in "smarts" of Pydantic models and their lenient type coercion/conversion, so that we do not have to worry about specifics such as "are we dealing with a tuple or a list type field?".
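
As a small illustration of that leniency (a hypothetical sketch assuming Pydantic v1 on Python 3.9+; the model and field names are made up), a generator of strings is happily coerced to the declared container and element types:

from pydantic import BaseModel

class Example(BaseModel):
    xs: frozenset[int]

# The generator is consumed and each string element is coerced to int:
print(Example(xs=(c for c in "123")))  # xs=frozenset({1, 2, 3})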

If we go about this the right way, we can construct a highly generic validator that does what you want (and more), only needs to be registered once on our custom base model, and is then inherited and used by all subclasses.


Working implementation (Pydantic v1)

from typing import Any, ClassVar
from pydantic import BaseModel as PydanticBaseModel, validator
from pydantic.fields import (
    SHAPE_DEQUE,
    SHAPE_FROZENSET,
    SHAPE_LIST,
    SHAPE_SEQUENCE,
    SHAPE_SET,
    SHAPE_TUPLE,
    SHAPE_TUPLE_ELLIPSIS,
    ModelField,
)

ITERABLE_SHAPES = {
    SHAPE_DEQUE,
    SHAPE_FROZENSET,
    SHAPE_LIST,
    SHAPE_SEQUENCE,
    SHAPE_SET,
    SHAPE_TUPLE,
    SHAPE_TUPLE_ELLIPSIS,
}

class BaseModel(PydanticBaseModel):
    __split_sep__: ClassVar[str] = ","

    @validator("*", pre=True)
    def split_str(cls, v: Any, field: ModelField) -> Any:
        if not isinstance(v, str):
            return v  # allow default Pydantic validator to take over
        generator = (item.strip() for item in v.split(cls.__split_sep__))
        if field.shape == SHAPE_TUPLE:
            return tuple(generator)  # because Pydantic checks length first
        if field.shape in ITERABLE_SHAPES:
            return generator
        return v

    @validator('*', pre=True)
    def discard_empty_str_elements(cls, v: Any, field: ModelField) -> Any:
        if field.type_ is str and field.shape in ITERABLE_SHAPES:
            return (item for item in v if item != "")
        return v

The special case for SHAPE_TUPLE is only necessary because fixed-length tuple types have a meaningful length, since you can declare the specific type of each tuple element in advance. Pydantic performs the length check before converting the object to a tuple, and this breaks if we pass it a generator (because a generator has no length).
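
To see why that matters, here is a hypothetical sketch (Pydantic v1) of what happens if a generator reaches a fixed-length tuple field directly:

from pydantic import BaseModel

class Fixed(BaseModel):
    t: tuple[int, int]

# Pydantic checks len(v) against the declared tuple length before
# converting; a generator has no len(), so this raises an error
# instead of validating:
Fixed(t=(c for c in "12"))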

If we don't need to accommodate that special case, for example because we only ever use ellipsis-style tuple fields (x: tuple[int, ...]), the entire validator gets even shorter:

    @validator("*", pre=True)
    def split_str(cls, v: Any, field: ModelField) -> Any:
        if isinstance(v, str) and field.shape in ITERABLE_SHAPES:
            return (item.strip() for item in v.split(cls.__split_sep__))
        return v

The second validator discard_empty_str_elements is just for added convenience, so that leading, trailing or consecutive separators do not result in empty elements.

We can rely on the fact that validators for the same field are called in the order they were defined.


Demo

Here is a little demo with "strings-only" input data (JSON rather than YAML, but the same idea):

class GroceryList(BaseModel):
    fruits: list[str]
    vegetables: set[str]


class NumericStuff(BaseModel):
    x: tuple[int, int, int]
    y: frozenset[float]


groceries_data = '''{
    "fruits": ",apple,,orange,",
    "vegetables": "    tomato, cucumber"
}'''
groceries = GroceryList.parse_raw(groceries_data)
print(groceries)
print(groceries.json(indent=4))

numeric_data = '''{
    "x": "1,2,69",
    "y": "3.14"
}'''
numbers = NumericStuff.parse_raw(numeric_data)
print(numbers)
print(numbers.json(indent=4))

Output:

fruits=['apple', 'orange'] vegetables={'tomato', 'cucumber'}
{
    "fruits": [
        "apple",
        "orange"
    ],
    "vegetables": [
        "tomato",
        "cucumber"
    ]
}
x=(1, 2, 69) y=frozenset({3.14})
{
    "x": [
        1,
        2,
        69
    ],
    "y": [
        3.14
    ]
}

As you can see, no additional code is needed in the subclasses at all. The validators also work for different iterable types such as frozenset or deque, not just list. And the best thing is of course the fact that the (first) validator does not care at all about the actual type of the field. It only cares about the shape. So list[str] works just as well as frozenset[float].

Naturally, the way the validators are designed, the models still work just fine with already "valid" input data:

print(GroceryList.parse_raw('{"fruits": ["mango"], "vegetables": ["carrot", "cabbage"]}'))

(There is no set equivalent in JSON, so an array here and in the output above is the only sane way to represent this.)

Output: fruits=['mango'] vegetables={'carrot', 'cabbage'}

Lastly, as far as I can see, the way the validators are set up, they should not break or affect other field validation in any way. You should still be able to use any non-sequence type fields the same as before.
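
For example (a small sketch reusing the BaseModel defined above; the field names are made up), a plain str field containing a comma passes through untouched:

class Mixed(BaseModel):
    note: str          # singleton shape, so both validators leave it alone
    tags: list[str]

print(Mixed.parse_raw('{"note": "alpha, beta", "tags": "alpha, beta"}'))
# note='alpha, beta' tags=['alpha', 'beta']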


PS

Of course, if you don't like the base model and catch-all approach for whatever reason, you can instead define the validator as a normal function and apply it selectively as a reusable validator.

This will of course mean one additional line of code for every model you want to use it on. Example:

...

def split_str(v: Any, field: ModelField) -> Any:
    if isinstance(v, str) and field.shape in ITERABLE_SHAPES:
        return (item.strip() for item in v.split(","))
    return v


class Model(PydanticBaseModel):
    a: list[int]
    b: list[str]

    _split_str = validator("a", "b", pre=True, allow_reuse=True)(split_str)


print(Model.parse_raw('{"a": "-1,-2,-3", "b": "foo,bar"}'))

Output: a=[-1, -2, -3] b=['foo', 'bar']

I could imagine this approach being better if you care about performance, because the catch-all validator is obviously called for every field. But since you were talking about parsing config files, and that is typically not done millions of times per run of the program, I would assume the overhead is negligible.

Answered by Daniil Fajnberg

