Our JSON parser, jiter (https://github.com/pydantic/jiter) already supports iterative parsing, so it's "just" a matter of solving the lifetimes in pydantic-core to validate as we parse.
This should make pydantic around 3x faster at parsing JSON and significantly reduce the memory overhead.
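For reference, jiter already ships Python bindings; a minimal sketch assuming its from_json API (which today still materializes full Python objects — the validate-while-parsing part is the future work described above):

import jiter

# Parse bytes straight to Python objects via the same Rust core
# that pydantic-core would hook into to validate as it parses.
data = jiter.from_json(b'{"users": [{"id": 1}, {"id": 2}]}')
assert data == {"users": [{"id": 1}, {"id": 2}]}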
This looks highly reminiscent (though not exactly the same, pedants) of why people used to get excited about using SAX instead of DOM for xml parsing.
1. Inefficient parser implementation. It's just... very easy to allocate way too much memory if you don't think about large-scale documents, and very difficult to measure. Common problem with many (but not all) JSON parsers.
2. CPython's in-memory representation is large compared to compiled languages. So e.g. a 4-digit integer is 5-6 bytes in JSON (digits plus a separator), 8 in Rust if you use an i64, and 28 in CPython on a 64-bit build. An empty dictionary is 64 bytes.
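Easy to check the CPython side from a REPL (64-bit build; sys.getsizeof counts only the object itself, not anything it references):

>>> import sys
>>> sys.getsizeof(1234)   # a 4-digit int as a CPython object
28
>>> sys.getsizeof({})     # an empty dict
64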
At least JSON or CSV is better than the ad hoc homegrown formats you find at medium-sized companies, left over from the '90s and '00s.
Let's imagine the file is mostly full of single-digit numbers with no spaces (so lists like 2,4,1,0,9,3...), i.e. roughly two bytes per number in the JSON. In CPython we spend about 40 bytes storing each of those numbers.
Make a minimal-sized class to store an integer:
class JsonInt:
    x = 1
That object's size is already 48 bytes. And usually we store floats from JSON; the size of 1.0 as a Python float is 24 bytes.
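Spot-checking those numbers on a recent 64-bit CPython:

>>> import sys
>>> sys.getsizeof(JsonInt())  # the instance alone, before any __dict__ fills in
48
>>> sys.getsizeof(1.0)        # a Python float object
24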
Now, you can get smaller, but as soon as you introduce any kind of class structure, or defer parsing numbers until they are used (in case you want people to be able to interpret them as ints or floats), you blow through a 20x memory-size increase over the raw JSON.
But... why? Assuming they aren't BigInts or similar, these are at most 8 bytes of actual data. This overhead is ridiculous.
Using classes should enable you to be much smaller than the JSON representation, not larger. For example, V8 does it with hidden classes: https://v8.dev/docs/hidden-classes
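For scale: you can get near the raw 8 bytes per value in CPython too, but only by dropping per-object boxing entirely. A rough sketch with the stdlib array module (sizes approximate, 64-bit build assumed):

import sys
from array import array

n = 1_000_000
packed = array("q", range(n))        # contiguous signed 64-bit ints
print(sys.getsizeof(packed) / n)     # ~8 bytes per value

boxed = list(range(n))
# A list holds 8-byte pointers; each distinct int object adds ~28 bytes.
print(sys.getsizeof(boxed) / n + 28) # ~36 bytes per value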
> not parsing numbers until they are used
Doesn't this defeat the point of pydantic? It's supposed to be checking that the model is valid as it's loaded, using jiter. If the data is valid it can be loaded into an efficient representation, and if it's not, the errors can be emitted while iterating over it.
This is CPython. This is how it works. It's not particularly related to JSON. That sort of overhead is put on everything. It just hurts the most when the thing you're putting the overhead on is a single integer. It hurts less when you're doing it to, say, a multi-kilobyte string.
Even in your V8 example, that's a JIT optimization, not "how the language works". If you break that optimization, which you can do at any moment with any change in your code base, you're back to similar sizes.
Boxing everything lets you easily implement the dynamic scripting language's way of treating everything as an Object of some sort, but it comes at a price. There's a reason dynamic scripting languages, even after the JIT has come through, are generally substantially slower languages. This isn't the only reason, but it's a significant part of it.
The whole point of the V8 optimization is that it works in the face of prototype chains that merge, etc., as you add new fields dynamically, so if you change your code base it adapts.
Are you able to share a snippet that reproduces what you're seeing?
>>> from typing import Any, Dict
>>> from pydantic import BaseModel, Field
>>> class A(BaseModel):
...     a: int
...
>>> class B(BaseModel):
...     b: A
...
>>> class C(BaseModel):
...     c: B | Dict[str, Any]
...
>>> C.model_validate({'c': {'b': {'a': 1}}})
C(c=B(b=A(a=1)))
>>> C.model_validate({'c': {'b': {'a': '1'}}})  # smart union: falls back to the dict
C(c={'b': {'a': '1'}})
>>> class C(BaseModel):
...     c: B | Dict[str, Any] = Field(union_mode='left_to_right')
...
>>> C.model_validate({'c': {'b': {'a': '1'}}})  # B now wins and coerces '1' -> 1
C(c=B(b=A(a=1)))
You can have nested dataclasses, as well as specify custom serializers/loaders for things which aren't natively supported by json.
Calling `x: str = json.dumps(MyClass(...).serialize())` will get you json you can recover to the original object, nested classes and custom types and all, with `MyClass.load(json.loads(x))`
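A minimal sketch of that pattern, with hand-rolled serialize/load methods (the names come from the parent comment; the date field stands in for "custom types"):

from dataclasses import dataclass
from datetime import date
import json

@dataclass
class Inner:
    when: date                       # not natively JSON-serializable

    def serialize(self) -> dict:
        return {"when": self.when.isoformat()}

    @classmethod
    def load(cls, data: dict) -> "Inner":
        return cls(when=date.fromisoformat(data["when"]))

@dataclass
class MyClass:
    name: str
    inner: Inner                     # nested dataclass

    def serialize(self) -> dict:
        return {"name": self.name, "inner": self.inner.serialize()}

    @classmethod
    def load(cls, data: dict) -> "MyClass":
        return cls(name=data["name"], inner=Inner.load(data["inner"]))

obj = MyClass("a", Inner(date(2024, 1, 1)))
x: str = json.dumps(obj.serialize())
assert MyClass.load(json.loads(x)) == obj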
I know nothing about your context, but in what context would a single model need to support so many permutations of a data structure? Just because software can, doesn't mean it should.
Just tracking payments through multiple tax regions will explode the places where things need to be tweaked.
Automatic, statically typed deserialization is worth the trouble, in my opinion.
>>> from dataclasses import dataclass
>>> @dataclass
... class C: pass
...
>>> C().x = 1
>>> @dataclass(slots=True)
... class D: pass
...
>>> D().x = 1
Traceback (most recent call last):
File "<python-input-4>", line 1, in <module>
D().x = 1
^^^^^
AttributeError: 'D' object has no attribute 'x' and no __dict__ for setting new attributes
Most of the time this is not a thing you actually need to do. If you're using dataclasses it's less of an issue, because dataclasses.asdict exists.
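And asdict works on slots dataclasses too, since it reads the declared fields rather than an instance __dict__:

>>> from dataclasses import dataclass, asdict
>>> @dataclass(slots=True)
... class D:
...     x: int
...
>>> asdict(D(1))
{'x': 1}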
The ijson article linked from the original article was the inspiration for the talk: https://pythonspeed.com/articles/json-memory-streaming/
For example, if you are querying a DB that returns a column as a JSON string, it's trivial with Pydantic to parse that column's JSON as part of deserialization with an annotation.
Pydantic is definitely slower and not a 'zero cost abstraction', but you do get a lot for it.
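A sketch of that DB-column case using pydantic v2's Json annotation (model and field names made up for illustration):

from typing import Any, Dict
from pydantic import BaseModel, Json

class Row(BaseModel):
    id: int
    payload: Json[Dict[str, Any]]    # the DB returns this column as a JSON string

print(Row.model_validate({"id": 1, "payload": '{"a": 1}'}))
# id=1 payload={'a': 1}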
* You still need to load all the bytes into memory before handing them to msgspec for decoding
* You can decode a subset of fields, which is really helpful (see the sketch below)
* Reusing msgspec decoders saves some cpu cycles https://jcristharif.com/msgspec/perf-tips.html#reuse-encoder...
Slides 17, 18, 19 have an example of the first two points https://pythonspeed.com/pycon2025/slides/#17
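A sketch of the second and third bullets, assuming msgspec's documented Struct and json.Decoder APIs:

import msgspec

# Declare only the fields you care about; msgspec Structs ignore
# unknown keys by default, so this decodes a subset of the payload.
class User(msgspec.Struct):
    id: int
    name: str

# Build the decoder once and reuse it across calls, instead of
# paying to set up the decode machinery on every msgspec.json.decode.
decoder = msgspec.json.Decoder(User)

raw = b'{"id": 1, "name": "ada", "bio": "...", "followers": 12345}'
user = decoder.decode(raw)           # User(id=1, name='ada')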