The ‘Subschema’ Relation on JSON Schemas

JSON[1] is one of the most popular formats for language-agnostic data communication between software services. This could be synchronous communication, as in HTTP REST APIs, or asynchronous communication, as in Kafka messages. When services communicate, we need to ensure that the messages produced by the producer can be understood by the consumer. Within a service, it is the role of static types to check at compile time that all the intra-service communication is structurally valid[2]; between services, this is the role of the “JSON schema”[3].

Unlike within a service, where any change to an interface between code modules can update both modules in the same code change, two separate services are deployed independently, so we can’t rely on simultaneous updates. When we want to change the structure of messages between services, we therefore have two options: either we migrate in stages (change the producer to also create the new message, then change the consumer to switch over to consuming the new message, then change the producer to stop producing the old message), or producers and consumers need to be able to, at least temporarily, operate on different JSON schemas which are ‘compatible’.

More precisely, producer and consumer JSON schemas are ‘compatible’ iff all messages that can be serialized with the producer JSON schema can be deserialized with the consumer JSON schema; so for it to be possible for two different JSON schemas to be compatible, the rules for JSON deserialization need to be sufficiently relaxed. Note here how JSON schemas are implicitly associated with the sets of JSON that can be serialized and deserialized with that schema; these sets are intrinsic to the concept of a JSON schema. However, we can also ‘forget’ this detail and consider JSON schemas as algebraic entities, and with this perspective the informal notion of compatibility described above induces a formal relation on JSON schemas as algebraic entities; let’s call it the ‘subschema’ relation and denote it by \(\lesssim\). (Note that this naming and notation is only really appropriate if the relation turns out to be transitive[4], which is not yet known and is something we will come back to later.)
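Writing \(S(A)\) for the set of JSON that can be serialized with schema \(A\) and \(D(B)\) for the set that can be deserialized with schema \(B\) (the notation of footnote 5), this compatibility condition is just a set containment:

\[A \lesssim B \iff S(A) \subseteq D(B)\]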

Conversely, given the algebraic ‘subschema’ relation on JSON schemas, we can deduce the set of JSON that can be deserialized by a given schema (assuming that the set of JSON that can be serialized by a given schema is not a variable to be determined)[5]. So it is sufficient to define just the algebraic subschema relation as a way of encoding the entire meaning of JSON schemas. Given this relation we could then also easily formalise, and therefore calculate, which schema changes are ‘non-breaking’: a change to a producer schema is ‘non-breaking’ if, for all consumers of the message, the new producer schema is still a subschema of the consumer schema; similarly, a change to a consumer schema is ‘non-breaking’ if, for all producers that target this consumer, the producer schema is still a subschema of the new consumer schema.
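Spelled out symbolically, with the quantifiers ranging over the currently deployed consumers and producers of the message:

\[A \to A' \text{ non-breaking} \iff \forall B.\ A' \lesssim B \qquad\qquad B \to B' \text{ non-breaking} \iff \forall A.\ A \lesssim B'\]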

What does this subschema relation look like? Well, it encodes certain things we expect about the behaviour of JSON deserialization. Below are some examples of producer and consumer schema changes that I have included in, and excluded from, the subschema relation (a sketch of a checker implementing these rules follows the table). The list of included changes is complete (to a certain degree of precision), in the sense that the subschema relation can be defined in terms of it: \(A\) is a ‘subschema’ of \(B\) if and only if all the differences, treating each field independently, are in this set.

| Producer Schema Change | Consumer Schema Change |
| --- | --- |
| **(\(\checkmark\)) INCLUDED** | **(\(\checkmark\)) INCLUDED** |
| reordering fields | reordering fields |
| adding a field (extending a product type) | removing a field (restricting a product type) |
| restricting an enum (more generally, restricting a sum type) | extending an enum (more generally, extending a sum type) |
| changing a string to an enum | changing an enum to a string |
| changing a string to a number | changing a number to a string |
| making an optional field required | making a required field optional[6] |
| adding or removing an optional field | adding or removing an optional field |
| compatibly changing the type of a field (recursive) | compatibly changing the type of a field (recursive) |
| **(?)** | **(?)** |
| incompatibly changing the type of an optional field | incompatibly changing the type of an optional field |
| **(X) EXCLUDED** | **(X) EXCLUDED** |
| making a required field, a field with default | making a field with default, required[7] |
| changing an array to an option | changing an option to an array |
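To make the table concrete, here is a minimal Scala sketch of a checker implementing a handful of these rules over a toy schema ADT. The ADT, names and rule selection are my own illustration, not any particular JSON schema spec, and the (?) row is resolved strictly (the inner types of optional fields must be compatible):

```scala
// Toy schema model; illustrative only, not a real JSON schema spec.
sealed trait Schema
case object StringS extends Schema
case object NumberS extends Schema
final case class EnumS(values: Set[String]) extends Schema
final case class Field(schema: Schema, optional: Boolean) // optional = has a default
final case class RecordS(fields: Map[String, Field]) extends Schema

object Subschema {
  // isSubschema(a, b): every message serialized with `a` can be deserialized with `b`.
  def isSubschema(a: Schema, b: Schema): Boolean = (a, b) match {
    case (x, y) if x == y       => true
    case (EnumS(as), EnumS(bs)) => as.subsetOf(bs) // restrict enum / extend enum
    case (EnumS(_), StringS)    => true            // string -> enum / enum -> string
    case (NumberS, StringS)     => true            // string -> number / number -> string
    case (RecordS(as), RecordS(bs)) =>
      // Field order is irrelevant (a Map), and extra producer fields are ignored,
      // covering the reordering and add/remove-field rows.
      bs.forall { case (name, bf) =>
        as.get(name) match {
          case Some(af) =>
            // Present in both: inner types must be compatible (the recursive row),
            // and only the consumer may weaken required to optional.
            isSubschema(af.schema, bf.schema) && (bf.optional || !af.optional)
          case None =>
            // Missing from the producer: fine only if the consumer has a default.
            bf.optional
        }
      }
    case _ => false
  }
}
```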

Not all parts of this definition are obvious though. For example, it is not clear whether the change marked with a (?), incompatibly changing the type of an optional field, should be included in the subschema relation or not. In terms of deserialization, this corresponds to the question of whether an optional field should fall back to its default whenever deserialization to the inner type fails, or only when the field is missing. From the perspective of deserialization alone, the latter is best as it provides the strictest validation.
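The difference is easiest to see in code. Here is a self-contained sketch (a toy Json type and field names of my own, not any real library’s API) contrasting the two interpretations:

```scala
sealed trait Json
final case class JNum(value: Double) extends Json
final case class JStr(value: String) extends Json
final case class JObj(fields: Map[String, Json]) extends Json

def decodeInt(j: Json): Either[String, Int] = j match {
  case JNum(n) if n.isWhole => Right(n.toInt)
  case other                => Left(s"expected integer, got $other")
}

// Strict: the default applies only when the field is *missing*;
// a present-but-malformed value is surfaced as a decoding error.
def strictOptional(obj: JObj, name: String, default: Int): Either[String, Int] =
  obj.fields.get(name) match {
    case None    => Right(default)
    case Some(j) => decodeInt(j)
  }

// Lenient: the default applies whenever the inner decode fails,
// silently swallowing malformed values.
def lenientOptional(obj: JObj, name: String, default: Int): Either[String, Int] =
  Right(obj.fields.get(name).flatMap(j => decodeInt(j).toOption).getOrElse(default))

val malformed = JObj(Map("retries" -> JStr("three")))
strictOptional(malformed, "retries", 0)  // Left("expected integer, got JStr(three)")
lenientOptional(malformed, "retries", 0) // Right(0): the bad value goes unnoticed
```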

However, by excluding this change from the subschema relation, we would be breaking transitivity of the relation. This is because changing the type within an optional field can be composed of two changes, each of which is individually non-breaking: removing the optional field and then adding the same optional field but with a different inner type (or, worse and more likely, the producer adding a field and the consumer adding the same field as optional but with a different type due to miscommunication). If the relation were transitive, the composite would therefore also have to be non-breaking.
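Using the toy checker sketched after the table, this failure of transitivity can be checked directly (under the strict rules of that sketch, a string is not a subschema of a number):

```scala
val a = RecordS(Map("x" -> Field(StringS, optional = true)))
val b = RecordS(Map.empty) // the optional field removed
val c = RecordS(Map("x" -> Field(NumberS, optional = true)))

Subschema.isSubschema(a, b) // true:  removing an optional field
Subschema.isSubschema(b, c) // true:  adding an optional field (different inner type)
Subschema.isSubschema(a, c) // false: the composite incompatibly changes the type of x
```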

Transitivity is an important property, as it would allow us to decouple producer and consumer schema evolutions to an extent. As a producer/consumer, we could guarantee certain changes are non-breaking by only comparing the new schema with the previous version of itself, without needing to know anything about how the messages are consumed/produced[8]. More explicitly, if a producer transitions from schema \(A\) to \(A'\), then we could guarantee this is not a breaking change if \(A' \lesssim A\), since for all consumers \(B\) we have \(A \lesssim B\) and therefore by transitivity we would have \(A' \lesssim B\). Similarly for consumers: if a consumer transitions from schema \(B\) to \(B'\), we could guarantee this is not a breaking change if \(B \lesssim B'\), since for all producers we know \(A \lesssim B\), and therefore by transitivity would have \(A \lesssim B'\).
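In terms of the checker sketched earlier, these purely local checks (the function names are illustrative) are exactly what transitivity would justify:

```scala
// Safe for a producer to roll out A' in place of A, for every consumer B with A ≲ B.
def producerChangeIsSafe(oldSchema: Schema, newSchema: Schema): Boolean =
  Subschema.isSubschema(newSchema, oldSchema) // A' ≲ A, so by transitivity A' ≲ B

// Safe for a consumer to roll out B' in place of B, for every producer A with A ≲ B.
def consumerChangeIsSafe(oldSchema: Schema, newSchema: Schema): Boolean =
  Subschema.isSubschema(oldSchema, newSchema) // B ≲ B', so by transitivity A ≲ B'
```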

So which is more important? It’s a difficult one; we have to pick between strictness and mathematical elegance. Personally, I think strict deserialization of optional fields is too important to sacrifice[9], and since the non-transitive case is very rare we can still, for all intents and purposes, pretend that the subschema relation is transitive; we’re just aware in the back of our minds that there is an edge case.


  1. The grammar for JSON is defined here. For the purpose of this article it really only matters that JSON is a representation of a hierarchical data structure; the syntax doesn’t matter, and as such the discussion applies equally to equivalent grammars such as YAML. 
  2. I mentioned this in my post Classifying Software Testing, where I interpreted static types as a form of software testing that targets the ‘arrows’ between ‘code services’. 
  3. For concreteness it can help to have a JSON schema spec in mind, such as the OpenAPI Specification, or alternatively the Apache Avro Spec. 
  4. A relation \(\sim\) is ‘transitive’ iff whenever \(A \sim B\) and \(B \sim C\) then \(A \sim C\). If the subschema relation were transitive, then because it is also reflexive, it would be a preorder, and the symbol \(\lesssim\) is commonly used to denote preorder relations. 
  5. Given that \(S(A)\) represents the set of JSON that can be serialized by a given schema \(A\), the set of JSON that can be deserialized by a given schema \(B\), \(D(B)\), is given by \(D(B) = \{\, j \mid \exists A.\ A \lesssim B \text{ and } j \in S(A) \,\}\). The proof in the \(\impliedby\) direction is trivial; the \(\implies\) direction requires a construction. 
  6. In this table, and in the article in general, you can read “optional” as shorthand for a “field with default”. You can also interpret optional as the special case of a field with default where the field may take the value null and where null is the default value. This is the most common use case for fields with default (and arguably fields with default should be restricted to this use case). 
  7. Note that in Avro schema evolution this is one of the allowed changes, which is something I disagree with as it breaks transitivity. It breaks transitivity because, if making a “field with default” required is an allowed change for a consumer schema, then, since adding a field with default is also allowed, transitivity would require the composite – adding a required field – to be allowed, which is clearly breaking. This is the GitHub issue where the non-transitivity of Avro compatibility was first raised. There is no problem with excluding this change from the subschema relation, as long as we interpret \(S(A)\) – the set of JSON that can be serialized by a given schema \(A\) – more widely, allowing fields with defaults to be absent from the serialization. 
  8. This is particularly important for the server in a client-server relationship as there are usually multiple clients to one server that aren’t always kept track of. For the client though it might make more sense to test directly against the server schema. 
  9. I think most serialization/deserialization libraries now agree on this approach – the derived deserializers in the Scala Circe library are a good model example. However, I have seen inconsistencies both in the past and in current approaches. For example, Avro allows incompatibly changing the type of an enum with a default value, as demonstrated in this GitHub gist, yet it wouldn’t allow incompatibly changing an optional field, as unions are treated differently (which does make me wonder why the Scala Vulcan library doesn’t derive option deserializers with a default null, as originally I thought that was the reason). Also, in the Scala Json4s library, before the introduction of the TreatAsAbsent case in the nullExtractionStrategy, it wasn’t possible to simultaneously disallow nulls and have strict option parsing (as it would fail to parse a JSON null as an Option). 