json-everything

.Net Decimals are Weird

I’ve discovered another odd consequence of what is probably fully intentional code: 4m != 4.0m. Okay, that’s not strictly true, but it does seem so if you’re comparing the values in JSON.

```csharp
var a = 4m;
var b = 4.0m;

JsonNode jsonA = a;
JsonNode jsonB = b;

// use .IsEquivalentTo() from Json.More.Net
Assert.True(jsonA.IsEquivalentTo(jsonB)); // fails!
```

What?! This took me so long to find…

What’s happening (brother)

The main insight is contained in this StackOverflow answer. decimal has the ability to retain significant digits! Even if those digits are expressed in code!! So when we type 4.0m in C# code, the compiler tells System.Decimal that the .0 is important.

When the value is printed (e.g. via .ToString()), even without specifying a format, you get 4.0 back. And this includes when serializing to JSON. If you debug the code above, you’ll see that a has a value of 4 while b has a value of 4.0, even before it gets to the JsonNode assignments. While this doesn’t affect numeric equality, it could affect equality that relies on the string representation of the number (like in JSON).

How this bit me

In developing a new library for JSON-e support (spoiler, I guess), I found a test that was failing, and I couldn’t understand why. I won’t go into the full details here, but JSON-e supports expressions, and one of the tests has the expression 4 == 3.2 + 0.8. Simple enough, right? So why was I failing this?

When getting numbers from JSON throughout all of my libraries, I chose to use decimal because I felt it was more important to support JSON’s arbitrary precision with decimal’s higher precision rather than using double for a bit more range. So when parsing the above expression, I get a tree that looks like this:

```
  ==
 /  \
4    +
    / \
  3.2  0.8
```

where each of the numbers is represented as a JsonNode with a decimal underneath. When the system processes 3.2 + 0.8, it gives me 4.0.

As I said before, numeric comparisons between decimals work fine. But in these expressions, == doesn’t compare just numbers; it compares JsonNodes. And it does so using my .IsEquivalentTo() extension method, found in Json.More.Net.

What’s wrong with the extension?

When I built the extension method, I already had one for JsonElement. (It handles everything correctly, too.) However, JsonNode doesn’t always store a JsonElement underneath. It can also store the raw value. This adds an interesting nuance to the problem: if the JsonNodes are parsed,

```csharp
var jsonA = JsonNode.Parse("4");
var jsonB = JsonNode.Parse("4.0");

Assert.True(jsonA.IsEquivalentTo(jsonB));
```

the assertion passes, because parsing into JsonNode just stores a JsonElement, and the comparison works for that.

So instead of rehashing all of the possibilities of checking strings, booleans, and all of the various numeric types, I figured it’d be simple enough to just .ToString() the node and compare the output. And it worked… until I tried the expression above. For 18 months it’s worked without any problems. Such is software development, I suppose.

It’s fixed now

So now I check explicitly for numeric equality by calling .GetNumber(), which checks all of the various .Net number types and returns a decimal? (null if it’s not a number). There’s a new Json.More.Net package available for those impacted by this (I didn’t receive any reports).

And that’s the story of how creating a new package to support a new JSON functionality showed me how 4 is not always 4.
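For the curious, here’s a minimal sketch (not code from the library) showing where that trailing zero actually lives: decimal stores a scale factor as part of the value, and it survives all the way into the JSON text.

```csharp
using System;
using System.Text.Json.Nodes;

class DecimalScaleDemo
{
    static void Main()
    {
        var a = 4m;
        var b = 4.0m;

        Console.WriteLine(a == b);  // True -- numerically equal

        // The scale (number of decimal places) is part of the value itself.
        // It lives in bits 16-23 of the fourth element returned by GetBits.
        Console.WriteLine((decimal.GetBits(a)[3] >> 16) & 0xFF);  // 0
        Console.WriteLine((decimal.GetBits(b)[3] >> 16) & 0xFF);  // 1

        // That scale flows through ToString() and into the JSON text.
        Console.WriteLine(JsonValue.Create(a).ToJsonString());  // 4
        Console.WriteLine(JsonValue.Create(b).ToJsonString());  // 4.0
    }
}
```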
If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Interpreting JSON Schema Output

Cross-posting from the JSON Schema Blog. I’ve received a lot of questions (and purported bugs) and had quite a few discussions over the past few years regarding JSON Schema output, and by far the most common is, “Why does my passing validation contain errors?” Let’s dig in.

No Problem

Before we get into where the output may be confusing, let’s review a happy path, where either all of the child nodes are valid, so the overall validation is valid, or one or more of the child nodes is invalid, so the overall validation is invalid. These cases are pretty easy to understand, so they serve as a good place to start.

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://json-schema.org/blog/interpreting-output/example1",
  "type": "object",
  "properties": {
    "foo": { "type": "boolean" },
    "bar": { "type": "integer" }
  },
  "required": [ "foo" ]
}
```

This is a pretty basic schema, where this is a passing instance:

```json
{ "foo": true, "bar": 1 }
```

with the output:

```json
{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#",
  "instanceLocation": "",
  "annotations": {
    "properties": [ "foo", "bar" ]
  },
  "details": [
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": true,
      "evaluationPath": "/properties/bar",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/bar",
      "instanceLocation": "/bar"
    }
  ]
}
```

All of the subschema output nodes in /details are valid, and the root is valid, and everyone’s happy. Similarly, this is a failing instance (because bar is a string):

```json
{ "foo": true, "bar": "value" }
```

with the output:

```json
{
  "valid": false,
  "evaluationPath": "",
  "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#",
  "instanceLocation": "",
  "details": [
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": false,
      "evaluationPath": "/properties/bar",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/bar",
      "instanceLocation": "/bar",
      "errors": {
        "type": "Value is \"string\" but should be \"integer\""
      }
    }
  ]
}
```

The subschema output at /details/1 is invalid, and the root is invalid, and while we may be a bit less happy because it failed, we at least understand why.

So is that always the case? Can a subschema that passes validation have failed subschemas? Absolutely!

More Complexity

There are limitless ways that we can create a schema and an instance that pass it while outputting a failed node. Pretty much all of them have to do with keywords that present multiple options (anyOf or oneOf) or conditionals (if, then, and else). These cases, specifically, have subschemas that are designed to fail while still producing a successful validation outcome. For this post, I’m going to focus on the conditional schema below, but the same ideas pertain to schemas that contain “multiple option” keywords.
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://json-schema.org/blog/interpreting-output/example2",
  "type": "object",
  "properties": {
    "foo": { "type": "boolean" }
  },
  "required": ["foo"],
  "if": {
    "properties": {
      "foo": { "const": "true" }
    }
  },
  "then": { "required": ["bar"] },
  "else": { "required": ["baz"] }
}
```

This schema says that if foo is true, we also need a bar property; otherwise, we need a baz property. Thus, both of the following are valid:

```json
{ "foo": true, "bar": 1 }
```

```json
{ "foo": false, "baz": 1 }
```

When we look at the validation output for the first instance, we get output that resembles the happy path from the previous section: all of the output nodes have valid: true, and everything makes sense. However, looking at the validation output for the second instance (below), we notice that the output node for the /if subschema has valid: false. But the overall validation passed.

```json
{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#",
  "instanceLocation": "",
  "annotations": {
    "properties": [ "foo" ]
  },
  "details": [
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": false,
      "evaluationPath": "/if",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/if",
      "instanceLocation": "",
      "details": [
        {
          "valid": false,
          "evaluationPath": "/if/properties/foo",
          "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/if/properties/foo",
          "instanceLocation": "/foo",
          "errors": {
            "const": "Expected \"\\\"true\\\"\""
          }
        }
      ]
    },
    {
      "valid": true,
      "evaluationPath": "/else",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/else",
      "instanceLocation": ""
    }
  ]
}
```

How can this be?

Output Includes Why

Often more important than the simple result that an instance passed validation is why it passed validation, especially if it’s not the expected outcome. In order to support this, it’s necessary to include all relevant output nodes. If we exclude the failed output nodes from the result,

```json
{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#",
  "instanceLocation": "",
  "annotations": {
    "properties": [ "foo" ]
  },
  "details": [
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": true,
      "evaluationPath": "/else",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/else",
      "instanceLocation": ""
    }
  ]
}
```

we see that the /else subschema was evaluated, from which we can infer that the /if subschema MUST have failed. However, we have no information as to why it failed because that subschema’s output was omitted. But looking back at the full output, it’s clear that the /if subschema failed because it expected foo to be true. For this reason, the output must retain the nodes for all evaluated subschemas.

It’s also important to note that the specification states that the if keyword doesn’t directly affect the overall validation result.
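If you want to reproduce output like this yourself with JsonSchema.Net, a minimal sketch looks something like the following. I’m assuming the EvaluationOptions/OutputFormat names match the current API surface; the results object serializes to the output shown in this post.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Nodes;
using Json.Schema;

// A minimal sketch of generating the output above with JsonSchema.Net.
// EvaluationOptions/OutputFormat are assumed to match the current API surface.
var schema = JsonSchema.FromText("""
    {
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "type": "object",
      "properties": { "foo": { "type": "boolean" } },
      "required": ["foo"],
      "if": { "properties": { "foo": { "const": "true" } } },
      "then": { "required": ["bar"] },
      "else": { "required": ["baz"] }
    }
    """);

var instance = JsonNode.Parse("""{ "foo": false, "baz": 1 }""");

var results = schema.Evaluate(instance, new EvaluationOptions
{
    OutputFormat = OutputFormat.Hierarchical  // List is also available
});

// Serializing the results produces JSON in the shape shown above.
Console.WriteLine(JsonSerializer.Serialize(results,
    new JsonSerializerOptions { WriteIndented = true }));
```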
A Note About Format

Before we finish up, there is one other aspect of reading output that can be important: format. All of the above examples use the Hierarchical format (formerly Verbose). However, depending on your needs and preferences, you may want to use the List format (formerly Basic). Here’s the output from the simple schema in List format:

```json
{
  "valid": false,
  "details": [
    {
      "valid": false,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#",
      "instanceLocation": ""
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": false,
      "evaluationPath": "/properties/bar",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/bar",
      "instanceLocation": "/bar",
      "errors": {
        "type": "Value is \"string\" but should be \"integer\""
      }
    }
  ]
}
```

This is easy to read and process because all of the output nodes are on a single level. To find errors, you just need to scan the nodes in /details for any that contain errors. Here’s the output from the conditional schema in List format:

```json
{
  "valid": true,
  "details": [
    {
      "valid": true,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#",
      "instanceLocation": "",
      "annotations": {
        "properties": [ "foo" ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": false,
      "evaluationPath": "/if",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/if",
      "instanceLocation": ""
    },
    {
      "valid": true,
      "evaluationPath": "/else",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/else",
      "instanceLocation": ""
    },
    {
      "valid": false,
      "evaluationPath": "/if/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/if/properties/foo",
      "instanceLocation": "/foo",
      "errors": {
        "const": "Expected \"\\\"true\\\"\""
      }
    }
  ]
}
```

Here, it becomes obvious that we can’t just scan for errors because we have to consider where those errors are coming from. The error in the last output node only pertains to the /if subschema, which (as mentioned before) doesn’t affect the validation result.

Wrap-up

JSON Schema output gives you all of the information that you need in order to know what the validation result is and how an evaluator came to that result. Knowing how to read it, though, takes an understanding of why all the pieces are there. If you have any questions, feel free to ask on the JSON Schema Slack workspace or open a discussion.

All output was generated using my online evaluator at https://json-everything.net/json-schema.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Exploring Code Generation with JsonSchema.Net.CodeGeneration

About a month ago, my first foray into the world of code generation was published with the extension library JsonSchema.Net.CodeGeneration. For this post, I’d like to dive into the process a little to show how it works. Hopefully, this will give better insight into how to use it as well.

This library currently serves as an exploration platform for the JSON Schema IDL Vocab work, which aims to create a new vocabulary designed to help support translating between code and schemas (both ways).

Extracting type information

The first step in the code generation process is determining what the schema is trying to model. This library uses a complex set of mini-meta-schemas to identify supported patterns. A meta-schema is just a schema that validates another schema.

For example, in most languages, enumerations are basically just named constants. The ideal JSON Schema representation of this would be a schema with an enum. So .Net’s System.DayOfWeek enum could be modelled like this:

```json
{
  "title": "DayOfWeek",
  "enum": [ "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday" ]
}
```

To identify this schema as defining an enumeration, we’d need a meta-schema that looks like this:

```json
{
  "type": "object",
  "title": "MyEnum",
  "properties": {
    "enum": {
      "type": "array"
    }
  },
  "required": [ "enum" ]
}
```

However, in JSON Schema, an enum item can be any JSON value, whereas most languages require strings. So we also want to ensure that the values of that enum are strings.

```json
{
  "type": "object",
  "title": "MyEnum",
  "properties": {
    "enum": {
      "items": { "type": "string" }
    }
  },
  "required": [ "enum" ]
}
```

We don’t need to include type or uniqueItems because we know the data is a schema, and its meta-schema (e.g. Draft 2020-12) already has those constraints. We only need to define constraints on top of what the schema’s meta-schema defines.

Now that we have the idea, we can expand this by defining mini-meta-schemas for all of the patterns we want to support. There are some that are pretty easy, only needing the type keyword:

- string
- number
- integer
- boolean

And there are some that are a bit more complex:

- arrays
- dictionaries
- custom objects (inheritable and non-inheritable)

And we also want to be able to handle references. The actual schemas that were used are listed in the docs. As with any documentation, I hope to keep these up-to-date, but short of that, you can always look at the source.

Building type models

Now that we have the different kinds of schemas that we want to support, we need to represent them in a sort of type model from which we can generate code. The idea behind the library was to be able to generate multiple code writers that could support just about any language, so .Net’s type system (i.e. System.Type) isn’t quite the right model. The type model as it stands has the following:

- TypeModel - Serves as a base class for the others while also supporting our simple types. This basically just exposes a type name property.
- EnumModel - Each value has a name and an integer value derived from the item’s index.
- ArrayModel - Exposes a property to track the item type.
- DictionaryModel - Exposes properties to track key and item types.
- ObjectModel - Handles both open and closed varieties. Each property has a name, a type, and whether it can read/write.

Whenever we encounter a subschema or a reference, that represents a new type for us to generate. Lastly, in order to avoid duplication, we set up some equality for type models.
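To make that list a bit more concrete, here’s a rough sketch of what such a model could look like. The names come from the list above; the exact shapes in the library differ, so treat this as illustrative only.

```csharp
using System.Collections.Generic;

// Illustrative shapes only -- the library's real type model classes differ.
public record TypeModel(string TypeName);

// Each enum value's integer comes from its index in the list.
public record EnumModel(string TypeName, IReadOnlyList<string> Values) : TypeModel(TypeName);

public record ArrayModel(string TypeName, TypeModel ItemType) : TypeModel(TypeName);

public record DictionaryModel(string TypeName, TypeModel KeyType, TypeModel ItemType) : TypeModel(TypeName);

public record PropertyModel(string Name, TypeModel Type, bool CanRead, bool CanWrite);

// "Open" maps to the inheritable/non-inheritable distinction mentioned above.
public record ObjectModel(string TypeName, IReadOnlyList<PropertyModel> Properties, bool IsOpen) : TypeModel(TypeName);

// Note: records give member-wise equality for free, but models holding collections
// still need custom equality (as the post mentions) to compare their contents.
```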
With this, all of the types supported by this library can be modelled. As more patterns are identified, this modelling system can be expanded as needed.

Writing code

The final step for code generation is the part everyone cares about: actually writing code. The library defines ICodeWriter, which exposes two methods:

- TransformName() - Takes a JSON string and transforms it into a language-compatible name.
- Write() - Renders a type model into a type declaration in the language.

There’s really quite a bit of freedom in how this is implemented. The built-in C# writer branches on the different model types and has private methods to handle each one.

One aspect of writing types that I hadn’t thought about when I started writing the library was that there’s a difference between writing the usage of a type and writing the declaration of a type. Before, when I thought about code generation, I typically thought it was about writing type declarations: you have a schema, and you generate a class for it. But what I found was that if the properties of an object also use any of the generated types, only the type name needs to be written. For example, for the DayOfWeek enumeration we discussed before, the declaration is

```csharp
public enum DayOfWeek
{
    Sunday,
    Monday,
    Tuesday,
    Wednesday,
    Thursday,
    Friday,
    Saturday
}
```

But if I have an array of them, I need to generate DayOfWeek[], which only really needs the type name. So my writer needs to be smart enough to write the declaration once and write just the name any time it’s used.

There are a couple of other little nuanced behaviors that I added in, and I encourage you to read the docs on the capabilities.

Generating a conclusion

Overall, writing this was an enjoyable experience. I found a simple architecture that seems to work well and is also extensible. My hope is that this library will help inform the IDL Vocab effort back in JSON Schema Land. It’s useful having a place to test things.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

The New JsonSchema.Net

Some changes are coming to JsonSchema.Net: faster validation and fewer memory allocations thanks to a new keyword architecture. The best part: unless you’ve built your own keywords, this probably won’t require any changes in your code.

A new keyword architecture?

For about the past year or so, I’ve had an idea that I’ve tried and failed to implement several times: by performing static analysis of a schema, some work can be performed before ever getting an instance to validate. That work can then be saved and reused across multiple evaluations. For example, with this schema

```json
{
  "type": "object",
  "properties": {
    "foo": { "type": "string" },
    "bar": { "type": "number" }
  }
}
```

we know:

- that the instance must be an object
- if that object has a foo property, its value must be a string
- if that object has a bar property, its value must be a number

These are the constraints that this schema applies to any instance that it validates. Each constraint is comprised of an instance location and a requirement for the corresponding value. What’s more, most of the time, we don’t need the instance to identify these constraints.

This is the basic idea behind the upcoming JsonSchema.Net v5 release. If I can capture these constraints and save them, then I only have to perform this analysis once. After that, it’s just applying the constraints to the instance.

Architecture overview

With the upcoming changes, evaluating an instance against a schema occurs in two phases: gathering constraints, and processing individual evaluations. For the purposes of this post, I’m going to refer to the evaluation of an individual constraint as simply an “evaluation.”

Collecting constraints

There are two kinds of constraints: schema and keyword. A schema constraint is basically a collection of keyword constraints, but it also needs to contain some things that are either specific to schemas, such as the schema’s base URI, or common to all the local constraints, like the instance location. A keyword constraint, in turn, will hold the keyword it represents, any sibling keywords it may have dependencies on, schema constraints for any subschemas the keyword defines, and the actual evaluation function.

I started with just the idea of a generic “constraint” object, but I soon found that the two had very different roles, so it made sense to separate them. I think this was probably the key distinction from previous attempts that allowed me to finally make this approach work.

So for constraints we have a recursive definition that really just mirrors the structural definition represented by JsonSchema and the different keyword classes. The primary difference between the constraints and the structural definition is that the constraints are more generic (implemented by two types) and evaluation-focused, whereas the structural definition is the more object-oriented model and is used for serialization and other things.

Building a schema constraint consists of building constraints for all of the keywords that (sub)schema contains. Each keyword class knows how to build the constraint that should represent it, including getting constraints for subschemas and identifying keyword dependencies. Once we have the full constraint tree, we can save it in the JsonSchema object and reuse that work for each evaluation.
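As a rough mental model (the real v5 types are richer and their members differ), the two constraint kinds described above might be sketched like this:

```csharp
using System;
using System.Collections.Generic;

// A conceptual sketch only; member names and types here are illustrative.
public class SchemaConstraintSketch
{
    public Uri? BaseUri { get; set; }                    // schema-specific information
    public string InstanceLocation { get; set; } = "";   // shared by the local keyword constraints
    public List<KeywordConstraintSketch> Keywords { get; } = new();
}

public class KeywordConstraintSketch
{
    public string Keyword { get; set; } = "";                                   // the keyword this constraint represents
    public List<KeywordConstraintSketch> SiblingDependencies { get; } = new();  // e.g. additionalProperties -> properties
    public List<SchemaConstraintSketch> SubschemaConstraints { get; } = new();  // constraints for any subschemas
    public Action<object>? Evaluate { get; set; }                               // the actual evaluation function (signature elided)
}
```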
Evaluation

Each constraint object produces an associated evaluation object. Again, there are two kinds: one for each kind of constraint.

When constructing a schema evaluation, we need the instance (of course), the evaluation path, and any options to apply during this evaluation. It’s important to recognize that options can change between evaluations; for example, sometimes you may or may not want to validate format. A results object for this subschema will automatically be created. Creating a schema evaluation will also call on the contained keyword constraints to build their evaluations.

To build a keyword evaluation, the keyword constraint is given the schema constraint that’s requesting it, the instance location, and the evaluation path. From that, it can look at the instance, determine if the evaluation even needs to run (e.g. is there a value at that instance location?), and create an evaluation object if it does. It will also create schema evaluations for any subschemas.

In this way, we get another tree: one built for evaluating a specific instance. The structure of this tree may (and often will) differ from the structure of the constraint tree. For example, when building constraints, we don’t know what properties additionalProperties will need to cover, so we build a template from which we can later create multiple evaluations: one for each property. Or maybe properties contains a property that doesn’t exist in the instance; no evaluation is created because there’s nothing to evaluate.

While building constraints only happens once, building evaluations occurs every time JsonSchema.Evaluate() is called.

And there was much rejoicing

This is a lot, and it’s a significant departure from the more procedural approach of previous versions. But I think it’s a good change overall because this new design encapsulates forethought and theory present within JSON Schema and uses that to improve performance.

If you find you’re in the expectedly small group of users writing your own keywords, I’m also updating the docs, so you’ll have some help there. If you still have questions, feel free to open an issue, or you can find me in Slack (link on the repo readme).

I’m also planning a post for the JSON Schema blog which looks at a bit more of the theory of JSON Schema static analysis separately from the context of JsonSchema.Net, so watch for that as well.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

JsonNode's Odd API

```csharp
var array = new JsonArray
{
    ["a"] = 1,
    ["b"] = 2,
    ["c"] = 3,
};
```

This compiles. Why does this compile?! Today we’re going to explore that.

What’s wrong?

In case you didn’t see it, we’re creating a JsonArray instance and initializing it using key/value pairs. But arrays don’t contain key/value pairs; they contain values. Objects contain key/value pairs.

```csharp
var list = new List<int>
{
    ["a"] = 1,
    ["b"] = 2,
    ["c"] = 3,
};
```

This doesn’t compile, as one would expect. So why does JsonArray allow this? Is the collection initializer broken?

Collection initializers

Microsoft actually has some really good documentation on collection initializers, so I’m not going to dive into it here. Have a read through that if you like. The crux of it comes down to when collection initializers are allowed. First, you need to implement IEnumerable<T> and an .Add(T) method (apparently it also works as an extension method). This will enable the basic collection initializer syntax, like

```csharp
var list = new List<int> { 1, 2, 3 };
```

But you can also enable direct-indexing initialization by adding an indexer. This lets us do things like

```csharp
var list = new List<int>(10)
{
    [2] = 1,
    [5] = 2,
    [6] = 3
};
```

More commonly, you may see this used for Dictionary<TKey, TValue> initialization:

```csharp
var dict = new Dictionary<string, int>
{
    ["a"] = 1,
    ["b"] = 2,
    ["c"] = 3,
};
```

But, wait… does that mean that JsonArray has a string indexer?

JsonArray has a string indexer!

It sure does! You can see it in the documentation, right there under Properties.

Why?! Why would you define a string indexer on an array type? Well, they didn’t. They defined it and the integer indexer on the base type, JsonNode, as a convenience for people working directly with the base type without having to cast it to a JsonArray or JsonObject first. But now all of the JsonNode-derived types have both an integer indexer and a string indexer, and it’s really weird. It makes all of this code completely valid:

```csharp
JsonValue number = ((JsonNode)5).AsValue(); // can't cast directly to JsonValue
_ = number[5];      // compiles but will explode
_ = number["five"]; // compiles but will explode

JsonArray array = new() { 0, 1, 2, 3, 4, 5, 6 };
_ = array[5];       // fine
_ = array["five"];  // compiles but will explode

JsonObject obj = new() { ["five"] = 1 };
_ = obj[5];         // compiles but will explode
_ = obj["five"];    // fine
```

Is this useful?

This seems like a very strange API design decision to me. I don’t think I’d ever trust a JsonNode enough to confidently attempt to index it before checking to see if it can be indexed. Furthermore, the process of checking whether it can be indexed can easily result in a correctly-typed variable.

```csharp
if (node is JsonArray array)
    Console.WriteLine(array[5]);
```

This will probably explode because I didn’t check bounds, but from a type safety point of view, this is SO much better. I have no need to access indexed values directly from a JsonNode. I think this API enables programming techniques that are dangerously close to using the dynamic keyword, which should be avoided at all costs.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Correction: JSON Path vs JSON Pointer

In my post comparing JSON Path and JSON Pointer, I made the following claim:

> A JSON Pointer can be expressed as a JSON Path only when all of its segments are non-numeric keys.

Thinking about this a bit more in the context of the upcoming JSON Path specification, I realized that this only considers JSON Path segments that have one selector. If we allow for multiple selectors, and the specification does, then we can write /foo/2/bar as:

```
$.foo[2,'2'].bar
```

Why this works

The /2 segment in the JSON Pointer says

- If the value is an array, choose the item at index 2.
- If the value is an object, choose the value under property “2”.

So to write this as a path, we just need to consider both of these options.

- If the value is an array, we need a [2] to select the item at index 2.
- If the value is an object, we need a ["2"] to select the value under property “2”.

Since the value cannot be both an array and an object, having both of these selectors in a segment [2,"2"] is guaranteed not to cause duplicate selection, and we’re still guaranteed to get a single value.

Caveat

While this path is guaranteed to yield a single value, it’s still not considered a “Singular Path” according to the syntax definition in the specification. I raised this to the team, and we ended up adding a note to clarify.

Summary

A thing that I previously considered impossible turned out to be possible. I’ve added a note to the original post summarizing this as well as linking here.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Parallel Processing in JsonSchema.Net

This post wraps up (for now) the adventure of updating JsonSchema.Net to run in an async context by exploring parallel processing. First, let’s cover the concepts in JSON Schema that allow parallel processing. Then, we’ll look at what that means for JsonSchema.Net as well as my experience trying to make it work.

Part of the reason I’m writing this is to share my experience. I’m also writing this to have something to point at when someone asks why I don’t take advantage of a multi-threaded approach.

Parallelization in JSON Schema

There are two aspects of evaluating a schema that can be parallelized. The first is by subschema (within the context of a single keyword). For those keywords which contain multiple subschemas, e.g. anyOf, properties, etc., their subschemas are independent from each other, and so evaluating them simultaneously won’t affect the others’ outcomes. These keywords then aggregate the results from their subschemas in some way:

- anyOf ensures that at least one of the subschemas passed (logical OR). This can be short-circuited to a passing validation when any subschema passes.
- allOf ensures that all of the subschemas passed (logical AND). This can be short-circuited to a failing validation when any subschema fails.
- properties and patternProperties map subschemas to object-instance values by key and ensure that those values match the associated subschemas (logical AND, but only for those keys which match). These can also be short-circuited to a failing validation when any subschema fails.

The other way schema evaluation can be parallelized is by keyword within a (sub)schema. A schema is built using a collection of keywords, each of which defines a constraint. Those constraints are usually independent (e.g. type, minimum, properties, etc.), however some keywords have dependencies on other keywords (e.g. additionalProperties, contains, else, etc.). By organizing the keywords into dependency groups, and then sorting those groups so that each group’s dependencies are run before the group, we find that the keywords in each group can be run in parallel.

1. Keywords with no dependencies

We start with keywords which have no dependencies:

- type
- minimum/maximum
- allOf/anyOf/not
- properties
- patternProperties
- if
- minContains/maxContains

None of these keywords (among others) have any impact on the evaluation of the others within this group. Running them in parallel is fine. Interestingly, though, some of these, like properties, patternProperties, and if, are themselves dependencies of keywords not in this set.

2. Keywords with only dependencies on independent keywords

Once we have all of the independent keywords processed, we can evaluate the next set of keywords: ones that only depend on the first set.

- additionalProperties (depends on properties and patternProperties)
- then/else (depends on if)
- contains (depends on minContains/maxContains)

Technically, if we don’t mind processing some keywords multiple times, we can run all of the keywords in parallel. For example, we can process then and else in the first set if we process if for each of them. JsonSchema.Net seeks to process each keyword once, so it performs this dependency grouping.

This then repeats, processing only those keywords which have dependencies that have already been processed. In each iteration, all of the keywords in that iteration can be processed in parallel because their dependencies have completed.
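As a sketch of that idea (hypothetical types; not the library’s actual code), the group-by-group evaluation could look like this:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-ins for illustration; JsonSchema.Net's real types differ.
public interface IKeywordHandler
{
    Task EvaluateAsync(EvaluationContext context);
}

public class EvaluationContext { }

public static class GroupedEvaluation
{
    // Each group only depends on keywords in earlier groups, so the groups run
    // serially while the keywords within a group run in parallel.
    public static async Task EvaluateAsync(
        IEnumerable<IReadOnlyList<IKeywordHandler>> dependencyGroups,
        EvaluationContext context)
    {
        foreach (var group in dependencyGroups)
        {
            var tasks = group.Select(keyword => Task.Run(() => keyword.EvaluateAsync(context)));
            await Task.WhenAll(tasks);
        }
    }
}
```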
The last keywords to run are unevaluatedItems and unevaluatedProperties. These keywords are special in that they consider the results of subschemas in any adjacent keywords, such as allOf. That means any keyword, including keywords defined in third-party vocabularies, can be a dependency of these two. Running them last ensures that all dependencies are met.

Parallelization in JsonSchema.Net

For those who wish to see what this ended up looking like, the issue where I tracked this process is here, and the final result of the branch is here. (Maybe someone looking at the changes can find somewhere I went wrong. Additional eyes are always welcome.)

Once I moved everything over to async function calls, I started on the parallelization journey by updating AllOfKeyword for subschema parallelization. In doing this, I ran into my first conundrum.

The evaluation context

Quite a long time ago, in response to a report of high allocations, I updated the evaluation process so that it re-used the evaluation context. Before this change, each subschema evaluation (and each keyword evaluation) would create a new context object based on information in the “current” context, and then the results from that evaluation would be copied back into the “current” context as necessary. The update changed this process so that there was a single context that maintained a series of stacks to track where it was in the evaluation process.

A consequence of this change, however, was that I could only process serially because the context indicated one specific evaluation path at a time. The only way to move into a parallel process (in which I needed to track multiple evaluation paths simultaneously) was to revert at least some of that allocation management, which meant more memory usage again.

I think I figured out a good way to do it without causing too many additional allocations by only creating a new context when multiple branches were possible. That means any keyword that has a single subschema would continue to use the single context, but any place where the process could branch would create new contexts that only held the top layer of the stacks from the parent context.

I updated all of the keywords to use this branching strategy, and it passed the test suite, but for some reason it ran slower.

Sync

| Method   | optimized | Mean     | Error    | StdDev   | Gen0       | Gen1       | Gen2      | Allocated |
|----------|-----------|----------|----------|----------|------------|------------|-----------|-----------|
| RunSuite | False     | 874.0 ms | 13.53 ms | 12.65 ms | 80000.0000 | 19000.0000 | 6000.0000 | 178.93 MB |
| RunSuite | True      | 837.3 ms | 15.76 ms | 14.74 ms | 70000.0000 | 22000.0000 | 8000.0000 | 161.82 MB |

Async

| Method   | optimized | Mean    | Error    | StdDev   | Gen0       | Gen1       | Gen2      | Allocated |
|----------|-----------|---------|----------|----------|------------|------------|-----------|-----------|
| RunSuite | False     | 1.080 s | 0.0210 s | 0.0206 s | 99000.0000 | 29000.0000 | 9000.0000 | 240.26 MB |
| RunSuite | True      | 1.050 s | 0.0204 s | 0.0201 s | 96000.0000 | 29000.0000 | 9000.0000 | 246.53 MB |

Investigating this led to some interesting discoveries.
Async is not always parallel

My first thought was to check whether evaluation was utilizing all of the processor’s cores. So I started up my Task Manager and re-ran the benchmark.

[Image: Performance tab of the Task Manager during a benchmark run.]

One core is pegged out completely, and the others are unaffected. That’s not parallel. A little research later, and it seems that unless you explicitly call Task.Run(), a task will be run on the same thread that spawned it. Task.Run() tells .Net to run the code on a new thread. So I updated all of the keywords again to create new threads.

Things get weird

Before I ran the benchmark again, I wanted to run the test suite to make sure that the changes I made still actually evaluated schemas properly. After all, what good is running really fast if you’re going the wrong direction?

Of the 7,898 tests that I run from the official JSON Schema Test Suite, about 15 failed. That’s not bad, and it usually means that I have some data mixed up somewhere, a copy/paste error, or something like that. Running each test on its own, though, they all passed. Running the whole suite again, 17 would fail. Running all of the failed tests together, they would all pass. Running the suite again… 12 failed.

Each time I ran the full suite, it was a different group of fewer than 20 tests that would fail. And every time, they’d pass if I ran them in isolation or in a smaller group. This was definitely a parallelization problem.

I added some debug logging to see what the context was holding. Eventually, I found that for the failed tests, the instance would inexplicably delete all of its data. Here’s some of that logging:

```
starting /properties - instance root: {"foo":[1,2,3,4]} (31859421)
starting /patternProperties - instance root: {"foo":[1,2,3,4]} (31859421)
returning /patternProperties - instance root: {} (31859421)
returning /properties - instance root: {} (31859421)
starting /additionalProperties - instance root: {} (31859421)
returning /additionalProperties - instance root: {} (31859421)
```

The “starting” line was printed immediately before calling into a keyword’s .Evaluate() method, and the “returning” line was printed immediately afterward. The parenthetical numbers are the hash code (i.e. .GetHashCode()) of the JsonNode object, so you can see that it’s the same object; only the contents are missing.

None of my code edits the instance: all access is read-only. So I have no idea how this is happening.

A few days ago, just by happenstance, this dotnet/runtime PR was merged, which finished off changes in this PR from last year, which resolved multi-threading issues in JsonNode… that I reported! I’m not sure how that slipped by me while working on this. This fix is slated to be included in .Net 8.

I finally figured out that if I access the instance before (or immediately after) entering each thread, then it seems to work, so I set about making edits to do that. If the instance is a JsonObject or JsonArray, I simply access the .Count property. This is the simplest and quickest thing I could think to do. That got all of the tests working.

Back to our regularly scheduled program

With the entire test suite now passing every time I ran it, I wanted to see how we were doing on speed. I once again set up the benchmark and ran it with the Task Manager open.

[Image: Performance tab of the Task Manager during a benchmark run with proper multi-threading.]

The good news is that we’re actually multi-threading now. The bad news is that the benchmark is reporting that the test takes twice as long as synchronous processing and uses a lot more memory.

| Method   | optimized | Mean    | Error    | StdDev   | Gen0        | Gen1       | Allocated |
|----------|-----------|---------|----------|----------|-------------|------------|-----------|
| RunSuite | False     | 1.581 s | 0.0128 s | 0.0120 s | 130000.0000 | 39000.0000 | 299.3 MB  |
| RunSuite | True      | 1.681 s | 0.0152 s | 0.0135 s | 134000.0000 | 37000.0000 | 309.65 MB |

I don’t know how this could be. Maybe touching the instance causes a re-initialization that’s more expensive than I expect. Maybe spawning and managing all of those threads takes more time than the time saved by running the evaluations in parallel. Maybe I’m just doing it wrong.
The really shocking result is that it’s actually slower when “optimized.” That is, taking advantage of short-circuiting when possible by checking for the first task that completed with a result that matched a predicate, and then cancelling the others. (My code for this was basically re-inventing this SO answer.)

Given this result, I just can’t see this library moving into parallelism anytime soon. Maybe once .Net Framework is out of support, and I move it into the newer .Net era (which contains the threading fix) and out of .Net Standard (which won’t ever contain the fix), I can revisit this.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

The "Try" Pattern in an Async World

Something I ran across while converting JsonSchema.Net from synchronous to asynchronous is that the “try” method pattern doesn’t work in an async context. This post explores the pattern and attempts to explain what happens when we try to make the method async.

What is the “try” method pattern?

We’ve all seen various TryParse() methods. In .Net, they’re on pretty much any data type that has a natural representation as a string, typically numbers, dates, and other simple types. When we want to parse that string into the type, we might go for a static parsing method which returns the parsed value. For example,

```csharp
static int Parse(string s) { /* ... */ }
```

The trouble with these methods is that they throw exceptions when the string doesn’t represent the type we want. If we don’t want the exception, we could wrap the Parse() call in a try/catch, but that will incur exception handling costs that we’d like to avoid. The answer is to use another static method that has a slightly different form:

```csharp
static bool TryParse(string s, out int i) { /* ... */ }
```

Here, the return value is a success indicator, and the parsed value is passed as an out parameter. If the parse was unsuccessful, the value in the out parameter can’t be trusted (it will still have a value, though, usually the default for the type).

Ideally, this method does more than just wrapping Parse() in a try/catch for you. Instead, it should reimplement the parsing logic to not throw an exception in the first place. However, calling TryParse() from Parse() and throwing on a failure is the ideal setup for this pair of methods if you want to re-use logic.

This pattern is very common for parsing, but it can be used for other operations as well. For example, JsonPointer.Net uses this pattern for evaluating JsonNode instances because of .Net’s decision to unify .Net-null and JSON-null. There needs to be a distinction between “the value doesn’t exist” and “the value was found and is null,” and a .TryEvaluate() method allows this.

Why would I need to make this pattern async?

As I mentioned in the intro, I came across this when I was converting JsonSchema.Net to async. Specifically, the data keyword implementation uses a set of resolvers to locate the data that is being referenced. Those resolvers implement an interface that defines a .TryResolve() method.

```csharp
bool TryResolve(EvaluationContext context, out JsonNode? node);
```

I have a resolver for JSON Pointers, Relative JSON Pointers, and URIs. Since the entire point of this change was to make URI resolution async, I now have to make this “try” method async.

Let’s make the pattern async

To make any method support async calls, its return type needs to be a Task. In the case of .TryParse(), it needs to return Task<bool>.

```csharp
Task<bool> TryResolve(EvaluationContext context, out JsonNode? node);
```

No problems yet. Let’s go to one of the resolvers and tag it with async so that we can use await for the resolution calls.

Oh… that’s not going to work. Since we can’t have out parameters for async methods, we have two options:

- Implement the method without using async and await.
- Get the value out another way.

I went with the second solution.

```csharp
async Task<(bool, JsonNode?)> TryResolve(EvaluationContext context) { /* ... */ }
```

This works perfectly fine: it gives a success output and a value output. Hooray for tuples in .Net!
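For illustration, calling the tuple-returning version can look like this (placeholder types; the real resolver interface differs):

```csharp
using System.Text.Json.Nodes;
using System.Threading.Tasks;

// Placeholder types for illustration only.
public class EvaluationContext { }

public interface IDataResolver
{
    // The async "try" shape: the success flag and the value come back together.
    Task<(bool Success, JsonNode? Node)> TryResolve(EvaluationContext context);
}

public static class DataResolverExtensions
{
    // Named tuple elements keep the call site reading much like the out-parameter version.
    public static async Task<JsonNode?> ResolveOrNull(this IDataResolver resolver, EvaluationContext context)
    {
        var (success, node) = await resolver.TryResolve(context);
        return success ? node : null;
    }
}
```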
Later, I started thinking about why out parameters are forbidden in async methods.

Why are out parameters forbidden in async methods?

Without going into too much detail, when you have an async method, the compiler is actually doing a few transformations for you. Specifically, it has to transform your method that looks like it’s returning a bool into one that returns a Task<bool>. This async method

```csharp
async Task<bool> SomeAsyncMethod()
{
    // some stuff
    await AnotherAsyncMethod();
    // some other stuff
    return true;
}
```

essentially becomes

```csharp
Task<bool> SomeAsyncMethod()
{
    // some stuff
    return Task.Run(AnotherAsyncMethod)
        .ContinueWith(result =>
        {
            // some other stuff
            return true;
        });
}
```

There are a few other changes and optimizations that happen, but this is the general idea. So when we add an out parameter,

```csharp
Task<bool> SomeAsyncMethod(out int value)
{
    // some stuff
    return Task.Run(AnotherAsyncMethod)
        .ContinueWith(result =>
        {
            // some other stuff
            return true;
        });
}
```

it needs to be set before the method returns. That means it can only be set as part of // some stuff. But in the async version, it’s not apparent that value has to be set before anything awaits, so they just forbid having the out parameter in async methods altogether.

In the context of my .TryResolve() method, I’d have to set the out parameter before I fetch the URI content, but I can’t do that because the URI content is what goes in the out parameter. Given this new information, it seems the first option of implementing the async method without async/await really isn’t an option.

A new pattern

While I found musing over the consequences of out parameters in async methods interesting, I think the more significant outcome from this experience is finding a new version of the “try” pattern.

```csharp
Task<(bool, ResultType)> TrySomethingAsync(InputType input)
{
    // ...
}
```

It’s probably a pretty niche need, but I hope having this in your toolbox helps you at some point.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

JSON Schema, But It's Async

The one thing I don’t like about how I’ve set up JsonSchema.Net is that SchemaRegistry.Fetch only supports synchronous methods. Today, I tried to remedy that. This post is a review of those prospective changes.

For those who’d like to follow along, take a look at the commit that is the fallout of this change. Just about every line in this diff is a required, direct consequence of just making SchemaRegistry.Fetch async.

What is SchemaRegistry?

Before we get into the specific change and why we need it, we need to cover some aspects of dereferencing the URI values of keywords like $ref. The JSON Schema specification states

> … implementations SHOULD NOT assume they should perform a network operation when they encounter a network-addressable URI.

That means that, to be compliant with the specification, we need some sort of registry to preload any documents that are externally referenced by schemas. This text addresses the specification’s responsibility around the many security concerns that arise as soon as you require implementations to reach out to the network. By recommending against this activity, the specification avoids those concerns and passes them on to the implementations that, on their own, wish to provide that functionality.

JsonSchema.Net is one of a number of implementations that can be configured to perform these “network operations” when they encounter a URI they don’t recognize. This is acceptable to the specification because it is opt-in. In JsonSchema.Net this is accomplished using the SchemaRegistry.Fetch property. By not actually defining a method in the library, I’m passing on those security responsibilities to the user.

I actually used to use it to run the test suite. Several of the tests reference external documents through a $ref value that starts with http://localhost:1234/. The referenced documents, however, are just files stored in a dedicated directory in the suite. So in my function, I replaced that URI prefix with the directory, loaded the file, and returned the deserialized schema. Now I just pre-load them all to help the suite run a bit faster.

SchemaRegistry.Fetch is declared as an instance property of type Func<Uri, IBaseDocument?>. Really, this acts as a method that fetches documents that haven’t been pre-registered. Declaring it as a property allows the user to define their own method to perform this lookup. As this function returns an IBaseDocument?, it’s synchronous.

Why would we want this to be async?

The way to perform a network operation in .Net is by creating an HttpClient and calling one of its methods. Funnily, though, all of those methods are… async. One could create a quasi-synchronous method that makes the call and waits for it,

```csharp
IBaseDocument? Download(Uri uri)
{
    using var client = new HttpClient();

    var text = client.GetStringAsync(uri).Result;
    if (text == null) return null;

    return JsonSchema.FromText(text);
}
```

but that isn’t ideal, and, in some contexts, it’s actively disallowed. Attempting to access a task’s .Result in Blazor Web Assembly throws an UnsupportedException, which is why json-everything.net doesn’t yet support fetching referenced schemas, despite it being online, where fetching such documents automatically might be expected.

So we need the SchemaRegistry.Fetch property to support an async method. We need it to be of type Func<Uri, Task<IBaseDocument?>>.
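In code, that change is roughly this (a sketch only; the real declaration’s nullability and any default value may differ):

```csharp
using System;
using System.Threading.Tasks;

// IBaseDocument comes from JsonSchema.Net; stubbed here only so the sketch compiles.
public interface IBaseDocument { }

public class SchemaRegistrySketch
{
    // Before: a synchronous fetch hook.
    public Func<Uri, IBaseDocument?>? FetchSync { get; set; }

    // After: the same hook, but awaitable.
    public Func<Uri, Task<IBaseDocument?>>? Fetch { get; set; }
}
```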
Then our method can look like this:

```csharp
async Task<IBaseDocument?> Download(Uri uri)
{
    using var client = new HttpClient();

    var text = await client.GetStringAsync(uri);
    if (text == null) return null;

    return JsonSchema.FromText(text);
}
```

Making the change

Changing the type of the property is simple enough. However, this means that everywhere the function is called now needs to be within an async method… and those methods also need to be within async methods… and so on. Async propagation is real! In the end, the following public methods needed to be changed to async:

- JsonSchema.Evaluate()
- IJsonSchemaKeyword.Evaluate() and all of its implementations, which is every keyword, including the ones in the vocabulary extensions
- SchemaRegistry.Register()
- SchemaRegistry.Get()
- IBaseDocument.FindSubschema()

The list doesn’t seem that long like this, but there were a lot of keywords and internal methods. The main thing that doesn’t make this list, though, is the tests. Oh my god, there were so many changes in the tests! Even with the vast majority of the over 10,000 tests being part of the JSON Schema Test Suite (which really just has some loading code and a single method), there were still a lot of .Evaluate() calls to update.

Another unexpected impact of this change was in the validating JSON converter from a few posts ago. JsonConverter’s methods are synchronous, and I can’t change them. That means I had to use .Result inside the .Read() method. That means the converter can’t be used in a context where that doesn’t work.

It’s ready…

… but it may be a while before this goes in. All of the tests pass, and I don’t see any problems with it, but it’s a rather large change. I’ll definitely bump major versions for any of the packages that are affected, which is effectively all of the JSON Schema packages. I’ll continue exploring a bit to see what advantages an async context will bring. Maybe I can incorporate some parallelism into schema evaluation. We’ll see.

But really I want to get some input from users. Is this something you’d like to see? Does it feel weird at all to have a schema evaluation be async, even if you know you’re not making network calls? How does this impact your code? Leave some comments below or on this issue with your thoughts.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Numbers Are Numbers, Not Strings

A common practice when serializing to JSON is to encode floating point numbers as strings. This is done any time high precision is required, such as in the financial or scientific sectors. This approach is designed to overcome a flaw in many JSON parsers across multiple platforms, and, in my opinion, it’s an anti-pattern.

Numbers in JSON

The JSON specification (the latest being RFC 8259 as of this writing) does not place limits on the size or precision of numbers encoded into the format. Nor does it distinguish between integers or floating point. That means that if you were to encode the first million digits of π as a JSON number, that precision would be preserved. Similarly, if you were to encode 85070591730234615847396907784232501249, the square of the 64-bit integer limit, it would also be preserved. They are preserved because JSON, by its nature as a text format, encodes numeric values as decimal strings. The trouble starts when you try to get those numbers out via parsing.

It should also be noted that the specification does have a couple paragraphs regarding support for large and high-precision numbers, but that does not negate the “purity” of the format.

> This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.
>
> Note that when such software is used, numbers that are integers and are in the range [-(2**53)+1, (2**53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.

The problem with parsers

Mostly, parsers are pretty good, except when it comes to numbers. An informal, ad-hoc survey conducted by the engineers at a former employer of mine found that the vast majority of parsers in various languages automatically parse numbers into their corresponding double-precision (IEEE 754) floating point representation. If the user of that parsed data wants the value in a more precise data type (e.g. a decimal or bigint), that floating point value is converted into the requested type afterward. But at that point, all of the precision stored in the JSON has already been lost! In order to properly get these types out of the JSON, they must be parsed directly from the text.

My sad attempt at repeating the survey

Perl will at least give you the JSON text for the number if it can’t parse the number into a common numeric type. This lets the consumer handle those cases. It also appears to have some built-in support for bignum.

> A JSON number becomes either an integer, numeric (floating point) or string scalar in perl, depending on its range and any fractional parts.

Javascript actually recommends the anti-pattern for high-precision needs.

> … numbers in JSON text will have already been converted to JavaScript numbers, and may lose precision in the process.
> To transfer large numbers without loss of precision, serialize them as strings, and revive them to BigInts, or other appropriate arbitrary precision formats.

Go (I only researched this online) parses a bigint number as floating point and truncates high-precision decimals. There’s even an alternative parser that behaves the same way.

Ruby only supports integers and floating point numbers.

PHP (search for “Example #5 json_decode() of large integers”) appears to operate similarly to Perl in that it can give output as a string for the consumer to deal with.

.Net actually stores the tokenized value (_parsedData) and then parses it upon request. So when you ask for a decimal (via .GetDecimal()) it actually parses that data type from the source text and gives you what you want. 10pts for .Net! This is why JsonSchema.Net uses decimal for all non-integer numbers. While there is a small sacrifice on range, you get higher precision, which is often more important.

It appears that many languages support dynamically returning an appropriate data type based on what’s in the JSON text (integer vs floating point), which is neat, but then they only go half-way: they only support basic integer and floating point types without any support for high-precision values.

Developers invent a workaround

As is always the case, the developers who use these parsers need to have a solution, and they don’t want to have to build their own parser to get the functionality they need. So what do they do? They create a convention where numbers are serialized as JSON strings any time high precision is required. This way the parser gives them a string, and they can parse that back into a number of the appropriate type however they want.

However, this has led to a multitude of support requests and StackOverflow questions. How do I configure the serializer to read string-encoded numbers? How do I validate string-encoded numbers? When is it appropriate or unnecessary to serialize numbers as strings? And, as we saw with the Javascript documentation, this practice is actually being recommended now!

This is wrong! Serializing numbers as strings is a workaround that came about because parsers don’t do something they should. On the validation question, JSON Schema can’t apply numeric constraints to numbers that are encoded into JSON strings. They need to be JSON numbers for keywords like minimum and multipleOf to work.

Where to go from here

Root-cause analysis gives us the answer: the parsers need to be fixed. They should support extracting any numeric type we want from JSON numbers and at any precision. A tool should make a job easier. However, in this case, we’re trying to drive a screw with a pair of pliers. It works, but it’s not what was intended.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!