
Improving JsonSchema.Net (Part 1)

In the last two posts, I talked about the improvements to JsonPointer.Net and some of the memory management tools I used to enact those improvements. In this post, I’d like to start talking about some of the changes I made to JsonSchema.Net for v7.0.0. Rather than just showing the final result, I’ll be taking you through the journey of changing the code because I think it’s important to share the iterative development process. Designs don’t just come to us complete. We have an idea first, and through trying to implement that idea, we find and work around caveats and gotchas that eventually lead us to the final solution. Results first The benchmark runs through the JSON Schema Test Suite n times. Version n Mean Error StdDev Gen0 Gen1 Allocated v6.1.2 1 412.7 ms 14.16 ms 41.30 ms 27000.0000 1000.0000 82.66 MB v7.0.0 1 296.5 ms 5.82 ms 10.03 ms 21000.0000 4000.0000 72.81 MB v6.1.2 10 1,074.7 ms 22.24 ms 63.82 ms 218000.0000 11000.0000 476.56 MB v7.0.0 10 903.0 ms 17.96 ms 40.91 ms 202000.0000 9000.0000 443.65 MB The improvement isn’t as impressive as with JsonPointer.Net, but I’m still quite happy with it. Interestingly, the JsonPointer.Net improvements didn’t contribute as much to the overall memory usage as I thought they would. I’d say maybe half of the improvement here is just follow-on effect from JsonPointer.Net. The rest is some necessary refactoring and applying the same memory management tricks from the previous post. Target: memory management inside JsonSchema My first process for making improvements was running the test suite with Resharper’s profiler and looking at allocations. There were two areas that were causing the most pain: JsonSchema.PopulateConstraint() JsonSchema.GetSubschemas() & .IsDynamic() PopulateConstraint() The primary source for allocations was from this JsonSchema-private method, which is responsible for actually building out the schema constraint for the JsonSchema instance, including all of the constraints for the keywords and their subschemas. This is the hub for all of the static analysis. In this method, I was allocating several List<T>s and arrays that were only used within the scope of the method and then released. I also relied heavily on LINQ methods to create multiple collections to help me manage which keywords need to be evaluated (based on the schema version and dialect being used). Then I’d run through two loops, one for the keywords to process and one to collect the rest as annotations. To remove these allocations, I used the MemoryPool<T> strategy from the last post. I’ve also combined the two loops. Instead of pre-calculating the lists, I determine which keywords to process individually as I iterate over all of them. There is still a little LINQ to perform some sorting, but I’d rather leave that kind of logic to the framework. What was arguably more concise: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 // Organize the keywords into different categories - a collection per category. // Lots of allocation going on here. var localConstraints = new List<KeywordConstraint>(); var version = DeclaredVersion == SpecVersion.Unspecified ? 
context.EvaluatingAs : DeclaredVersion; var keywords = EvaluationOptions.FilterKeywords(context.GetKeywordsToProcess(this, context.Options), version).ToArray(); var unrecognized = Keywords!.OfType<UnrecognizedKeyword>(); var unrecognizedButSupported = Keywords!.Except(keywords).ToArray(); // Process the applicable keywords (determined by the dialect) // Strangely, this also includes any instances of UnrecognizedKeyword because // annotation collection is its normal behavior foreach (var keyword in keywords.OrderBy(x => x.Priority())) { var keywordConstraint = keyword.GetConstraint(constraint, localConstraints, context); localConstraints.Add(keywordConstraint); } // Collect annotations for the known keywords that don't need to be processed. // We have to re-serialize their values. foreach (var keyword in unrecognizedButSupported) { var typeInfo = SchemaKeywordRegistry.GetTypeInfo(keyword.GetType()); var jsonText = JsonSerializer.Serialize(keyword, typeInfo!); var json = JsonNode.Parse(jsonText); var keywordConstraint = KeywordConstraint.SimpleAnnotation(keyword.Keyword(), json); localConstraints.Add(keywordConstraint); } constraint.Constraints = [.. localConstraints]; is now: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 // Instead of creating lists, we just grab some memory from the pool. using var constraintOwner = MemoryPool<KeywordConstraint>.Shared.Rent(Keywords!.Count); var localConstraints = constraintOwner.Memory.Span; var constraintCount = 0; using var dialectOwner = MemoryPool<Type>.Shared.Rent(); var declaredKeywordTypes = dialectOwner.Memory.Span; var i = 0; // Dialect is determined when the schema is registered (see the next section), // so we know exactly which keyword types to process. if (Dialect is not null) { foreach (var vocabulary in Dialect) { foreach (var keywordType in vocabulary.Keywords) { declaredKeywordTypes[i] = keywordType; i++; } } } declaredKeywordTypes = declaredKeywordTypes[..i]; var version = DeclaredVersion == SpecVersion.Unspecified ? context.EvaluatingAs : DeclaredVersion; // Now we only run a single loop through all of the keywords. foreach (var keyword in Keywords.OrderBy(x => x.Priority())) { KeywordConstraint? keywordConstraint; if (ShouldProcessKeyword(keyword, context.Options.ProcessCustomKeywords, version, declaredKeywordTypes)) { keywordConstraint = keyword.GetConstraint(constraint, localConstraints[..constraintCount], context); localConstraints[constraintCount] = keywordConstraint; constraintCount++; continue; } // We still have to re-serialize values for known keywords. var typeInfo = SchemaKeywordRegistry.GetTypeInfo(keyword.GetType()); var json = JsonSerializer.SerializeToNode(keyword, typeInfo!); keywordConstraint = KeywordConstraint.SimpleAnnotation(keyword.Keyword(), json); localConstraints[constraintCount] = keywordConstraint; constraintCount++; constraint.UnknownKeywords?.Add((JsonNode)keyword.Keyword()); } After these changes, PopulateConstraint() is still allocating the most memory, but it’s less than half of what it was allocating before. One of the breaking changes actually came out of this update as well. IJsonSchemaKeyword.GetConstraint() used to take an IEnumerable<T> of the constraints that have already been processed, but now it takes a ReadOnlySpan<T> of them. 
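Here's a rough before/after sketch of what that looks like on a keyword implementation (parameter names are illustrative; check the v7 interface for the exact shape):

```csharp
// v6.x - the previously-built sibling constraints arrived as an IEnumerable<T>
public KeywordConstraint GetConstraint(SchemaConstraint schemaConstraint,
    IEnumerable<KeywordConstraint> localConstraints,
    EvaluationContext context) { ... }

// v7.0 - same method, but the sibling constraints now arrive as a ReadOnlySpan<T>
public KeywordConstraint GetConstraint(SchemaConstraint schemaConstraint,
    ReadOnlySpan<KeywordConstraint> localConstraints,
    EvaluationContext context) { ... }
```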
This might impact the implementation of a custom keyword, but from my experience with the 93 keywords defined in the solution, it’s likely not going to require anything but changing the method signature since most keywords don’t rely on sibling evaluations. GetSubschemas() & IsDynamic() The second largest contributor to allocations was GetSubschemas(). This was primarily because IsDynamic() called it… a lot. IsDynamic() is a method that walks down into the schema structure to determine whether a dynamic keyword (either $recursiveRef or $dynamicRef) is used. These keywords cannot be fully analyzed statically because, in short, their resolution depends on the dynamic scope, which changes during evaluation and can depend on the instance being evaluated. Juan Cruz Viotti has an excellent post on the JSON Schema blog that covers lexical vs dynamic scope in depth. I definitely recommend reading it. IsDynamic() was a very simple recursive function: 1 2 3 4 5 6 7 private bool IsDynamic() { if (BoolValue.HasValue) return false; if (Keywords!.Any(x => x is DynamicRefKeyword or RecursiveRefKeyword)) return true; return Keywords!.SelectMany(GetSubschemas).Any(x => x.IsDynamic()); } It checks for the dynamic keywords. If they exist, return true; if not, check the keywords’ subschemas by calling GetSubschemas() on each of them. GetSubschemas() is a slightly more complicated method that checks a keyword to see if it contains subschemas and return them if it does. To accomplish this, it used yield return statements, which builds an IEnumerable<T>. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 internal static IEnumerable<JsonSchema> GetSubschemas(IJsonSchemaKeyword keyword) { switch (keyword) { // ItemsKeyword implements both ISchemaContainer and ISchemaCollector, // so it's important to make sure the Schema property is actually not null // even though the interface's nullability indicates that it's not. case ISchemaContainer { Schema: not null } container: yield return container.Schema; break; case ISchemaCollector collector: foreach (var schema in collector.Schemas) { yield return schema; } break; case IKeyedSchemaCollector collector: foreach (var schema in collector.Schemas.Values) { yield return schema; } break; case ICustomSchemaCollector collector: foreach (var schema in collector.Schemas) { yield return schema; } break; } } As implemented, these methods (in my opinion) are quite simple and elegant. However this design has a couple of glaring problems. IsDynamic makes no attempt to cache the result, even though JsonSchema is immutable and the result will never change. While yield return is great for building deferred-execution queries and definitely has its applications (JsonPath.Net actually returns deferred-execution queries), this is not one of those applications, and it does result in considerable memory allocations. I started with GetSubschemas() by converting all of the yield return statements to just collecting the subschemas into a Span<T>. This doesn’t change the method that much, and it’s actually closer to what would have been done before C# had the yield keyword. 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 private static ReadOnlySpan<JsonSchema> GetSubschemas(IJsonSchemaKeyword keyword) { var owner = MemoryPool<JsonSchema>.Shared.Rent(); var span = owner.Memory.Span; int i = 0; switch (keyword) { // ReSharper disable once RedundantAlwaysMatchSubpattern case ISchemaContainer { Schema: not null } container: span[0] = container.Schema; i++; break; case ISchemaCollector collector: foreach (var schema in collector.Schemas) { span[i] = schema; i++; } break; case IKeyedSchemaCollector collector: foreach (var schema in collector.Schemas.Values) { span[i] = schema; i++; } break; case ICustomSchemaCollector collector: foreach (var schema in collector.Schemas) { span[i] = schema; i++; } break; } return i == 0 ? [] : span[..i]; } Then I started to update IsDynamic() to use the refactored GetSubschemas(). (I tried making it iterative instead of recursive, but I couldn’t do that very well without allocations, so I just stuck with the recursion.) As I was working on it, I realized that being able to just get the subschemas of an entire schema would be tidier, so I created that method as well. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 internal ReadOnlySpan<JsonSchema> GetSubschemas() { if (BoolValue.HasValue) return []; var owner = MemoryPool<JsonSchema>.Shared.Rent(); var span = owner.Memory.Span; var i = 0; foreach (var keyword in Keywords!) { foreach (var subschema in GetSubschemas(keyword)) { span[i] = subschema; i++; } } return i == 0 ? [] : span[..i]; } private bool IsDynamic() { if (BoolValue.HasValue) return false; if (_isDynamic.HasValue) return _isDynamic.Value; foreach (var keyword in Keywords!) { if (keyword is DynamicRefKeyword or RecursiveRefKeyword) { _isDynamic = true; return true; } } foreach (var subschema in GetSubschemas()) { if (subschema.IsDynamic()) { _isDynamic = true; return true; } } _isDynamic = false; return false; } This worked… barely. The tests passed, but the memory allocations skyrocketed. My benchmark wouldn’t finish because it ate all of my RAM. Some of you may see why. If you read my last post, I included a warning that Memory<T> is disposable and you need to make sure that you dispose of it. This is how I learned that lesson. My acquisition of the memory (via the .Rent() method) needs to be a using declaration (or block). 1 using var owner = MemoryPool<JsonSchema>.Shared.Rent(); But just making this change made me sad for a different reason: pretty much all of my tests failed. Then I realized the problem: making the memory a using declaration meant that the memory (and the span that comes with it) was released when the method returned. But then I’m returning the span… which was released. That’s generally bad. 1 2 3 4 5 6 7 8 9 10 11 internal ReadOnlySpan<JsonSchema> GetSubschemas() { // ... using var owner = MemoryPool<JsonSchema>.Shared.Rent(); // memory assigned var span = owner.Memory.Span; // ... return i == 0 ? [] : span[..i]; // memory released; what is returned?! } ref structs were introduced partially to solve this problem. Instead of making my method return ref ReadOnlySpan<JsonSchema>, I opted to pass in the owner from the calling method. 1 2 3 4 5 6 7 8 9 10 internal ReadOnlySpan<JsonSchema> GetSubschemas(IMemoryOwner<JsonSchema> owner) { // ... var span = owner.Memory.Span; // ... return i == 0 ? 
[] : span[..i]; } Now the memory is owned by the calling method, which allows that method to read the span’s contents before it’s released. This also had an added benefit that I could just rent the memory once and re-use it each time I called GetSubschemas(). Here are the final methods: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 private bool IsDynamic() { if (BoolValue.HasValue) return false; if (_isDynamic.HasValue) return _isDynamic.Value; foreach (var keyword in Keywords!) { if (keyword is DynamicRefKeyword or RecursiveRefKeyword) { _isDynamic = true; return true; } } // By renting here, we get to read the span before it's released. using var owner = MemoryPool<JsonSchema>.Shared.Rent(); foreach (var subschema in GetSubschemas(owner)) { if (subschema.IsDynamic()) { _isDynamic = true; return true; } } _isDynamic = false; return false; } internal ReadOnlySpan<JsonSchema> GetSubschemas(IMemoryOwner<JsonSchema> owner) { if (BoolValue.HasValue) return []; var span = owner.Memory.Span; // By renting here, we get to read the span before it's released. // We also get to re-use it for each keyword. using var keywordOwner = MemoryPool<JsonSchema>.Shared.Rent(); var i = 0; foreach (var keyword in Keywords!) { foreach (var subschema in GetSubschemas(keyword, keywordOwner)) { span[i] = subschema; i++; } } return i == 0 ? [] : span[..i]; } private static ReadOnlySpan<JsonSchema> GetSubschemas(IJsonSchemaKeyword keyword, IMemoryOwner<JsonSchema> owner) { var span = owner.Memory.Span; int i = 0; switch (keyword) { case ISchemaContainer { Schema: not null } container: span[0] = container.Schema; i++; break; case ISchemaCollector collector: foreach (var schema in collector.Schemas) { span[i] = schema; i++; } break; case IKeyedSchemaCollector collector: foreach (var schema in collector.Schemas.Values) { span[i] = schema; i++; } break; case ICustomSchemaCollector collector: foreach (var schema in collector.Schemas) { span[i] = schema; i++; } break; } return i == 0 ? [] : span[..i]; } These changes basically removed these methods from Resharper’s profiling analysis, meaning they’re not allocating enough to bother reporting! Wrap up During my changes to JsonPointer.Net, I had paused and transitioned to working in this library. This is where I learned the most about using Memory<T>. In the next post, I’ll go over how I de-spaghettified the schema meta-data analysis code. If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Lessons in Memory Management

Last time, I took you through the developer’s journey I had while updating JsonPointer.Net and how taking time to really consider my new architecture resulted in completely overhauling the library a second time before publishing it, which yielded a much better outcome. In this post, I’d like to go over some of the more technical things I learned trying to make the library consume less memory. Please note that what I reveal in this post is not to be taken as expert advice. This is very much a new area of .Net for me, and I still have quite a bit to learn about best practices and the intended use for the memory management tools that .Net provides. Why allocate less memory? Allocating memory is making a request to the system to go out to the heap to find a block of memory that is sufficient for your task. It then has to reserve that memory, which often means negotiating with the OS. Releasing memory (in .Net and other managed frameworks) is eliminating references to an object so that the garbage collector (GC) can identify it as unused but allocated. Then it has to talk with the OS again to let it know that the block of memory is now available. In between those two operations, the GC is doing a lot to ensure that the memory that needs to stay around does so and the memory that can be reclaimed is. The biggest detractor from performance, though, is that in order to do any of this, it has to essentially pause the application. And it does this a lot. All of this takes time. So the general concept is: fewer allocations means less for the GC to do during the pause, which resumes your application faster. The internet is full of “how garbage collection works in .Net” posts, so I’m not going to cover that. The above is a sufficient understanding to convey why allocating less improves performance. What types allocate memory? Most of the types we use allocate memory on the heap. If it’s a class, it lives on the heap. A struct, if it’s just a variable, parameter, or return value, will generally live on the stack, but there are exceptions. A struct as a member of any data that’s on the heap will also be on the heap. Think of an int field inside of a class, like List<T>.Count. Arrays and pretty much all collection types are classes, so they live on the heap, even if they’re comprised of structs. So int[] lives on the heap. This is a typical entry-level .Net developer interview question. When talking about reducing allocations, we’re generally talking about heap allocations because that’s the stuff that the GC has to take time to manage. In my first refactor for JsonPointer.Net, I made JsonPointer a struct, thinking it would allocate less memory by living on the stack. What I failed to realize was that inside the pointer, I was still holding a string (which is a class) and a Range[] (which is also a class). So while the pointer itself lived on the stack, it still contained two fields which pointed to heap memory, and allocating a new JsonPointer still allocated heap memory for the fields. Making the container a struct in order to save an allocation is like taking a spoonful of water out of a rising river in order to prevent a flood, but then advertising that you’re being helpful. Enhancement #1 - Don’t use extra objects As mentioned in the previous post, JsonPointer was implemented as a container of PointerSegments, which itself was a container of strings that also handled JSON Pointer escaping. As far as data is concerned, PointerSegment isn’t adding any value. 
It doesn't collect multiple related pieces of data together; it only has one piece of data. So I've removed it from the model. That means JsonPointer directly holds a collection of strings, and I need to move all of the encoding logic either into JsonPointer or into extension methods. (They're internal, so it doesn't really matter where.) That's pretty much the enhancement: get rid of the parts of your data model that you don't need.

But completely eliminating PointerSegment presents a small problem. JsonPointer declares a .Create() method that takes a parameterized array of segments, and those segments can be either strings or integers, interchangeably.

```csharp
var pointer1 = JsonPointer.Create("foo", 3, "bar");
var pointer2 = JsonPointer.Create(5, "foo", "bar");
```

If C# had union types, I could easily declare the parameter type to be a union of string and int:

```csharp
public static JsonPointer Create(params <string|int>[] segments) { ... }
```

But that's not a thing C# has. I also can't create an implicit conversion from int to string because I don't own either of those types. (Plus, it would perform that conversion everywhere, not just in my method, which would be really bad.) Instead, I kept PointerSegment around. I made it a struct so it doesn't require an allocation, and I defined implicit casts from string and int (the latter just converts the value to a string).

Now, I know what you're thinking. I just wrote this big paragraph about how making JsonPointer a struct didn't make sense because its data lived on the heap, and now I'm doing exactly that. Well… yeah, and I'm doing it on purpose. The string that it carries would have needed to be allocated anyway. If the segment was created from a string, there's no additional allocation; if it was created from an integer, there's a small allocation for the int → string conversion. But once that string is allocated, it's not allocated again later. Further, I can now write my .Create() method to take a parameterized array of PointerSegments, and the compiler will do the work of converting them without an allocation for the segment itself.

```csharp
public static JsonPointer Create(params PointerSegment[] segments) { ... }
```

Enhancement #2 - Building collections (known size)

When we need to build a collection of things in .Net, we typically use:

- something from the System.Collections.Generic namespace, like List<T> or Dictionary<TKey, TValue>
- LINQ operations like .Select(), .Where(), and (one of my favorites) .Join()
- or both

These provide an easy way to build, query, and otherwise manage collections of things. But most of these are implemented as classes, so they live on the heap.

For pointer math (combining pointers / adding segments), I know how many strings I need because each pointer already has an array of strings; I just need to combine those arrays. This means that I can directly allocate the right-sized array and fill it.

```csharp
var newArray = new string[this._segments.Length + other._segments.Length];
```

To fill it, instead of using a for loop, I use the Array.Copy() methods to copy the segments in a couple of chunks.

```csharp
Array.Copy(this._segments, newArray, this._segments.Length);
Array.Copy(other._segments, 0, newArray, this._segments.Length, other._segments.Length);
```

That's it. Honestly, I don't think this really suffers much in terms of readability.
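Pulled out into a self-contained sketch (the real code works against JsonPointer's private fields, so the names here are illustrative), the whole combine operation is just:

```csharp
// Allocate exactly the array we need, then block-copy each pointer's segments into it.
static string[] CombineSegments(string[] left, string[] right)
{
    var newArray = new string[left.Length + right.Length];
    Array.Copy(left, newArray, left.Length);                   // copy the first pointer's segments
    Array.Copy(right, 0, newArray, left.Length, right.Length); // copy the second pointer's segments after them
    return newArray;
}
```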
Here's the LINQ for comparison:

```csharp
var newArray = this._segments.Concat(other._segments).ToArray();
```

While the LINQ is more concise, the array logic still lets you know what's going on while really selling the message that performance is a critical concern. During the journey here, I had initially used the approach in the next section for pointer math. Then I realized that I already knew how many elements I needed, so I switched to stackalloc, wanting to keep building my collection on the stack. Finally, I realized I could just instantiate the array I needed and fill it directly. Development really is a journey; don't be afraid to experiment a bit.

Enhancement #3 - Building collections (unknown size)

During parsing, I need a dynamic collection (meaning I don't know what size it needs to be) in which I can temporarily store segment strings, which means that I can't use an array. But I don't want to allocate a List<string> to hold them, especially since I'm just going to convert that list to an array at the end. What I need here is a pre-allocated array of slots where I can put references to strings. Memory<string> is the tool I want to use in this case, and I can either create a new one or get one out of the memory pool.

```csharp
using var memory = MemoryPool<string>.Shared.Rent();
```

Take notice that the rented memory (the IMemoryOwner<string> returned by Rent()) is disposable. At one point, I didn't have a using declaration and my memory usage went up 20x! Be sure you release this when you're done with it!

The memory I rented exposes a Span<string> (not read-only), and spans are ref structs, so they must live on the stack. They're not allowed on the heap.

```csharp
var span = memory.Memory.Span;
```

While debugging, I discovered that this pre-allocates 512 slots for me to fill, which is very likely way more than I'd ever need. The Rent() method does take an optional size parameter, but it's a minimum size, so I'm not sure whether it ends up allocating less. Regardless, the idea here is that the memory is already allocated (or at least it's allocated once), and I can re-use it when I need to through the memory pool.

Now I have an "array" to fill up, which is just the parsing logic. When I'm done, I just need to cut it down to a right-sized span and create an actual array, leaving the strings, the final array, and the JsonPointer itself as the only allocations.

```csharp
string[] newArray = [.. span[..segmentCount]];
```

No intermediate allocations performed during processing!

Wrap up

These were the big things that helped me make JsonPointer.Net much more memory-efficient. And since JSON Patch and JSON Schema rely on JSON Pointers, those libraries caught the benefit immediately. Next time, I'm going to review some of the additional JsonSchema.Net improvements I made for v7.0.0. If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Better JSON Pointer

This post was going to be something else, and somewhat more boring. Be glad you’re not reading that. In the midst of updating JsonPointer.Net, instead of blindly forging on when metrics looked decent but the code was questionable, I stopped to consider whether I actually wanted to push out the changes I had made. In the end, I’m glad I hesitated. In this post and at least the couple that follow, I will cover my experience trying to squeeze some more performance out of a simple, immutable type. In the before times The JsonPointer class is a typical object-oriented approach to implementing the JSON Pointer specification, RFC 6901. Syntactically, a JSON Pointer is nothing more a series of string segments separated by forward slashes. All of the pointer segments follow the same rule: any tildes (~) or forward slashes (/) need to be escaped; otherwise, just use the string as-is. A class is created to model a segment (PointerSegment), and then another class is created to house a series of them (JsonPointer). Easy. Tack on some functionality for parsing, evaluation, and maybe some pointer math (combining and building pointers), and you have a full implementation. An idea is formed In thinking about how the model could be better, I realized that the class is immutable, and it doesn’t directly hold a lot of data. What if it were a struct? Then it could live on the stack, eliminating a memory allocation. Then, instead of holding a collection of strings, it could hold just the full string and a collection of Range objects could indicate the segments as sort of “zero-allocation substrings”: one string allocation instead of an array of objects that hold strings. This raises a question of whether the string should hold pointer-encoded segments. If it did, then .ToString() could just return the string, eliminating the need to build it, and I could provide new allocation-free string comparison methods that accounted for encoding so that users could still operate on segments. I implemented all of this, and it worked! It actually worked quite well: Version n Mean Error StdDev Gen0 Allocated v4.0.1 1 2.778 us 0.0546 us 0.1025 us 4.1962 8.57 KB v5.0.0 1 1.718 us 0.0335 us 0.0435 us 1.4915 3.05 KB v4.0.1 10 26.749 us 0.5000 us 0.7330 us 41.9617 85.7 KB v5.0.0 10 16.719 us 0.3219 us 0.4186 us 14.8926 30.47 KB v4.0.1 100 286.995 us 5.6853 us 12.5983 us 419.4336 857.03 KB v5.0.0 100 157.159 us 2.5567 us 2.1350 us 149.1699 304.69 KB … for parsing. Pointer math was a bit different: Version n Mean Error StdDev Gen0 Allocated v4.0.1 1 661.2 ns 12.86 ns 11.40 ns 1.1473 2.34 KB v5.0.0 1 916.3 ns 17.46 ns 15.47 ns 1.1120 2.27 KB v4.0.1 10 6,426.4 ns 124.10 ns 121.88 ns 11.4746 23.44 KB v5.0.0 10 9,128.2 ns 180.82 ns 241.39 ns 11.1237 22.73 KB v4.0.1 100 64,469.6 ns 1,309.01 ns 1,093.08 ns 114.7461 234.38 KB v5.0.0 100 92,437.0 ns 1,766.38 ns 1,963.33 ns 111.3281 227.34 KB While the memory allocation decrease was… fine, the 50% run-time increase was unacceptable. I couldn’t figure out what was going on here, so I left it for about a week and started on some updates for JsonSchema.Net (post coming soon). Initially for the pointer math, I was just creating a new string and then parsing that. The memory usage was a bit higher than what’s shown above, but the run-time was almost double. After a bit of thought, I realized I can explicitly build the string and the range array, which cut down on both the run time and the memory, but only so far as what’s shown above. Eureka! 
After a couple days, I finally figured out that by storing each segment, the old way could re-use segments between pointers. Sharing segments helps with pointer math where we’re chopping up and combining pointers. For example, let’s combine /foo/bar and /baz. Under the old way, the pointers for those hold the arrays ['foo', 'bar'] and ['baz']. When combining them, I’d just merge the arrays: ['foo', 'bar', 'baz']. It’s allocating a new array, but not new strings. All of the segment strings stayed the same. Under the new way, I’d actually build a new string /foo/bar/baz and then build a new array of Ranges to point to the substrings. So this new architecture isn’t better after all. A hybrid design I thought some more about the two approaches. The old approach does pointer math really well, but I don’t like that I have an object (JsonPointer) that contains more objects (PointerSegment) that each contain strings. That seems wasteful. Also, why did I make it a struct? Structs should be a fixed size, and strings are never a fixed size (which is a major reason string is a class). Secondly, the memory of a struct should also live on the stack, and strings and arrays (even arrays of structs) are stored on the heap; so really it’s only the container that’s on the stack. A struct just isn’t the right choice for this type, so I should change it back to a class. What if the pointer just held the strings directly instead of having a secondary PointerSegment class? In the old design, PointerSegment handled all of the decoding/encoding logic, so that would have to live somewhere else, but that’s fine. So I don’t need a model for the segments; plain strings will do. Lastly, I could make it implement IReadOnlyList<string>. That would give users a .Count property, an indexer to access segments, and allow them to iterate over segments directly. A new implementation Taking in all of this analysis, I updated JsonPointer again: It’s a class again. It holds an array of (decoded) strings for the segments. It will cache its string representation. Parsing a pointer already has the string; just store it. Constructing a pointer and calling .ToString() builds on the fly and caches. PointerSegment, which had also been changed to a struct in the first set of changes, remains a struct and acts as an intermediate type so that building pointers in code can mix strings and integer indices. (See the .Create() method used in the code samples below.) Keeping this as a struct means no allocations. I fixed all of my tests and ran the benchmarks again: Parsing Count Mean Error StdDev Gen0 Allocated 5.0.0 1 3.825 us 0.0760 us 0.0961 us 3.0823 6.3 KB 5.0.0 10 36.155 us 0.6979 us 0.9074 us 30.8228 62.97 KB 5.0.0 100 362.064 us 6.7056 us 6.2724 us 308.1055 629.69 KB Math Count Mean Error StdDev Gen0 Allocated 5.0.0 1 538.2 ns 10.12 ns 10.83 ns 0.9794 2 KB 5.0.0 10 5,188.1 ns 97.80 ns 104.65 ns 9.7885 20 KB 5.0.0 100 58,245.0 ns 646.43 ns 539.80 ns 97.9004 200 KB For parsing, run time is higher, generally about 30%, but allocations are down 26%. For pointer math, run time and allocations are both down, about 20% and 15%, respectively. I’m comfortable with the parsing time being a bit higher since I expect more usage of the pointer math. Some new toys In addition to the simple indexer you get from IReadOnlyList<string>, if you’re working in .Net 8, you also get a Range indexer which allows you to create a pointer using a subset of the segments. 
This is really handy when you want to get the parent of a pointer 1 2 var pointer = JsonPointer.Create("foo", "bar", 5, "baz"); var parent = pointer[..^1]; // /foo/bar/5 or maybe the relative local pointer (i.e. the last segment) 1 2 var pointer = JsonPointer.Create("foo", "bar", 5, "baz"); var local = pointer[^1..]; // /baz These operations are pretty common in JsonSchema.Net. For those of you who haven’t made it to .Net 8 just yet, this functionality is also available as methods: 1 2 3 var pointer = JsonPointer.Create("foo", "bar", 5, "baz"); var parent = pointer.GetAncestor(1); // /foo/bar/5 var local = pointer.GetLocal(1); // /baz Personally, I like the indexer syntax. I was concerned at first that having an indexer return a new object might feel unorthodox to some developers, but that’s exactly what string does when you use a Range index to get a substring, so I’m fine with it. Wrap up I like where this landed a lot more than where it was in the middle. Something just felt off with the design, and I was having trouble isolating what the issue was. I like that PointerSegment isn’t part of the model anymore, and it’s just “syntax candy” to help build pointers. I really like the performance. I learned a lot about memory management, which will be the subject of the next post. But more than that, I learned that sometimes inaction is the right action. I hesitated, and the library is better for it. If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

JSON Logic Without Models

Holy performance increase, Batman! I recently made an update to JsonLogic.Net that cut run times and memory usage in half! In half?! Yes! Here's the benchmark:

| Method | Count | Mean | Error | StdDev | Gen0 | Allocated |
|--------|-------|------|-------|--------|------|-----------|
| Models | 1 | 1,655.9 us | 26.76 us | 26.28 us | 410.1563 | 838.03 KB |
| Nodes | 1 | 734.5 us | 8.16 us | 7.23 us | 236.3281 | 482.61 KB |
| Models | 10 | 16,269.0 us | 167.06 us | 139.50 us | 4093.7500 | 8380.5 KB |
| Nodes | 10 | 7,210.7 us | 25.26 us | 21.09 us | 2359.3750 | 4826.08 KB |
| Models | 100 | 164,267.3 us | 2,227.54 us | 1,974.66 us | 41000.0000 | 83803.81 KB |
| Nodes | 100 | 72,195.7 us | 139.28 us | 116.30 us | 23571.4286 | 48262.05 KB |

In this table, "Models" is the old way, and "Nodes" is the new way. As you can see, "Nodes" takes less than half as long to run, and it uses just over half the memory.

What do "Models" and "Nodes" represent?

From the initial release of the library, JSON Logic has been represented using its own object model via the Rule abstraction, which results in a large tree structure of strongly typed rules. This is "Models". The benefit of this approach is the strong typing: if you wanted to build some logic in code, you could use the associated builder methods on the static JsonLogic class and not have to worry about getting argument types wrong. However, as you might expect, building out this rule tree means heap allocations, and allocations are, in general, slow.

The "Nodes" approach, introduced with v5.2.0, doesn't use the object model. Instead, the system is stateless. It uses JsonNode to represent the logic, and the system runs "static" handlers depending on which operation key is present. This is the approach that I took with JSON-e, and it worked out so well that I wanted to see where else I could apply it. I've made several attempts at applying this approach to JSON Schema, and while it works, the performance isn't there yet. JSON-e and JSON Logic also share a common basic design: they're both JSON representations of instructions that are processed with some kind of context data.

So no more strong typing?

I think that's where I want to take this library. With all of the soft typing and implicit conversions that JSON Logic uses anyway, I don't think it's going to be much of a problem for users. Even on the JSON Logic playground, you enter your logic and data as JSON and it runs from there. I don't see why this library can't work the same way. I don't really see a reason to need an object model. (And with functional programming on the rise, maybe this stateless approach is the way of the future.)

But ultimately, it comes down to you. Have a play with the new setup. The docs are already updated. I'd like to hear what you think. If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Dropping Project Support for Code Generation

Some time ago, I released my first attempt at code generation from JSON Schemas. However, I’ve decided to deprecate the library in favor of Corvus.JsonSchema. When I created JsonSchema.Net.CodeGeneration, I knew about Corvus.JsonSchema, but I thought it was merely an alternative validator. I didn’t truly understand its approach to supporting JSON Schema in .Net. Today we’re going to take a look at this seeming competitor to see why it actually isn’t one. What is Corvus.JsonSchema? Corvus.JsonSchema is a JSON Schema code generator that bakes validation logic directly into the model. To show this, consider the following schema. 1 2 3 4 5 6 7 8 9 { "type": "object", "properties": { "foo": { "type": "integer", "minimum": 0 } } } As one would expect, the library would generate a class with a single property: int Foo. But it also generates an .IsValid() method that contains all of the validation logic. So if you set model.Foo = -1, the .IsValid() method will return false. However Corvus.JsonSchema has another trick up its sleeve. But before we get into that, it will help to have some understanding of how System.Text.Json’s JsonElement works. A quick review of JsonElement Under the hood, JsonElement captures the portion of the parsed JSON text by using spans. This has a number of follow-on benefits: By avoiding substringing, there are no additional heap allocations. JsonElement can be a struct, which further avoids allocations, because it only maintains references to existing memory. By holding onto the original text, the value can be interpreted different ways. For example, numbers could be read as double or decimal or integer. As an example, consider this string: 1 { "foo": 42, "bar": [ "a string", false ] } Five different JsonElements would be created: top-level object number value under foo array value under bar first element of bar array second element of bar array But the kicker is that everything simply references the original string. Value Backing span top-level object start: 0, length: 44 number value under foo start: 9, length: 2 array value under bar start: 20, length: 21 first element of bar array start: 22, length: 10 second element of bar array start: 34, length: 5 Back to the validator Corvus.JsonSchema builds on this “backing data” pattern that JsonElement establishes. Instead of creating a backing field that is the same type that the property exposes, which is the traditional approach for backing fields, the generated code will use a JsonElement for the backing field while the property is still strongly typed. This means that a model generated by the library can usually be deserialized without any extra allocations, resulting in very high performance! For a much better explanation of what’s going on inside the package than what I can provide, I recommend you watch their showcase video. Keep moving forward Ever since I saw that video, I’ve lamented the fact that it’s only available as a dotnet tool. I’ve always envisioned this functionality as a Roslyn source generator. To that end, I’ve paired with Matthew Adams, one of the primary contributors to Corvus.JsonSchema, as co-mentor on a project proposal for JSON Schema’s submission to Google’s Summer of Code. This project aims to wrap the existing library in an incremental source generator that uses JSON Schema files within a .Net project to automatically generate models at compile time. This is a great opportunity to learn about incremental source generators in .Net and build your open source portfolio. 
If this sounds like a fun project, please make your interest known by commenting on the proposal issue linked above. (Even if it's not accepted by GSoC, we're probably going to do it anyway.) If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

In Pursuit of Native Code

I don’t even know how to begin this post. I don’t think there has been as big an announcement for this project as support for .Net 8 and Native AOT. Yet here we are. HUGE thanks to Jevan Saks for the help on this. This update wouldn’t be possible without him. Saying he coded half of the update would be underselling his contributions! More than code, though, he helped me better understand what all of this AOT stuff is and the APIs that make it work. Additional thanks to Eirik Tsarpalis, who basically is System.Text.Json right now, for helping shed light on intended patterns with JSON serializer contexts. What is Native AOT? Native AOT, or Ahead of Time compilation, is a way to make .Net applications run anywhere using native code, which means they don’t need the runtime to operate. What that means for developers who want to make native apps is generally avoiding dynamically generated code, so mostly no JIT (just-in-time compilation) or reflection that involves generics. You can start to imagine how limiting that can be. It makes things especially difficult for operations like serialization, which traditionally relies heavily on reflection. However, the System.Text.Json team is pretty smart. They’ve figured out that they can use source generation to inspect any code that might be serialized and generate code that stores the type information, all at compile time. But they can’t do that without your help. First, you have to mark your project as AOT-compatible (the source generation stuff can be done outside of AOT). Then you have to set up a serializer context and annotate it with attributes for every type that you expect to serialize. (This is the trigger for the source generation.) Lastly, any usage of a method which uses unsupported reflection will generate a compiler warning, and then you have some attributes that you can use to either pass the warning on to your callers or indicate that you understand the risk. Of course there’s a lot more to understand, and I don’t claim that I do. So go read the .Net docs or a blog post that focuses more on the power of Native AOT to learn more. Why support .Net 8 explicitly? My understanding was that there were a lot of features in .Net 8 that I didn’t have access to when building only to .Net Standard 2.0. Primarily, the compiler only gives the AOT warnings when building for .Net 8. Since that was the goal, it made sense to include the target explicitly. What was unclear to me was that the majority of the features that I wanted to use were actually available through either later versions of the System.Text.Json Nuget package or through Sergio Pedri ’s amazing PolySharp package. I had at some point tried to update to System.Text.Json v7, but I found that a good portion of the tests started failing. I didn’t want to deal with it at the time, so I put it off. Why now? I’ve had a long-standing issue open on GitHub where I considered the possibility of dropping .Net Standard support and moving on to just supporting one of the more modern .Net versions. In that issue, I floated the idea of updating to .Net 6. While that issue languished for almost a year, I had users approach me about supporting features that were only available in later versions of frameworks, which meant that I’d have to multi-target. I’ve multi-targeted in libraries before, and I’ve seen in other libraries the code-reading nightmare that can result from a bunch of compiler directives trying to isolate features that were only available in different .Net versions. 
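To make that concrete, here's a contrived example (not from this codebase) of the kind of conditional compilation that tends to accumulate when multi-targeting without polyfills:

```csharp
#if NET8_0_OR_GREATER
    // newer targets have the single-char Split overload (and plenty of other newer APIs)
    var segments = source.Split('/', StringSplitOptions.RemoveEmptyEntries);
#else
    // .Net Standard 2.0 doesn't have that overload, so fall back to the array form;
    // multiply this by every API gap across every target and it adds up fast
    var segments = source.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
#endif
```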
Trying to read through all of that to parse out what's actually compiling under a given framework target can be tough. The springboard for this effort really came from Jevan's jumping into the deep end and starting the update by creating a PR. This was the kick in the pants I needed.

How did the update go?

When we started working on this update, the first thing we did was multi-target with .Net 8 in all of the libraries; the tests already targeted .Net Core 3.1 and .Net 6, so we added .Net 8 and called it good. We ended up having to drop support for .Net Core 3.1 due to an incompatibility in one of System.Text.Json's dependencies. However, that framework is out of support now, so we figured it was okay to leave it behind.

I set up a feature branch PR with a checklist of things that needed to be done, and we started creating satellite PRs to merge in. We started by updating all of the package references and addressing the immediate warnings that came with the updated framework target (mostly null analysis and the like). In order to avoid collisions in our work, we coordinated our efforts in Slack. There were a few times one of us would need to rebase, but overall it went really well. Then we added <IsAotCompatible> properties to all of the library project files, which gave us our first round of AOT warnings to address.

We went through almost 40 PRs between Jevan and me, incrementally updating toward a final state. There was a lot of experimentation and discussion over patterns, and I learned a lot about the AOT APIs as well as finding solutions to a few pitfalls. I can't tell you how many approaches and workarounds we added only for them to ultimately be removed in favor of something else. But it was part of the learning process, and I don't know that we could have reached the final solution without going through the alternatives.

It wasn't all adding code, though. Some of the functionality, like the JsonNode.Copy() extension method, wasn't needed anymore because the updated System.Text.Json provides a .DeepClone() that does the same job. By the end of it, we were left with just about everything supporting Native AOT. And, mostly thanks to PolySharp, we didn't need to litter the code with compiler directives. (I was even able to remove the dependency on Jetbrains.Annotations!) The only project that explicitly doesn't work in an AOT context is the schema generation, which requires high levels of reflection to operate. (But really, I consider that to be more of a development tool than a production library; it's supposed to give you a start.)

Is there anything to watch out for when updating to the new packages?

I've bumped the major version on all of the libraries. For many of the libraries, that's due to .Net Core 3.1 no longer being supported. Aside from that, it's small things like removing the JsonNode.Copy() extension method I mentioned earlier and removal of obsolete code. I've detailed all of the changes for each library in the release notes, which you can find in the docs. Most notably, if you're not building an AOT-compliant app, you probably won't need to update much, if anything at all.

What's next?

The updated libraries are all available now, so the only thing that's left for this particular update is updating the docs, which I'll probably be working on for the next few weeks. As always, if you have any problems with anything, please feel free to drop into Slack or open an issue in GitHub. Until then, enjoy the update!
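One last practical note: if you're wiring up your own AOT-ready app on top of these libraries, the serializer-context pattern mentioned above looks roughly like this (MyModel and the context name are stand-ins for your own types; this is a minimal sketch of the System.Text.Json source-generation setup, not anything specific to json-everything):

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// A hypothetical model type for the example.
public class MyModel
{
    public string? Name { get; set; }
}

// The source generator fills in this partial class at compile time with
// type info for every type listed in a [JsonSerializable] attribute.
[JsonSourceGenerationOptions(WriteIndented = false)]
[JsonSerializable(typeof(MyModel))]
internal partial class AppSerializerContext : JsonSerializerContext
{
}

// Usage: pass the generated type info instead of letting the serializer reflect.
// var json = JsonSerializer.Serialize(model, AppSerializerContext.Default.MyModel);
```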
If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Why I'm Updating My JSON Schema Vocabularies

Both of the vocabularies defined by json-everything are getting a facelift. The data vocabulary is getting some new functionality. The UniqueKeys vocabulary is being deprecated in favor of the new Array Extensions vocabulary. I’m also doing a bit of reorganization with the meta-schemas, which I’ll get into. Data vocabulary updates The data vocabulary is actually in its second version already. I don’t keep a link to the first version on the documentation site, but the file is still in the GitHub repo. The second version (2022) clarified some things around how URIs were supposed to be resolved, improved how different data sources could be referenced more explicitly, and added support for Relative JSON Pointers. Most importantly, it disallowed the use of Core vocabulary keywords, which had previously allowed the formed schema to behave differently from its host, introducing some security risks. This new version (2023) merely builds on the 2022 version by adding: the optionalData keyword, which functions the same as data except that if a reference fails to resolve that keyword is ignored rather than validation halting. JSON Path references, which can collect data spread over multiple locations within the instance. I think this is really powerful; there’s an example in the spec. Introducing the Array Extensions vocabulary The uniqueKeys keyword needed some updates anyway. It was the first vocabulary extension I wrote, and some of the language updates that I made to the data vocabulary in its second edition never made it over here. But I didn’t just want update language or URIs; I wanted a functional change. However, the keyword itself doesn’t really need to be changed. I think it’s good as it is. So instead, I’m adding a new keyword, which means it can’t just be the “unique keys” vocabulary anymore. It needs a new name that better reflects all of the defined functionality. So I’m deprecating it and replacing it with the new Array Extensions vocabulary, which does two things: cleans up some language around uniqueKeys without changing the functionality. adds the ordering keyword to validate that items in an array are in an increasing or decreasing sequence based on one or more values within each item. Meta-schema rework I’ve recently had a few discussions (here and here) with some JSON Schema colleagues regarding the “proper” way to make a meta-schema for a vocabulary, and it seems my original approach was a little shortsighted. When I created my meta-schemas, I simply created a 2020-12 extension meta-schema. It’s straight-forward and gets the job done, but it’s not very useful if you want to extend 2020-12 with multiple vocabularies, e.g. if you want to use both Data and UniqueKeys. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 { "$id": "https://json-everything.net/meta/data-2022", "$schema": "https://json-schema.org/draft/2020-12/schema", "$vocabulary": { // <core vocabs> "https://json-everything.net/vocabs-data-2022": true }, "$dynamicAnchor": "meta", "title": "Referenced data meta-schema", "allOf": [ // reference the 2020-12 meta-schema { "$ref": "https://json-schema.org/draft/2020-12/schema" } ], "properties": { "data": { // data keyword definition }, "optionalData": { // optionalData keyword definition (it's the same as data) } } } This isn’t wrong, but it could be done better. Instead of having a single meta-schema that both validate the keyword and extends 2020-12 to use the vocabulary, we separate those purposes. (Feels a lot like SRP to me.) 
So now we have a vocabulary meta-schema, which only serves to validate that the keyword values are syntactically correct, and a separate draft meta-schema extension which references it. The new Data vocabulary meta-schema look like this: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 { "$id": "https://json-everything.net/schema/meta/vocab/data-2023", "$schema": "https://json-schema.org/draft/2020-12/schema", "$defs": { "formedSchema": { // data keyword definition } }, "title": "Referenced data meta-schema", "properties": { "data": { "$ref": "#/$defs/formedSchema" }, "optionalData": { "$ref": "#/$defs/formedSchema" } } } The $vocabulary, $dynamicAnchor, and reference to the 2020-12 meta-schema have all been removed as they’re not necessary to validate the syntax of the vocabulary’s keywords. And the new Data 2020-12 extension meta-schema is this: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 { "$id": "https://json-everything.net/schema/meta/data-2023", "$schema": "https://json-schema.org/draft/2020-12/schema", "$vocabulary": { // <core vocabs> "https://docs.json-everything.net/schema/vocabs/data-2023": true }, "$dynamicAnchor": "meta", "title": "Data 2020-12 meta-schema", "allOf": [ { "$ref": "https://json-schema.org/draft/2020-12/schema" }, { "$ref": "https://json-everything.net/schema/meta/vocab/data-2023" } ] } The keyword definition is removed and the vocab meta-schema is referenced. That’s how the 2020-12 meta-schemas did it, and it’s much more reusable this way. The Array Extensions vocabulary meta-schemas are also built this new way. Now, if you want to create a 2020-12 meta-schema that also includes the new Array Extensions vocabulary, you can take the above, change the $id, and add a reference to the Array Vocabulary meta-schema. This approach allows schema authors to more easily mix and match vocabularies as they need for their application. I need validation The new vocabularies are still a work-in-progress, but they’re mostly complete for these versions. I don’t think the Data vocabulary will evolve much more, but I do hope to continue adding to the Array Extensions vocabulary as new functionality is conceived and requested. (There’s actually a really neat concept from Austin Wright, one of the spec authors, regarding patterned item sequence validation.) Questions and comments are welcome in the json-everything Github repository, or leave a comment down below. If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

JSON-e Expressions

JSON-e is a data-structure parameterization system for embedding context in JSON objects. At least that’s how they describe it. My take would be that it’s something of an amalgamation between JSONLogic and Jsonnet. It supports expressions, through which it can do a lot of the logic things that JSON Logic gives you, and it can perform templating and data transformations, giving you a lot of what Jsonnet can do. Their docs are really great, and I recommend reading through those. It’s not long, but it still does a good job of covering what JSON-e can do. I’ve also written some docs for how you can use JSON-e in .Net. This post is going to highlight some of the interesting aspects of the expression syntax that I discovered while implementing it. It’s going to take a bit of setup, which is why this post is a bit longer than some of my others. So grab a drink and get comfy because it’s gonna get fun. A brief introduction to JSON-e To start, let’s cover how JSON-e works at a high level. The idea is pretty simple: you have a template and a context. The context is just a JSON object which contains data that may be referenced from the template. The template is a JSON value which contains instructions. Within those instructions can be expressions stored in JSON strings. These expressions are the focus for this post. JSON-e takes the template and the context (JSON in) and gives you a new value (JSON out). What are expressions? Before we get too deep into the weeds, some basic understanding of expressions is warranted. JSON-e expressions are similar to what you might find in most programming languages, but specifically JS or Python. They take some values and perform some operations on those values in order to get a result. The value space follows the basic JSON type system: objects, arrays, numbers, strings, booleans, and null. You get the basic math operators (+, -, *, /, and ** for exponentiation), comparators (<= and friends), and boolean operators (&& and ||). You also get in for checking the contents of arrays and strings, + can concatenate strings, and you get JSON-Path-like value access (.-properties and [] indexers that can take integers, strings, and slices). Operands which are not values are treated as context accessors. That is, symbols that access data contained in the context you provide. This allows expressions like a.b + c[1] where an expected context object might be something like 1 2 3 4 { "a": { "b": 1 }, "c": [ 4, 5, 6 ] } The context While the context that you initially provide to JSON-e is a mere JSON object, as shown above, during processing the context is so much more. There are some other keys that have default values, and they can be overridden by the object you provide. For instance, the property now is assumed to be the ISO 8601 string of the date/time when evaluation begins. This property is used by the $fromNow operator. The effect is that this property is automatically added to the context so that if the template were to reference it directly, e.g. { "$eval": "now" }, the result would just be the date/time string. However, if you were to include a now property in your context, it would override the implicit value. 1 2 3 4 5 { "a": { "b": 1 }, "c": [ 4, 5, 6 ], "now": "2010-08-12T20:35:40+0000" } Furthermore, other operations, e.g. $let, provide their own context data that can override data in your context. But again, this is only within the scope of the operation. Once you leave that operation, its overrides no longer apply. 
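For example (a small illustration of the scoping, not taken from the JSON-e docs), with a context of { "x": 1, "y": 2 }, the template below evaluates to 12: inside the in clause, the $let's x shadows the context's x, while any expression outside the $let would still see x as 1.

```json
{
  "$let": { "x": 10 },
  "in": { "$eval": "x + y" }
}
```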
The net effect of all of this is that the context is actually a stack of JSON objects. Looking up a value starts at the top and works its way down until the value is found. In this way, you can think of that default now value as being a low-level context object with just the now key/value.

Function support

Expressions also support functions, and you get some handy built-in ones, like min() and uppercase(). Each function declares what it expects for parameters and what its output is. And just like operands for the expression operators, arguments to functions can be just about anything, even full expressions. This enables composing functions and passing context values into functions.

{ "$eval": "min(a + 1, b * 2)" }

with the context

{ "a": 4, "b": 2 }

will result in 4.

Functions as values

This is where it gets really cool. I lied a little above when I said the value space is the JSON data types. Functions are also valid values. This enables passing functions around as data. Many languages have this built in, but it’s not part of JSON. Every implementation will likely be a bit different in how it makes this happen due to the constraints of how JSON is handled in that language, but JSON-e regards this as a very important feature.

For example, I could have the template

{ "$eval": "x(1, 2, 3)" }

In this case, x isn’t defined, and it’s expecting the user to supply the function that should run. The only requirement is that the function must take several numbers as parameters. A context for this template could be something like

{ "x": "min" }

min is recognized as the function of the same name, and so that’s what’s called.

You can also do this

{ "$eval": "[min,max][x](1, 2, 3)" }

with the context

{ "x": 1 }

This would run the max function from the array of functions that starts the expression, giving 3 as the result.

Note that arrays and objects inside expressions aren’t JSON/YAML values, even though it may look like they are. Because their values can be functions or reference the context, they need to be treated as their own thing: expression arrays and expression objects.

In .Net

But, you may think, json-everything is built on top of the System.Text.Json namespace, specifically focusing on JsonNode, and surely you can’t just put a function in a JsonObject, right? Wrong! You can wrap anything you want in a JsonValue using its static .Create() method, which means you can absolutely add a function to a JsonObject!

JSON-e functions are pretty simple: they take a number of JSON parameters and output a single JSON value. They also need to have access to the context. That gives us a signature:

JsonNode? Invoke(JsonNode?[] arguments, EvaluationContext context)

In order to get this stored in a JsonValue, you could just store the delegate directly, but I found that it was more beneficial to create a base class from which each built-in function could derive. Also, in the base class I could define an implicit cast to JsonValue, which enables easily adding functions directly to nodes!

var obj = new JsonObject
{
    ["foo"] = new MinFunction()
};

At certain points in the implementation, when I need to check to see if a value is a function, I do it just like I’m checking for a string or a number:

if (node is JsonValue val && val.TryGetValue(out FunctionDefinition? func))
{
    // ...
}

Embedding functions as data was such a neat idea! JSON-e has a requirement that a function MUST NOT be included as a value in the final output.
It can be passed around between operators during evaluation; it just can’t come out into the final result.

Also, kudos to the System.Text.Json.Nodes design team for allowing JsonValue to wrap anything! I don’t think I’d have been able to support this with my older Manatee.Json models.

Custom functions

What’s more, JSON-e allows custom functions! That is, you can provide your own functions in the context and call those functions from within expressions! You want a modulus function? JSON-e doesn’t provide that out of the box, but it does let you provide it. In this library, it means providing an instance of JsonFunction (following the naming scheme of JsonValue, JsonArray, and JsonObject) along with a delegate that matches the signature from above.

var context = new JsonObject
{
    ["mod"] = JsonFunction.Create((parameters, evalContext) =>
    {
        var a = parameters[0]?.AsValue().GetNumber();
        var b = parameters[1]?.AsValue().GetNumber();

        return a % b;
    })
};

var template = new JsonObject
{
    ["$eval"] = "mod(10, 4)"
};

var result = JsonE.Evaluate(template, context); // 2

Bringing it all together

And finally, the three aspects of JSON-e that I’ve discussed in this post come together in the most beautiful way.

The context is a stack of JSON objects. Functions are values. Custom functions can be conveyed via the context.

JsonPath.Net also supports custom functions in its expressions. To manage custom functions there, the static FunctionRepository class is used. At first, I wanted to use this same approach for JSON-e. But once I figured out how to embed functions in data, I realized that I could just pre-load all of the functions into another layer of the context. Then the context lookup does all of the work for me! So now, when you begin the evaluation, the context actually looks like this:

// top of stack
- <user-provided context>
- { "now": "<evaluation start time>" }
- { "min": <min func as a value>, "max": <max func as a value>, ... }

Figuring this out was the key that unlocked everything else in my mind. How to include functions in a JSON object was the hard part. Once I realized that, the rest just kinda wrote itself.

Introducing JsonE.Net

All of this is to say that I’ve had a fun time bringing JSON-e to .Net and the json-everything project. I’ve learned a lot while building it, including aspects of functional programming, the whole putting-anything-into-JsonValue thing, and new ideas around expression parsing. I’ll definitely be revisiting some of the other libs to see where I can apply my new understanding.

It’s also been great working with the JSON-e folks, specifically Dustin Mitchell, who has been very accommodating and responsive. He’s done well to create an environment where questions, feedback, and contributions are welcome.

This library is now available on Nuget!

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

.Net Decimals are Weird

I’ve discovered another odd consequence of what is probably fully intentional code: 4m != 4.0m. Okay, that’s not strictly true, but it does seem so if you’re comparing the values in JSON.

var a = 4m;
var b = 4.0m;

JsonNode jsonA = a;
JsonNode jsonB = b;

// use .IsEquivalentTo() from Json.More.Net
Assert.True(jsonA.IsEquivalentTo(jsonB)); // fails!

What?! This took me so long to find…

What’s happening (brother)

The main insight is contained in this StackOverflow answer. decimal has the ability to retain significant digits! Even if those digits are expressed in code!!

So when we type 4.0m in C# code, the compiler tells System.Decimal that the .0 is important. When the value is printed (e.g. via .ToString()), even without specifying a format, you get 4.0 back. And this includes when serializing to JSON.

If you debug the code above, you’ll see that a has a value of 4 while b has a value of 4.0, even before it gets to the JsonNode assignments. While this doesn’t affect numeric equality, it could affect equality that relies on the string representation of the number (like in JSON).

How this bit me

In developing a new library for JSON-e support (spoiler, I guess), I found a test that was failing, and I couldn’t understand why. I won’t go into the full details here, but JSON-e supports expressions, and one of the tests has the expression 4 == 3.2 + 0.8. Simple enough, right? So why was I failing this?

When getting numbers from JSON throughout all of my libraries, I chose to use decimal because I felt it was more important to support JSON’s arbitrary precision with decimal’s higher precision rather than using double for a bit more range. So when parsing the above expression, I get a tree that looks like this:

  ==
 /  \
4    +
    / \
  3.2  0.8

where each of the numbers is represented as a JsonNode with a decimal underneath. When the system processes 3.2 + 0.8, it gives me 4.0. As I said before, numeric comparisons between decimals work fine. But in these expressions, == doesn’t compare just numbers; it compares JsonNodes. And it does so using my .IsEquivalentTo() extension method, found in Json.More.Net.

What’s wrong with the extension?

When I built the extension method, I already had one for JsonElement. (It handles everything correctly, too.) However, JsonNode doesn’t always store a JsonElement underneath. It can also store the raw value. This adds an interesting nuance to the problem: if the JsonNodes are parsed,

var jsonA = JsonNode.Parse("4");
var jsonB = JsonNode.Parse("4.0");

Assert.True(jsonA.IsEquivalentTo(jsonB));

the assertion passes because parsing into JsonNode just stores a JsonElement, and the comparison works for that.

So instead of rehashing all of the possibilities of checking strings, booleans, and all of the various numeric types, I figured it’d be simple enough to just .ToString() the node and compare the output. And it worked… until I tried the expression above. For 18 months it worked without any problems. Such is software development, I suppose.

It’s fixed now

So now I check explicitly for numeric equality by calling .GetNumber(), which checks all of the various .Net number types and returns a decimal? (null if it’s not a number). There’s a new Json.More.Net package available for those impacted by this (I didn’t receive any reports).

And that’s the story of how creating a new package to support a new JSON functionality showed me how 4 is not always 4.
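If you want to see the scale-preservation behavior in isolation, with no JSON library involved, here’s a quick standalone sketch (the comments show the output I’d expect; they’re my annotations, not from the original post):

using System;
using System.Text.Json;

decimal a = 4m;
decimal b = 4.0m;

Console.WriteLine(a == b);                      // True  - numeric equality ignores scale
Console.WriteLine(a.ToString());                // "4"   - but the scale is kept...
Console.WriteLine(b.ToString());                // "4.0" - ...all the way into the string representation
Console.WriteLine(JsonSerializer.Serialize(b)); // 4.0   - and it survives JSON serialization too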
If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!

Interpreting JSON Schema Output

Cross-posting from the JSON Schema Blog.

I’ve received a lot of questions (and purported bugs) and had quite a few discussions over the past few years regarding JSON Schema output, and by far the most common question is, “Why does my passing validation contain errors?” Let’s dig in.

No Problem

Before we get into where the output may be confusing, let’s have a review of a happy path, where either all of the child nodes are valid, so the overall validation is valid, or one or more of the child nodes is invalid, so the overall validation is invalid. These cases are pretty easy to understand, so they serve as a good place to start.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://json-schema.org/blog/interpreting-output/example1",
  "type": "object",
  "properties": {
    "foo": { "type": "boolean" },
    "bar": { "type": "integer" }
  },
  "required": [ "foo" ]
}

This is a pretty basic schema, where this is a passing instance:

{ "foo": true, "bar": 1 }

with the output:

{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#",
  "instanceLocation": "",
  "annotations": {
    "properties": [ "foo", "bar" ]
  },
  "details": [
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": true,
      "evaluationPath": "/properties/bar",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/bar",
      "instanceLocation": "/bar"
    }
  ]
}

All of the subschema output nodes in /details are valid, and the root is valid, and everyone’s happy.

Similarly, this is a failing instance (because bar is a string):

{ "foo": true, "bar": "value" }

with the output:

{
  "valid": false,
  "evaluationPath": "",
  "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#",
  "instanceLocation": "",
  "details": [
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": false,
      "evaluationPath": "/properties/bar",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/bar",
      "instanceLocation": "/bar",
      "errors": {
        "type": "Value is \"string\" but should be \"integer\""
      }
    }
  ]
}

The subschema output at /details/1 is invalid, and the root is invalid, and while we may be a bit less happy because it failed, we at least understand why.

So is that always the case? Can a subschema that passes validation have failed subschemas? Absolutely!

More Complexity

There are limitless ways that we can create a schema and an instance that pass it while outputting a failed node. Pretty much all of them have to do with keywords that present multiple options (anyOf or oneOf) or conditionals (if, then, and else). These cases, specifically, have subschemas that are designed to fail while still producing a successful validation outcome. For this post, I’m going to focus on the conditional schema below, but the same ideas pertain to schemas that contain “multiple option” keywords.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://json-schema.org/blog/interpreting-output/example2",
  "type": "object",
  "properties": {
    "foo": { "type": "boolean" }
  },
  "required": [ "foo" ],
  "if": {
    "properties": {
      "foo": { "const": "true" }
    }
  },
  "then": { "required": [ "bar" ] },
  "else": { "required": [ "baz" ] }
}

This schema says that if foo is true, we also need a bar property; otherwise we need a baz property. Thus, both of the following are valid:

{ "foo": true, "bar": 1 }

{ "foo": false, "baz": 1 }

When we look at the validation output for the first instance, we get output that resembles the happy path from the previous section: all of the output nodes have valid: true, and everything makes sense. However, looking at the validation output for the second instance (below), we notice that the output node for the /if subschema has valid: false. But the overall validation passed.

{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#",
  "instanceLocation": "",
  "annotations": {
    "properties": [ "foo" ]
  },
  "details": [
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": false,
      "evaluationPath": "/if",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/if",
      "instanceLocation": "",
      "details": [
        {
          "valid": false,
          "evaluationPath": "/if/properties/foo",
          "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/if/properties/foo",
          "instanceLocation": "/foo",
          "errors": {
            "const": "Expected \"\\\"true\\\"\""
          }
        }
      ]
    },
    {
      "valid": true,
      "evaluationPath": "/else",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/else",
      "instanceLocation": ""
    }
  ]
}

How can this be?

Output Includes Why

Often more important than the simple result that an instance passed validation is why it passed validation, especially if it’s not the expected outcome. In order to support this, it’s necessary to include all relevant output nodes. If we exclude the failed output nodes from the result,

{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#",
  "instanceLocation": "",
  "annotations": {
    "properties": [ "foo" ]
  },
  "details": [
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": true,
      "evaluationPath": "/else",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/else",
      "instanceLocation": ""
    }
  ]
}

we see that the /else subschema was evaluated, from which we can infer that the /if subschema MUST have failed. However, we have no information as to why it failed because that subschema’s output was omitted. But looking back at the full output, it’s clear that the /if subschema failed because it expected foo to be true. For this reason, the output must retain the nodes for all evaluated subschemas.

It’s also important to note that the specification states that the if keyword doesn’t directly affect the overall validation result.
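If you want to reproduce this kind of output locally with this project’s JsonSchema.Net (the library behind the online evaluator linked at the end of this post), a minimal sketch looks something like the following; the schema file name is a placeholder, and the exact error strings may differ between library versions:

using System;
using System.Text.Json;
using System.Text.Json.Nodes;
using Json.Schema;

// load the conditional schema from above (path is a placeholder)
var schema = JsonSchema.FromFile("example2.json");
var instance = JsonNode.Parse("""{ "foo": false, "baz": 1 }""");

// request the full hierarchical output rather than the default flag result
var results = schema.Evaluate(instance, new EvaluationOptions
{
    OutputFormat = OutputFormat.Hierarchical
});

Console.WriteLine(JsonSerializer.Serialize(results, new JsonSerializerOptions { WriteIndented = true }));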
A Note About Format

Before we finish up, there is one other aspect of reading output that can be important: format. All of the above examples use the Hierarchical format (formerly Verbose). However, depending on your needs and preferences, you may want to use the List format (formerly Basic).

Here’s the output from the simple schema in List format:

{
  "valid": false,
  "details": [
    {
      "valid": false,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#",
      "instanceLocation": ""
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": false,
      "evaluationPath": "/properties/bar",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example1#/properties/bar",
      "instanceLocation": "/bar",
      "errors": {
        "type": "Value is \"string\" but should be \"integer\""
      }
    }
  ]
}

This is easy to read and process because all of the output nodes are on a single level. To find errors, you just need to scan the nodes in /details for any that contain errors.

Here’s the output from the conditional schema in List format:

{
  "valid": true,
  "details": [
    {
      "valid": true,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#",
      "instanceLocation": "",
      "annotations": {
        "properties": [ "foo" ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/properties/foo",
      "instanceLocation": "/foo"
    },
    {
      "valid": false,
      "evaluationPath": "/if",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/if",
      "instanceLocation": ""
    },
    {
      "valid": true,
      "evaluationPath": "/else",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/else",
      "instanceLocation": ""
    },
    {
      "valid": false,
      "evaluationPath": "/if/properties/foo",
      "schemaLocation": "https://json-schema.org/blog/interpreting-output/example2#/if/properties/foo",
      "instanceLocation": "/foo",
      "errors": {
        "const": "Expected \"\\\"true\\\"\""
      }
    }
  ]
}

Here, it becomes obvious that we can’t just scan for errors because we have to consider where those errors are coming from. The error in the last output node only pertains to the /if subschema, which (as mentioned before) doesn’t affect the validation result.

Wrap-up

JSON Schema output gives you all of the information that you need in order to know what the validation result is and how an evaluator came to that result. Knowing how to read it, though, takes an understanding of why all the pieces are there.

If you have any questions, feel free to ask on the JSON Schema Slack workspace or open a discussion. All output was generated using my online evaluator https://json-everything.net/json-schema.

If you like the work I put out, and would like to help ensure that I keep it up, please consider becoming a sponsor!