Hello, yes, we're working. Good afternoon, and a big thank you to everyone that's come along; great to see so many people here to learn about turbocharging our .NET code with some high-performance APIs that have actually been available for a while now, but hopefully might be new to some of you in the room. My name is Steve, I'm a Microsoft MVP, blogger and engineer at Elastic. You can find me at @stevejgordon on most of the social media platforms, and I blog at stevejgordon.co.uk. I'll just highlight
this resources link I've got at the bottom there; it will take you to a copy of the slides and also the GitHub repo where the code I'm going to show you is contained, so grab that link now if you can. I'll leave it up for a second, and I'll also show it again at the end for anyone that misses it. I'll talk about the next slide while you're taking photos. So I'm going to briefly set the agenda and set the scene for what we're going to cover today.
The main things I'm going to talk about are: a little bit about what performance actually is, because I want to make sure we're all talking and thinking about the same things as we go through this session; then a little bit about measuring performance in application code and why that's important; then we'll start looking at some of the high-performance APIs you can use for efficient, low-allocation code in .NET, things like Span<T>, ArrayPool, a little bit about System.IO.Pipelines, and at
the end, if I can get through it quickly enough, we'll talk briefly about System.Text.Json as well. So the first thing: what are we talking about when we talk about performance in application code? The first thing I think of, and that most people tend to think of, is just how quickly the code executes: how long it takes for maybe the method, or the entire process you're doing in that codebase, maybe processing a message from a queue. What is the execution time of that code?
It might be measured in a few seconds, it might be milliseconds, it could even be as low as nanoseconds when we start talking about specific portions of code. Typically, if we can make the code run faster it's going to be a better experience for our user base, and it might mean we can get more work done on that machine. That's where throughput comes in as another measure, because this is more about how much of a given task an application can do in a given time frame. So we might be talking about
things like requests per second in an ASP.NET Core application, or maybe how many messages we can process off a service bus in a given frame of time with a given application. This can be tied to execution time: if we can reduce the execution time of our code, sometimes we can increase our throughput. Not always, but they can be linked in that way, and throughput is a good measure that you can also track in production. And then the final aspect, which not everyone thinks about when you start
thinking about performance, is memory allocations. Memory allocations in .NET are very cheap: we have the heap, a pre-allocated block of memory that's managed for us by the runtime, and it's basically just bumping a pointer when we allocate new objects on there, so that's very fast and very efficient. But at some point those objects are no longer used, and a GC, a garbage collection, needs to run to reclaim that memory. Although that's a highly efficient, highly optimized process, and there are a lot of heuristics that go into how
it runs to try and avoid pausing our application for any undue amount of time, it's not entirely free. One of the things we can see in applications that allocate a lot, particularly larger objects, is that those GC pause times start to impact the performance of high-throughput applications. It's also important to point out that performance is contextual; not every application has the same performance requirements and characteristics. A lot of what I'm going to show you today doesn't apply to most applications in terms of the types of code
that you might be writing. But in a few percent of your applications, in high-throughput systems, in areas where you're doing a lot of processing and you need a lot of scale to achieve it, starting to optimize your code can bring down the amount you have to scale your applications, and potentially the costs associated with them. It doesn't mean you need to apply the code examples I'm going to show you today in every scenario. There's also an important trade-off to warn you
about when it comes to some of what I'm going to show you today, and we'll see this when we look at some of the more advanced code samples: there's a bit of a trade-off between performant code and the readability of that code, and this can affect how easy it is to maintain applications. So if you've got a codebase that's changing often and being maintained by lots of different engineers within your business, it might be more important that the code is readable, concise and clear, something they can change very easily without having
to worry about affecting the performance of the application, and so it might be more important that you focus on readability versus trying to highly optimize the codebase for performance. But in those scenarios where you do need the performance, you need to start accepting the trade-off that the code might become more verbose, slightly harder to read, and maybe only your senior engineers can go in there without being scared of that codebase or unsure why it's written that way. Okay, so I wanted to talk about the optimization
cycle. This is how we go about optimizing an application we already have, and the important starting point for any optimization is that we measure first. The reason we measure first is that it gives us guidance as to where we should be focusing our time, and it also gives us our starting measurements so we can see whether the changes we're making are having a positive impact. At a high level this would be things like profiling the application, maybe CPU profiling, to see where the hot paths of your application are under given workloads: what are
the methods that get called most often, and of those, how long are some of them taking to execute? Then you can focus in on the memory side, profiling the memory of those methods to understand where the allocations occur within them. Once you've got that, you can guide where you spend your time, because you want to be focusing on those hot paths; they're the ones that are going to give you the best gain if you can optimize them. Once you've measured and you've got that data, you want to start focusing on
code-level measurements, and this is where we start using things like benchmarking to measure the portions of code we're going to change, to see how those particular portions run. That might be individual methods, or even individual lines of code, that we're focusing in on. Once we've got those starting-point measurements, then we can actually do the optimization work. We want to focus on doing this iteratively, in small changes. If you make a whole heap of changes at once, a lot
of high-performance techniques, and then you take your measurement afterwards, you don't know if all of those changes had a positive impact; whereas if you change individual small pieces and then measure again, you can revalidate your assumptions as you go. So measure, do a small optimization, measure again, check it has the impact you expect, then move on and complete the cycle. This can continue as many times as you need, until you reach whatever goal you set yourself for your optimization, or maybe you think there are no further
gains for a particular method and you want to move on to a different area. It's quite easy to get trapped in optimization, looking for really small gains at the tail end, and sometimes those don't give you a huge amount of value, so try and draw a line once you've achieved a reasonable amount of progress and look at other areas that will perhaps give you bigger wins. So how do we measure application performance? Well, there are several tools readily available to us. If you're using Visual
Studio and you turn on the diagnostic tools when you're debugging, you already have quite a good set of information: you can see when GCs are occurring, how much heap memory you're consuming in an application, and you can take snapshots to see where that memory is being allocated. The only caveat is that you're in debug mode, so you're not necessarily running the highly optimized runtime code; it's a good indicator, but it's not necessarily scientifically accurate for your runtime code. To do more accurate measurements on runtime code
you want to use either the Visual Studio profiling tools, and there's a whole suite of them now that you can use to analyze memory, CPU time, async time, threading and so on, or there's PerfView, which is an extremely advanced, very powerful tool with quite a steep learning curve. I haven't really conquered it yet in my own time; I use it a little bit, then get a bit scared and move to things I'm more familiar with. I find that the JetBrains products, dotTrace and dotMemory, for me give a nice balance of
exactly what I need to see and a nice user interface, and as I say, you get the measurements you need that you can work with. But pick the tooling that works for you and your team. Sometimes, not often, but sometimes, it can be useful to dig into the IL code as well. When you compile C#, in most cases you're compiling to Intermediate Language code that then gets executed via the just-in-time compiler when you run your application, and sometimes you can view that IL
code to get a view of how many IL instructions your code compiles down into. If you can reduce the instructions, often you reduce the execution time, and you can also look out for things like the boxing of value types, which sometimes shows up in the IL. There are tools for this as well, and they're all free, so you can grab them at any time and take a look at the IL code. The important one not to forget is that it's always worth having production metrics and monitoring in place for your applications,
particularly if you're doing optimization on an existing app, because sometimes the optimizations we make in development don't necessarily translate to production, and sometimes they can even regress performance in production, because it's hard to match that production workload in a development environment. Having those measures in place before you start, things like requests per second, heap usage, maybe tracking the GC metrics, means that once you deploy your changes you can check that you see the same sort of improvements you were expecting. I work for Elastic, and
we have a bunch of tools, including the Elastic APM agent and our new distro for OpenTelemetry, that you can use for your application performance monitoring needs, including metrics like the ones I discussed, or you can use any vendor product that gives you the metrics you need to measure your applications. When it comes to the specific code-level measurements, the tool of choice today is BenchmarkDotNet. Benchmarking is measuring small amounts of code and tracking the changes, even down to nanosecond measurements. It can be used for
more macro-level measurements, but as I say, it's designed specifically for micro-benchmarking scenarios and it gives highly precise measurements. You could, in theory, just put a Stopwatch around some code and measure it a few times, and you'd get a rough indicator of how long that code takes to execute, but that isn't very scientific. BenchmarkDotNet, before it starts taking measurements, makes sure the code is jitted to the most optimized JIT version, and it also measures its own overhead so it can remove that from the measurements
it's taking. It will run each benchmark tens or even hundreds of thousands of times so it can get a good statistical average of the data, because if you're measuring down at the nanosecond level, anything else that happens on your machine, like antivirus kicking in or another application starting up, could affect the measurements. BenchmarkDotNet deals with the outliers to get you as accurate a number as it can for what you're measuring. You can do enrichment of the data through things called diagnosers, and there are a bunch of them, so
we can get memory information during the benchmarking process, we can look at the JIT process, we can get a disassembly of the code, and we can look at threading and ETW events as well, so there's a bunch of things you can collect during the benchmarking process if you want them. You can compare your benchmarks across different platforms and architectures, you can try changing GC modes and see how that affects your results, and this is the tool used extensively within Microsoft for their benchmarking of the .NET runtime and ASP.NET Core.
So this is pretty much now the de facto tool for benchmarking .NET code, and it's maintained by a bunch of people, including some who work at Microsoft. This is what a sort of hello-world benchmark app might look like. We have a Program class with a Main method, and that's just invoking the static Run method on BenchmarkRunner, passing in the type that contains the benchmarks. There are several ways you can configure this, including an interactive mode where, when you start it, you can choose which benchmarks to run;
this hello-world example will just run everything within that class. Our benchmark class is defined below; it's a regular class, and I've added an attribute to say that I want to collect memory information through the MemoryDiagnoser, so we can see how much memory is allocated while we're benchmarking. There's a bit of setup code; this isn't stuff I want to benchmark, it's just stuff I need in place to perform the benchmark, so it's done in this case as static data: we create an instance of this name parser and we
create an instance of the string we're going to use. The actual benchmark itself is below; it's a regular method with the Benchmark attribute on it, and the code within it is what we're measuring, what we're going to get the benchmark measurement for. In this case it's the call to the GetLastName method on that name parser; that's what we're measuring, to see what its overhead is with that given input string.
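(For reference, a minimal sketch of what a hello-world benchmark like that might look like; the NameParser class and its GetLastName method here are just stand-ins for the parser shown on the slide.)

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args)
        => BenchmarkRunner.Run<NameParserBenchmarks>();
}

[MemoryDiagnoser] // adds the allocation / GC columns to the report
public class NameParserBenchmarks
{
    // Setup data: not part of what gets measured
    private static readonly NameParser Parser = new NameParser();
    private const string FullName = "Steve J Gordon";

    [Benchmark]
    public string GetLastName() => Parser.GetLastName(FullName);
}

// Hypothetical stand-in for the parser on the slide
public class NameParser
{
    public string GetLastName(string fullName)
    {
        var parts = fullName.Split(' ');
        return parts[^1];
    }
}
```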
So we run this on our machine. We need to run it in a release build, and ideally we want to turn off every other application so it doesn't interfere with the results; if you can disable antivirus, that's even better. It will take a while to run, because as I say it does all of those warm-up phases and then many thousands of runs to make sure it gets you good results, but eventually it gives you this report, and for our single benchmark we can see a few results. The first is that the mean execution time it measured for that
method was 116 ns. We don't know if that's necessarily good or bad, it's pretty fast, but that's our starting-point knowledge about how long that method takes to execute. It also includes some memory information, because I added the MemoryDiagnoser. The first column here gives us an indication, per thousand operations, of how many gen 0 collections this might introduce. Now, this isn't 100% scientific, because the garbage collection process is tweaking itself all the time based on knowledge of your app at runtime, but it's saying it would take around 29,000 operations before
we'd introduce enough gen 0 data to trigger a GC in this application. The important thing here is that we can tell we're not creating any long-lived objects, because there's nothing reported for gen 1 or gen 2, so nothing would be promoted; every allocation from this method is probably short-lived. And we can see the actual allocation amount there, 144 bytes, so this gives us that starting-point knowledge of how many bytes the code allocates in the process of executing.
So let's start looking at some of the high-performance code scenarios, and the first one I want to start with is Span<T>. A question I always like to ask is: how many people in the room have heard of Span<T>? That's a pretty good number, most of you. How many people have used it in a real-world application for something? Far, far fewer, which is also about what I expected, maybe 5 to 10% of the room. Microsoft talked a lot about this type when
they introduced it, and they still do. The reason they talk about it a lot is that it introduced massive improvements to what they could optimize in the .NET runtime itself. It sort of opened the door for some of the work that's happened since the .NET Core 2.1 era, for all of those high-performance things, and if any of you read Stephen Toub's blog posts every year, where he talks about all the high-performance optimizations and where they've reduced memory allocations or increased the performance of code, a lot of
that traces back to the fact that Span<T> exists, because it opens the door for that. But also, not many people have used it, and I think that's partly because Microsoft did caution against most developers needing to use Span in their codebases. That message has been dialled down a bit now, and I agree it shouldn't be the general message. I think it has its uses, and as you'll see in the code, it doesn't necessarily make things overly complex and it can give you quite good wins. So, what is Span<T>?
When Span<T> was introduced, it was built into .NET Core 2.1, so that's about six years ago now; it's been around a while. It was also released as a NuGet package, System.Memory, that you can bring into .NET Framework applications as well. The optimizations aren't quite as efficient on Framework, because there were runtime changes made in .NET Core to get the full performance gains, but it's still pretty good even if you're using Framework apps. Ultimately, what Span does is provide a read-write
view over some contiguous region of memory. The important thing is that it doesn't care where that memory is. It doesn't mind if that memory is on the heap; traditional contiguous regions of memory would be things like strings or arrays, which are a block of memory on the heap, but it can also reference stack-allocated memory, and even unmanaged native memory, all through a single API. And it does this in a type-safe, memory-safe way, so you don't have to think about all of the possible caveats of
doing that and drop down to unsafe code; it's all safe code, and you don't need the unsafe keyword to use it. You can iterate through this contiguous memory, you can index into it, you can parse through it, just as you would with a traditional array, and it's almost as efficient: there's maybe a sub-single-digit-nanosecond overhead to using a span versus a native array, but a lot of work went into making this highly efficient, in .NET Core in particular. So, one of the key operations once
you've got a span is this idea of slicing. Slicing is really just changing the view over an existing block of memory. In this scenario we start by creating a new array of nine elements, then we call AsSpan to get the span representation of that memory, and you'll see it returns a Span<int>. You might be thinking, oh, well, that's an allocation, we've got a problem, but actually Span is a type that's guaranteed never to be heap allocated, and we'll talk a little bit more about why and how. But
the main thing is that there's no heap allocation cost, there's no overhead to creating this span; it's a highly efficient type. Once we've got our span we can do something like slicing, and slicing just takes a start position and, optionally, a length for how we want to slice that data. It returns the same data that already exists in that memory region, just sliced to that new view. I kind of give the analogy of photography here. If any of you take photographs when you visit a new
city: my family and I have been out in Oslo and we've taken some nice pictures from across the water of the Opera House, and then at some point we think, actually, it would be nice to see that a bit closer. One option is that we could walk around the waterfront, get closer and take a nice close-up image, but that takes some time and some effort. Or we can just pinch-zoom on our camera; most of the time that gives us the view we want, and it takes
almost no time at all. That's kind of what slicing is: we've already got the data, we're just changing our view over it. The important thing about slicing is that it's a constant-time, constant-cost operation, O(1), and this means that whether the array has nine elements or nine million elements, slicing takes the same amount of time, because we're not creating any new memory and we're not copying memory; we're just changing a view over it.
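(A rough sketch of that slicing idea, assuming a simple integer array.)

```csharp
// A nine-element array on the heap
int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

Span<int> wholeView = numbers.AsSpan();   // no heap allocation, just a view over the array
Span<int> slice = wholeView.Slice(4, 3);  // start index 4, length 3 -> { 5, 6, 7 }

slice[0] = 50; // a Span<T> is a read-write view, so this mutates numbers[4]
```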
So let's look at optimizing some code. I will warn you, this first example is extremely contrived and trivial, just to get the point across; we'll look at some more real-world code shortly. But imagine that tomorrow, or Monday when you return to the office, your manager comes to you and says: look, if we can develop a method that accepts an array with an even number of elements and returns a quarter of those elements, starting from the middle, we're going to make some money somehow. Probably not, but let's imagine that's the scenario you're given. I'm sure you're all hoping that's the most complex thing you're given
on Monday morning. Many developers, given something like that, might opt to just use LINQ, where you get the existing length, skip halfway into the array, take a quarter of the elements, and then return the array, and job done, right? And that's fine, that's reasonable code for the requirement that's been given. But then our manager comes to us and says, hang on, no, we can make a lot more money if you can make this quicker. So let's go ahead and think about that.
first thing as we learned earlier is that we should measure before we make any assumptions about what we're doing so here this is a slightly more advanced version of uh benchmarking so this is our setup um for our benchmarks so the First new thing on this slide is this uh size parameter um um or size property that has a pram attribute and this has three different values and what this allows us to do is say to benchmark.us it can of often get tricked by just testing one scenario and assuming that's giving us like the coverage
for our entire um possibilities of what could happen in our application so it's best to test around the edges as well so maybe we always assume that our array is going to be about a thousand elements but sometimes it can be a little bit bigger or a little bit smaller so we start to measure both sides just to see if our changes apply in all scenarios so once we've set that up uh we need a way of populating that um array and so this setup method here which is attributed with global setup is basically going
to run once per Benchmark and it just allows us to initialize the array of the right size based on that size property that will be populated by benchmark.us uh fill it up um so none of this contributes to the measurement of the Benchmark it's all setup work the actual Benchmark is very simple we're just testing that link uh code that we had um the only difference from the last example is that we now have this Baseline equals true set on there which says this is our starting point so all other Benchmark measurements will be benchmarked
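(A sketch of what that parameterized setup might look like; the class and method names are made up, but the Params, GlobalSetup and Baseline pieces are the ones described above.)

```csharp
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class QuarterFromMiddleBenchmarks
{
    private int[] _data = Array.Empty<int>();

    // Run every benchmark once for each of these input sizes
    [Params(100, 1_000, 10_000)]
    public int Size { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        // Populate the input array once per benchmark; this isn't measured
        _data = Enumerable.Range(0, Size).ToArray();
    }

    [Benchmark(Baseline = true)]
    public int[] UsingLinq()
        => _data.Skip(_data.Length / 2)
                .Take(_data.Length / 4)
                .ToArray();
}
```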
When we run this we get three results this time, one for each of those sizes, and we can see that, okay, it's about 100 ns for 100 elements and we allocated 224 bytes. It got slower for the next size up and allocated a bit more, and again factors more increase in execution time and allocations as we go up. Not that unexpected: we're creating a new array and filling it up, and the bigger the size, the bigger the array, the more copying needed. So the first theory that
one of our engineers has is: well, I've heard that LINQ can have some overhead, so maybe we shouldn't use LINQ and should try to avoid it. We can do that by just creating a new array of the appropriate size, using something like Array.Copy to copy the memory in from the appropriate point, and then returning the new array. We measure this, and for the first result, 100 elements, there's an 86% reduction in execution time. That sounds good, and it looks like we've almost halved our allocations, so
on that row alone it looks pretty good. But we can see that for 1,000 elements the execution time gain is still there while the allocation gain is far less significant, and as we drop down to 10,000 elements we're now only 1% improved. That's because all we've really shaved off is the 96 bytes of LINQ overhead from the expression being compiled down; everything else is still there, we're still creating an array, we're still having to copy memory, so the allocations remain. So the final option, and this is where
the example is a bit contrived and trivial, assumes that we can change the return type of this method, or introduce a new method that returns a span, and that our caller can work with that. If they can, then we can just drop down to slicing the existing memory to change the view, so that our caller can work with that piece of memory and do some further processing. In this scenario we just call AsSpan and then Slice.
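(A sketch of that span-returning version, assuming the caller can accept a Span<int>; the class and method names are made up.)

```csharp
using System;

public static class MiddleQuarter
{
    // Same data, new view: constant-time, no copy, no new array allocated
    public static Span<int> FromMiddle(int[] input)
        => input.AsSpan().Slice(input.Length / 2, input.Length / 4);
}
```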
When we're using span, on my machine that's down at the nanosecond level on the measurement, a pretty dramatic execution time improvement, and we've not got any allocation cost: we've not created any new array to return, we've just given someone a new view over that existing data. You can see that it's then roughly consistent; when we're down to such small measurements the variation is quite significant, but it's about 6.7 nanoseconds on my machine for that to run. This is where we can see that regardless of the array size that goes in, there's a constant-time, constant-cost operation to slicing that
data and changing the view on it. We can also work with strings using spans. We can call AsSpan on a string literal, or on a variable that points to a string, and that returns to us a ReadOnlySpan<char>: char because strings are made up of characters, so that's the element we're going to look at, and read-only because strings are immutable. If we were given a span that's a read-write view over the memory contained by the string, we could break that immutability of strings and
basically break every app's assumptions everywhere, and probably most apps would fall apart. So the runtime enforces this read-only nature, where we can view the string's memory, we can parse our way through it, we can learn about and process that string's memory, but we can't change the characters. Once we've done that, we could do something like this trivial example: say we wanted to get just the data that represents the surname, we could find the index of the last space in the string and then slice from there.
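(A minimal sketch of that surname example using AsSpan.)

```csharp
// Slice out the surname without allocating any intermediate strings
ReadOnlySpan<char> name = "Steve J Gordon".AsSpan();

int lastSpace = name.LastIndexOf(' ');
ReadOnlySpan<char> surname = name.Slice(lastSpace + 1); // "Gordon" as a view, no new string
```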
Then we're viewing the surname portion, and we could do some further processing on it. Now, on this string it isn't going to give you a huge advantage over using something like string.Split, but on a large string, processing with spans can be really optimal. There are some important limitations of Span worth highlighting. I'll come back to that point I mentioned earlier, where I said that span can never be allocated on the heap. That's done and enforced through this keyword that was introduced in C# 7.2, where we now have this ref struct
type. Regular struct types, as most of you know, will generally end up being stack allocated, but there are various ways they could either be boxed onto the heap, or be contained in an object that's stored on the heap, so that memory ends up on the heap. Ref structs are enforced by the runtime so that that can never happen, and that's important because a span might point at stack-allocated memory, and we don't want the span to outlive the memory it's giving you a view over. There are
various other internal implementation details about why this matters, around struct tearing and the byref-like nature of these types, which could introduce overhead for the GC as well, but by enforcing them as this special ref struct, all those risks go away, and that means the team could optimize the type very well. But it does mean the type cannot be boxed, which can be a problem: it can't be a field in a regular class, because that class is on the heap and its data is therefore also
on the heap. It can't be a field in a regular struct either, because that could be boxed onto the heap. It can't be used as an argument or a local within an async method; this is the one that probably concerns most people, and we'll see how you work around it in a moment. It's important because those async methods end up being recompiled behind the scenes into state machine classes or structs, and those again break the earlier rules, because any of those locals or
arguments end up as fields on those state machines. For a similar reason it can't be captured inside a lambda expression, because of the closure types that get generated; it can't be used as a generic type argument; and there's more. At this point you're probably thinking, well, hang on, this is really limited, where can I actually use this thing? But you can work around most of these, and that's because there's a sister type called Memory<T>, which has a lot of the same APIs and
doesn't have quite the same requirements, so it can live on the heap. It's slightly less versatile than span and slightly less performant; it's defined as a readonly struct, not a ref struct, so doing things like slicing into a Memory<T> is slightly slower compared to span. But at any point, once you've got that Memory<T>, you can call its Span property and get the span representation of the data it's looking at. What this means is, in this scenario, say where
we want this async method to accept a Span<byte>, the compiler is upset and says no, you can't, for the reasons I've already discussed. What we can do is have our signature accept a Memory<byte> instead, and our code can just work with that; it has the same APIs for slicing as we saw with span. Or, if we really do want to highly optimize this code and rely on span, we can create a non-async method that does accept
a Span<byte> and call into that at any point we're outside the async flow of the code, because async is only really needed for IO: you're calling network sockets, or you're reading from files, but as soon as you've got that data buffered into memory in some form, inside a stream or a MemoryStream or something, you can switch to non-async code. In this application the first thought might be that you could create a local by slicing
the data you want to pass in, but you can't; the way we work around it is just passing that slice expression in directly. That's totally fine for the compiler, it can handle all of that, and then all of our optimized code can be span-based in that non-async method.
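(A small sketch of that workaround; the ProcessAsync and Parse method names are hypothetical stand-ins for the real code.)

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class SpanAsyncWorkaround
{
    // Async signature takes Memory<byte>; the span-based work lives in a sync method
    public static async Task ProcessAsync(Stream stream, Memory<byte> buffer)
    {
        int bytesRead = await stream.ReadAsync(buffer);

        // Pass the slice expression straight in; no Span<byte> local crosses an await
        Parse(buffer.Span.Slice(0, bytesRead));
    }

    private static void Parse(ReadOnlySpan<byte> data)
    {
        // ... all the highly optimized, span-based processing lives here ...
    }
}
```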
So I want to put this one into practice with a slightly more real-world example, and I'll just go through this slide first so you know what the example is about. This comes from a previous job, where we had a lot of message- and event-driven microservices processing stuff. In this sample we read a message from an SQS queue, we needed to deserialize it, and then we basically stored a copy of the JSON message onto S3, the Amazon blob store. To form the object key, which is basically the unique file name on there, we used certain properties from the message. I promise we did all of the
things I said: we benchmarked, we measured, we profiled before we did the work, and I'll show you the results afterwards. This isn't the exact code, but it's close to what was in the original application. The first thing is that it accepts this event context, which is basically the deserialized JSON message from the SQS queue. In this example it has a few properties; in the real example it had several hundred, but we were only using a few of them to derive the object key. There was a caveat
in the app at the time that we didn't always guarantee there would be a date on one of the fields, so we had to handle that scenario, and I've kept that in because it keeps the sample a bit more real-world. So basically the original code worked out that if the date was present we were going to have five elements making up the object key, otherwise it would be four. We then created a string array of that number of elements, so we were going to hold five or four strings depending on
that scenario, and then we populated each of those elements in that array of strings using this GetPart method, taking a string. GetPart basically did a quick null-or-empty check on the input, and if it was null or empty it would use "unknown" as the value in its place. We then removed the spaces, and RemoveSpaces down here just used string.Replace to remove a space character and replace it with an underscore. Then the IsPartValid code was basically using a regex match to say, is
this a valid value that can be used in an object key in AWS, and if not we'd use an "invalid" part, otherwise it would return the part. So this built up the array of strings that were going to form the portions of the object key. If the date was present we'd also do a ToString on it to get the string element we wanted in there, and the final element was the message ID with the .json suffix. Then these were joined using string.Join
with the slash character as the separator, and then it was lowercased, and that produced the object key. A lot of you might be able to spot where there would be some intermediate allocations in this code. For most apps that's probably okay, but in our scenario we wanted to try and work around some of those costs. So in this new code, which I'll just scroll through first, the new code is a little more verbose, but not too much more. There's some weird stuff that goes on at
the top. I've updated this demo: at the time we did it in, I think, .NET Core 3.1, but I've recently updated this code for .NET 8, so I'm taking advantage of regex source generators here, which is one of the improvements Stephen Toub has talked about, where the compiler does the regex work upfront as part of the compilation process rather than at runtime through reflection, and that's going to give you quite a good performance gain.
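(A small sketch of what a source-generated regex looks like in .NET 7 and later; the pattern here is just a placeholder, not the real object-key validation rule.)

```csharp
using System.Text.RegularExpressions;

public static partial class ObjectKeyValidation
{
    // The source generator emits the matching code at compile time,
    // instead of interpreting or compiling the pattern via reflection at runtime.
    [GeneratedRegex("^[a-zA-Z0-9_\\-]+$")]
    public static partial Regex ValidPartRegex();
}

// Usage: ObjectKeyValidation.ValidPartRegex().IsMatch(candidate)
```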
The weird stuff here, this static readonly span of char with what looks like an array of characters, is a special optimization the compiler and the runtime have for storing this data in the metadata, in the blob of the binary, so that we don't have any heap allocations. These are basically some of the pre-constructed string values I want to use, like the "invalid" and the "unknown" parts, that we can optimize. The main work happens down here. The first thing that happens is the length is
calculated. I'm using a feature here called string.Create. What this basically lets us do is create, or essentially mutate, the individual characters of the memory of a string during its creation, which allows us to use high-performance techniques like span; it's before the string is returned, which is why we're allowed to mutate that data. We need to know upfront how long the string is going to be, so this CalculateLength method, which I don't need to show you, just does that, and then string.Create takes that length
so it can pre-allocate the memory for the characters. It takes in some state, in this case our event context, and then it takes in this special span action, this key builder action from down below, which is an action of characters and the event context. Ultimately, what this means is that we get access to the span that represents the memory for the string, in a read-write form, along with our data. So what we're doing here is we start at position zero, we do a BuildPart, taking in our
input data for the first piece of the object key, the span we're writing into, and the current position. What BuildPart did is basically the same checks we did in the previous code, but using higher-performance techniques: rather than IsNullOrEmpty we just check whether the length is zero, or, using the MemoryExtensions, whether all of the data is whitespace, and if so we use "unknown", so that "unknown" part data we stored in the blob gets copied into our span. Otherwise
the validation regex gets called, and if the part is valid we use the MemoryExtensions ToLowerInvariant to copy from our input string into the span, lowercasing at the same time. For the span we're writing into, we're using the range operator here to move our way through it, so as we write each part of the object key in, we update our position by adding the length. For the date we use TryFormat, which also accepts a span as the destination,
so this is an optimized version of, essentially, ToString on a date. It takes the format we want to use and returns the number of characters written, which we can use again to update how far we've written into that span. We copy in the message ID, and then finally, in the last five characters, we copy in the .json suffix. So basically all this code has done is move us to using span as a way of writing our string, and as we'll see in the results, this does actually lead to some performance gains.
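(A heavily simplified sketch of that string.Create pattern; the real key builder handled more parts, validation and the optional date, but the shape is the same. The names here are made up for illustration.)

```csharp
using System;

public static class ObjectKeyBuilder
{
    public static string BuildKey(string part, Guid messageId)
    {
        const string suffix = ".json";
        int length = part.Length + 1 + 36 + suffix.Length; // part + '/' + 36-char guid + ".json"

        return string.Create(length, (part, messageId), (span, state) =>
        {
            var position = 0;

            // Lower-case the first part directly into the new string's memory
            position += state.part.AsSpan().ToLowerInvariant(span);

            span[position++] = '/';

            // TryFormat writes the GUID straight into the span, no intermediate string
            state.messageId.TryFormat(span.Slice(position), out int written, "D");
            position += written;

            // Finally copy in the ".json" suffix
            suffix.AsSpan().CopyTo(span.Slice(position));
        });
    }
}
```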
In the original code this was taking around 300 ns to run, and you can see we've made a 26% improvement on that by moving to this. Now, we weren't really looking for execution time gains, we were mostly focused on allocations, and we can see that the allocation reduction here is about 74%, from 728 bytes down to 192. You might be thinking, is that really useful? But at scale this was doing around 18 million messages a day, so that's 10 gig of allocations in this service, just for
this piece of code alone, that we dropped out by doing this, and that actually meant we did see a reduction in gen 0 collections from the garbage collector. The actual benefits were far better at the time we did this on 3.1: I think the performance improvement was near 80% on the execution time, but .NET has improved so much since then that the gains in this benchmark are starting to go away, so you don't necessarily have to change your code as much to get the gains from .NET as you did when we wrote that original code.
The next type I want to look at is the ArrayPool. The name kind of gives it away: it's a pool of arrays for reuse. In a lot of applications you might have scenarios where you're allocating a short-lived array for some kind of buffer, particularly if you're using the streaming APIs, where you need these buffers to pass in, but it can happen all over codebases where you just need a block of bytes or a block of characters and you're
doing some work with that data, ultimately producing a string or passing the data somewhere else. By using an ArrayPool we get the advantage of avoiding those short-lived allocations, and the pool means we're going to use arrays that are shared amongst other areas of the application. This basically means we can amortize the cost of those short-lived allocations. You do end up with slightly more long-lived objects, which will eventually end up in gen 2 because they'll be held by the pool
and reused, but in most applications that will be worthwhile because you'll be seeing far fewer short-lived allocations. This is found in the System.Buffers namespace. There's a shared implementation, which is the one you're recommended to use for most scenarios, so you call ArrayPool<T>.Shared and then you rent from it an array of the length you want. The important thing to tell you is that you're likely to get an array that's larger than that length, which might at first sound a
bit weird, but that's kind of how the array pool is optimized. If it could return you an array of any length, it would be pooling a lot of arrays that would never be reused across other parts of the app, so it has different bucket sizes of arrays, and it's going to try and find the most suitable one to hand out to you. When we're done with it, we return the array to the pool, and we can optionally clear it; by default the array isn't
cleared, which is again something to be aware of, so any data you've written into that array will be visible to whoever rents it next. It also means that when you're renting arrays you need to be aware of this, because you don't necessarily want to read all of the array's data assuming you wrote it. So when you work with arrays rented from the pool, you do need to track how much data you've written into them, how many elements from zero you've
actually filled, and then typically you're going to slice to that position when you process the data. That ensures your codebase is only ever reading the portion of that rented array that you know is your own data. So that's an important warning that can trip a few people up, but otherwise switching to ArrayPool is pretty trivial. In this again very simplified example, we have a method that's going to get called very often by some of our code, maybe from a loop
somewhere further up, and in there it needs to create a byte array of a thousand elements that it passes into another method, which does some processing using that buffer. Now, this code obviously allocates that array every time, so if it is called from a tight loop we're going to create these short-lived thousand-byte arrays fairly regularly. We can switch away from that by instead using the shared array pool and renting, so in this case we rent a thousand. Because of the way the shared pool is implemented, I know that
the first bucket size that could possibly fulfil that need is going to be 1,024 elements long, but it might be even bigger than that: if the next size up happens to have a free array and the 1,024 bucket is already empty in the array pool, it might give you a larger one. So we need to make sure that in our code further down we are tracking our position within that array, and either slicing it or processing only the portion we're writing into. We also need to make sure
that we return it, because otherwise what's the point of having the ArrayPool if arrays are never returned to the pool? Typically the pattern for doing this is a try/finally block, to make sure we always return it.
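(The rent/return pattern described here looks roughly like this; a sketch, with the buffer contents faked.)

```csharp
using System;
using System.Buffers;

public static class BufferExample
{
    public static void Process()
    {
        // We need ~1,000 bytes; the pool may hand back a larger array (often 1,024)
        byte[] buffer = ArrayPool<byte>.Shared.Rent(1000);
        try
        {
            // Pretend we filled the first 100 bytes of the rented array
            int bytesWritten = 100;

            // Only ever work with the portion we actually wrote into
            ReadOnlySpan<byte> ourData = buffer.AsSpan(0, bytesWritten);
            Console.WriteLine($"Processing {ourData.Length} bytes");
        }
        finally
        {
            // Always hand the array back, otherwise pooling gains nothing
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```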
The next type I want to touch on, and I won't go too deep into this because it's a little bit more niche, is System.IO.Pipelines. This was introduced originally by the ASP.NET Core team. Their main scenario is that, in their Kestrel web server for ASP.NET Core, they're receiving a lot of bytes off the wire, and those have to be parsed for the HTTP content to then be processed further within ASP.NET Core. Traditionally that would go through a number of intermediary streams within ASP.NET Core. The team worked out that they could optimize all of that stream usage to try and remove a lot of the overhead and the allocations, but it was a lot of complex code, so they packaged it up into this
Pipelines package, which in their scenario, at least for ASP.NET Core, was giving them around a 2x performance improvement over using traditional streams in the traditional way. As I say, it's packaged up for us, so we can reuse that same sort of improvement if we're streaming data, without having to write that code ourselves. We could technically write it, but it's a lot of boilerplate, and it's just available in this package. The main difference versus using streams yourself for this is that with streams, you
create the buffers, you create the memory you pass into them, and then the data is handled within those streams. A pipeline does all of the buffer management for you, so it can optimize that: it uses the ArrayPool to get that memory for you, and it ensures it's all returned as soon as possible behind the scenes. As you might assume, there are two ends to a pipe: a writing end and a reading end. On the writing side what we get is the PipeWriter,
and because we're likely to be in an async method, reading off a network or off a file or something, we get a Memory<byte>, but you can use the same techniques I showed you earlier to switch to non-async, span-based code for filling it if you want to. Once you've written some data in, you advance by the number of bytes that have been written, and then you flush in an async way, and that flushes the data into the pipe. On the reading end, the reader has this ReadAsync method that
we can call, and this is a non-blocking way of saying: as soon as there's data available in this pipe, give me the data that's been written, or flushed in, so that I can start processing it. That way we can very efficiently stream through this data, with both ends reading and writing at the same time. The ReadAsync method returns a ReadResult, and on the ReadResult we can access this thing called the buffer, which gives us a ReadOnlySequence<byte>. So why a ReadOnlySequence<byte>
on the way out, but a Memory<byte> on the way in? The main reason is that the pipe doesn't know in advance how much data you're going to give it. It's going to rent some memory from the array pool, pick a block of memory that's available and start filling it, but you might stream 100 bytes or you might stream 100,000 bytes off the network, and at some point that little block of rented memory could fill up. Internally, what
this looks like is that on the writing side the pipe just keeps renting these blocks and returning them to you as Memory<T> that you fill up, but it handles all of that internally, and then on the reading end what it gives you is essentially a linked list of those blocks of memory, referred to as this ReadOnlySequence, and that's the data you can then parse through in the correct order.
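(A minimal sketch of the two ends of a pipe as described, assuming we're filling it from a Stream; the method names are made up.)

```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;

public static class PipeSketch
{
    // Writing end: the pipe hands us pooled memory, we fill it, advance and flush
    public static async Task FillPipeAsync(Stream source, PipeWriter writer)
    {
        while (true)
        {
            Memory<byte> memory = writer.GetMemory(4096);
            int bytesRead = await source.ReadAsync(memory);
            if (bytesRead == 0) break;

            writer.Advance(bytesRead);
            await writer.FlushAsync(); // makes the data visible to the reader
        }

        await writer.CompleteAsync();
    }

    // Reading end: ReadAsync completes as soon as flushed data is available
    public static async Task ReadPipeAsync(PipeReader reader)
    {
        while (true)
        {
            ReadResult result = await reader.ReadAsync();
            ReadOnlySequence<byte> buffer = result.Buffer;

            // ... parse as many complete items as possible from 'buffer' here ...

            // Tell the pipe what we've consumed so its blocks can be recycled
            reader.AdvanceTo(buffer.End);

            if (result.IsCompleted) break;
        }

        await reader.CompleteAsync();
    }
}
```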
The next scenario that demonstrates this, again from a previous role, and pretty much the same microservice I think, was where we were receiving an object from S3, the blob store in AWS. It was a tab-separated file in compressed format, and our job was to open the file, decompress it and parse out three of the 25 columns within it, which were then going to be stored as a document into Elasticsearch. Again, I'll show you the before and after code so we can get a feel for what that looks like. The before code is actually pretty
tight; it's just that block of code there. If you're reading ahead of what I'm talking about, you'll probably start seeing where the problems are. This was written by a colleague, and we hadn't noticed the code going in. The first indication that this service had some memory issues was that it was starting to hit container limits. It was running as a scaled container, and we had a reasonably high memory limit set, but we kept hitting it; we were like, this doesn't make any sense, we keep increasing the
memory limit and we keep hitting it. So we looked at the code, and you can kind of see the problem if you stare at it. In this case I'm using a FileStream rather than AWS, but basically if you read a file from AWS you get back a stream. We then use GZip to decompress it; that's fair enough. Where things got a little weird was this creation of a MemoryStream, and then this copy from the decompressed stream into the MemoryStream, that's happening here. Then that decompressed stream
we've copied into is cast to an array, or stored as an array, in order to pass it into Encoding.UTF8 to get a string. So at this point we've ended up, after various copies of this memory, with a string that represents the entire contents of the file. These files were 10 to 20,000 lines of tab-separated data, so that's a reasonably large string, and all of these are actually large-object-heap-style allocations. Then that data was passed into this CloudWatch parser, and the reason, I think, at the
time for the string allocation stuff was that the author of this code was using TinyCsvParser, which I think at the time maybe only dealt with strings; I'm sure it doesn't now. TinyCsvParser made it very easy to parse CSV or tab-separated files, basically using this LINQ-style syntax here to read through and split on the lines, and then it had this mapping syntax down below to say: map me from these columns, 0, 1 and 10, onto this object and give me the object back. So it's quite simple
code that's easy to use because of this library, but if you look at the code above, you can kind of see where there's a lot of copying going on. So we had a crack at rewriting this, and the new code is a little bit longer and might be a little harder to understand, but we optimized away a lot of the overhead. We still have to decompress the file, so that still happens as before, but as soon
as we've got that decompression stream, we use PipeReader.Create here to start creating a pipe. Now, strictly speaking, we probably didn't need a pipe in this scenario, because we only dealt with the data once, but we did think we were going to do some further processing through further parts of that pipe-reading process, which made it more reasonable at the time. It uses the same code we just looked at: we call ReadAsync, we access the buffer once we've got some data, and then this parse-line method here
takes that buffer; the buffer is that ReadOnlySequence<byte>. To make it easier to work with these sequences, because there's a little bit of complexity in understanding whether there's one element in there or multiple parts to the sequence, this SequenceReader struct was introduced, and we can create an instance of that to help us read through the sequence. It has convenience methods on it, like TryReadToAny, so we can say: try to read to any instance of a newline character, and if you find one, we know we've got
enough data within the current sequence to give us a line from the file, and it returns us that line. So this loop here is basically going to read until there's nothing further available from that sequence reader; if we don't have a line, we break, and then we wait for more data to be flushed in. Once we have at least enough data for one line we can parse it, and at this point we're working with a ReadOnlySpan of those bytes. The new code basically pre-creates an instance of the type that's
going to be stored into Elasticsearch, which gets populated by this code. It then loops, checking for the index of the tab character, and then it takes the strings we need, the three columns we care about, so at tab count 0, 1 and 10 we extract the string by slicing the data that precedes that tab character. The code below is just going to update our view over that line as we move forward, so it keeps slicing it: as we slice past the tab it gives us a new
view containing the remainder, and we update the tab count on each iteration. This loop can break after we've read the first 11 tabs, because those are the ones we care about; after that we don't need to parse the file, or the line, any further. That's pretty much it for the code.
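(A rough sketch of that SequenceReader-based parsing; the TsvParser class and method names are made up stand-ins for the real code.)

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Text;

public static class TsvParser
{
    // Pull complete lines out of whatever has been flushed into the pipe so far
    public static void ParseLines(ReadOnlySequence<byte> buffer)
    {
        var reader = new SequenceReader<byte>(buffer);

        while (reader.TryReadTo(out ReadOnlySpan<byte> line, (byte)'\n'))
        {
            ParseLine(line);
        }
    }

    private static List<string> ParseLine(ReadOnlySpan<byte> line)
    {
        var wanted = new List<string>();
        var tabCount = 0;

        // We only care about columns 0, 1 and 10, so we can stop after the 11th tab
        while (tabCount <= 10)
        {
            int tab = line.IndexOf((byte)'\t');
            if (tab < 0) break;

            if (tabCount is 0 or 1 or 10)
            {
                // Only materialize strings for the columns we actually need
                wanted.Add(Encoding.UTF8.GetString(line.Slice(0, tab)));
            }

            line = line.Slice(tab + 1); // re-slice: same memory, smaller view
            tabCount++;
        }

        return wanted;
    }
}
```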
So over to the results. In the original benchmarking of this we used one file of 10,000 rows, to give us a vaguely reasonable real-world measurement of the overhead. You can see the execution time has gone down, but we're still only talking milliseconds, so execution time wasn't our issue, though we got a nice 80% gain anyway. The issue for us was the allocations, and you can see over here that the original code, just for that one file, was allocating 100 megabytes, most of those on the large object heap as well, requiring a gen 2 collection later on to clear them up, and as you can see from the numbers in that gen 2 column, there are a lot more allocations occurring in
gen 2 in that original benchmark. In the new example we were down to 3.32 MB, a good 97% reduction. You might think, well, 3.32 MB is still pretty high, but once we did memory profiling of the optimized code, we checked, and actually 2.85 of those were the strings we expected to create: we needed three strings from each row that we then used to create the object we store into Elasticsearch, and at the time those were unavoidable, so those we accept. The remaining half a megabyte or so of overhead, some of
that's coming from pipelines, some of it from somewhere else. I was quite keen to dig deeper, but a colleague pulled me back and said, actually, we don't care about that little bit of extra allocation. I wish I still had access to the code, because I'm still curious to see if we could get down to nearly zero overhead, but, as I say, sometimes once you've optimized to a certain point, your goal's achieved, move on. So I thank my colleague for stopping me going down that rabbit hole.
The final API is System.Text.Json. How are we doing on time? Good, we're doing well, so we'll quickly talk about System.Text.Json. This again has been around for a while; it was introduced, I think, in .NET Core 3.0, and it's a new set of in-the-box APIs for working with JSON data. I think most people have probably heard about them now. At the time there was a big controversy, because we had Newtonsoft.Json, a nice open source library for JSON processing that nearly every app used if it worked with JSON,
and this felt to some people like Microsoft stamping on open source and creating their own thing. Now, James Newton-King, who created Newtonsoft.Json, did validate that this had merit; partly he was employed by Microsoft already at that point, so maybe he had to say that, but there were some good reasons. One of them was that Newtonsoft.Json was written well before things like Span<T> and pipelines were available, so it would be
a big rewrite to fully optimize it using those new concepts. The other thing was that by this point JSON has really become the de facto format in most applications; many apps, particularly ASP.NET Core apps doing web APIs, for example, are all going to need some kind of JSON parsing, so having something in the box from .NET, supported, optimized and maintained by Microsoft, does have some merit, because it's quite an important set of APIs that most other languages or frameworks package for
But whether or not you're happy with the idea, the performance gains do tend to make it worthwhile. It's actually made up of three main levels. There are the low-level APIs, Utf8JsonReader and Utf8JsonWriter; if you drop down this far you can get to almost, and sometimes genuinely, zero allocation overhead when parsing bytes of JSON data in an application, but as we'll see when we look at the demo, the code can get quite complex, or quite deep. There's a mid-level offering for reading JSON data, JsonDocument, which lets us more easily work our way through a JSON document's structure, looking at the elements we want and locating a particular property within the JSON in a quite efficient, single-read way. And then there's the high-level API, JsonSerializer, which has things like SerializeAsync and DeserializeAsync, the traditional kinds of APIs you'd expect from a JSON library. Most people will just use the high level, and that will generally give you performance gains over something like Newtonsoft.Json, because under the hood it's been written to use things like spans and pipelines and efficient memory handling behind the scenes; just by switching to this library you can get those gains. It used to be the case that the APIs in System.Text.Json were quite limited, but since that 3.0 time frame the team have really built on them, so a lot of the functionality that was available in Newtonsoft.Json is now possible in System.Text.Json.
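To give a feel for the upper two levels, here's a minimal sketch, assuming a recent .NET version; the payload and the BulkSummary type are invented for the example, not any real response model:

```csharp
using System;
using System.Text.Json;

// A made-up payload purely for illustration.
byte[] jsonBytes = """{ "errors": false, "took": 12 }"""u8.ToArray();

// High level: JsonSerializer, the familiar serialize/deserialize style most code will use.
var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
var summary = JsonSerializer.Deserialize<BulkSummary>(jsonBytes, options);
Console.WriteLine($"errors: {summary!.Errors}, took: {summary.Took} ms");

// Mid level: JsonDocument, parse once and navigate to just the element you care about.
using (var doc = JsonDocument.Parse(jsonBytes))
{
    bool hadErrors = doc.RootElement.GetProperty("errors").GetBoolean();
    Console.WriteLine($"errors via JsonDocument: {hadErrors}");
}

// Hypothetical type for this sketch only.
public sealed record BulkSummary(bool Errors, int Took);
```

The low-level Utf8JsonReader is the one I'll show in the bulk response example in a moment.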
I'm not saying the APIs are 100% compatible; you're going to have to rewrite and update your code if you do the migration, but most of what you needed to do is probably possible now. So to put this one into practice: the example we had for this was that, at the time, we were working with Elasticsearch, indexing a lot of data from these messages, and we used its bulk API to do that. In short, the bulk API allows you to send a newline-delimited set of JSON operations to perform. Rather than sending a write request for every document, every piece of data you want to store, you can package those up into, say, a hundred or a thousand write operations and send them in one HTTP call, so you're reducing your HTTP overhead quite significantly if you've got a lot of data to index. This bulk API returns you a JSON response, and that response tells you whether any of those operations failed, because it's going to try to do all of them for you; if one of them fails but 99 out of 100 succeed, you need to know about that, so you need to check that errors property. It also tells you how long it all took, and it gives you the detail for each item, for each operation: the specific details of its success or failure, status codes and so on. What we needed to do was check that bulk response and look at all the IDs that failed, so that we could put them into what was essentially a dead-letter queue for further investigation.
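Just to make the shape of that response concrete, it looks roughly like this; a trimmed, illustrative sketch rather than a verbatim Elasticsearch payload:

```csharp
// Roughly the shape of a bulk response (simplified): a top-level "errors" flag,
// the total time taken, and one entry per operation with its status and id.
var sampleBulkResponse = """
{
  "took": 30,
  "errors": true,
  "items": [
    { "index": { "_id": "doc-1", "status": 201 } },
    { "index": { "_id": "doc-2", "status": 400, "error": { "type": "mapper_parsing_exception" } } }
  ]
}
""";
```

Both of the parsing sketches that follow assume a payload along these lines.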
We didn't expect that to happen very often, but it still needed to be considered. I don't want to steal the thunder and give you the results early, so let me show you the code first. The original code for this is super short; it fits on one screen even at this stupid font size. Basically this is Newtonsoft.Json: we ultimately created a JsonSerializer and used it to deserialize the response into this BulkResponse type. So this says: take the bytes that have come over the wire and deserialize them into this type, and you can see it has how long it took, the errors flag, and then the items, which are the results for each operation. We first deserialize that response into that object model, then check whether there were any errors. If there were no errors, then this method, which returns this value tuple here, can return success: true and an empty set of errors; in the happy path, which we expect most of the time, that's what's going to happen. If there were errors, we need to look through the items that were returned, find those with a 400 status code, for example, find the IDs, and return those so we can identify which of those messages failed. So: pretty terse code, pretty straightforward.
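I no longer have the original source, but in spirit it looked something like this. The type names, the `_id` mapping and the returned tuple shape are my reconstruction for illustration, not the actual production code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;

// Reconstruction in the spirit of the original code; the real types and names differ.
public sealed class BulkResponse
{
    public int Took { get; set; }
    public bool Errors { get; set; }
    public List<BulkItem> Items { get; set; } = new();
}

public sealed class BulkItem
{
    public BulkOperation? Index { get; set; }
}

public sealed class BulkOperation
{
    [JsonProperty("_id")]
    public string? Id { get; set; }
    public int Status { get; set; }
}

public static class BulkResponseChecker
{
    public static (bool Success, IReadOnlyList<string> FailedIds) Check(string json)
    {
        // Deserialize the whole response into an object model up front.
        var response = JsonConvert.DeserializeObject<BulkResponse>(json)!;

        // Happy path: no errors reported, nothing more to inspect.
        if (!response.Errors)
            return (true, Array.Empty<string>());

        // Otherwise walk the items and collect the ids of the failed operations.
        var failedIds = response.Items
            .Where(i => i.Index is { Status: >= 400, Id: not null })
            .Select(i => i.Index!.Id!)
            .ToList();

        return (false, failedIds);
    }
}
```

Readable, compact, and completely fine unless the allocation profile actually matters to you.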
The new code, as you can see, is longer, and this is why the readability-versus-maintainability trade-off matters: if you don't need to optimize, and Newtonsoft.Json is working fine and the app is perfectly happy, probably stick with it, or at most switch to the high-level API of System.Text.Json. But if you really want to chase the high performance, as we did, partly just as an exercise at the time, then what we do is basically work with the streaming API. We could have used pipelines here, but we're using the array pool to rent a buffer; it's hard-coded in the demo because I know how much data is coming. We read from the stream into that buffer, then ultimately go into this parse-errors code, which takes a bunch of state, and it uses the low-level Utf8JsonReader API to start reading from that ReadOnlySpan<byte> of JSON data. Utf8JsonReader is a re-entrant type, so it can carry some state of its own, and you can keep calling it as you stream, as you get more bytes off the wire. What you do is call Read to read the next JSON token, and this code then switches on specific JSON token types, like start object, start array and so on. Because we know the structure of the JSON we're expecting, we can heavily optimize the code: it works its way through, tracking its position, until we've read a property whose name is errors, so we know we've found the errors property, at which point we set a state flag. Once we're there, the next token we read is either true or false, the value for that property, and we're hoping it's going to be errors: false. At that point we can set another piece of state, and what this code will do is break: once we've found the errors property and found that there are no errors, it breaks; otherwise it has to keep going, because it needs to read all the items to find the IDs of the failures. I won't go through all of the code, but that's essentially what it's doing. And then the code above this, after reading part of the data that's been streamed in, if we've found the errors property and there are no errors, we don't even stream the rest of the data, because we know we don't care about it; we can just stop streaming. In other scenarios we have to stream further.
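Again, this isn't the production code; it's a compressed sketch of the pattern with a couple of simplifications I should flag: the rented buffer is assumed to be big enough for the whole response (the real code streamed in chunks and fed the reader re-entrantly), and the method and type names are invented for the example:

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Text.Json;

public static class BulkResponseParser
{
    // Simplified sketch: read the response into a rented buffer, then let
    // Utf8JsonReader walk the tokens until it reaches the "errors" property.
    public static bool HasErrors(Stream response, int expectedLength)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(expectedLength);
        try
        {
            int total = 0;
            int read;
            while ((read = response.Read(buffer, total, buffer.Length - total)) > 0)
                total += read;

            var reader = new Utf8JsonReader(buffer.AsSpan(0, total));
            bool errorsPropertyFound = false;

            while (reader.Read())
            {
                if (!errorsPropertyFound)
                {
                    if (reader.TokenType == JsonTokenType.PropertyName && reader.ValueTextEquals("errors"))
                        errorsPropertyFound = true; // next token is the property's value
                }
                else
                {
                    // Short-circuit: the token after the property name is true or false.
                    return reader.GetBoolean();
                }
            }

            return false; // property not found; treat as no errors in this sketch
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

The short-circuit is the whole point: in the happy path the reader touches a handful of tokens near the start of the payload and never materializes an object model at all.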
So let me quickly show you the results before I wrap up. In the failure scenario, the improvement from the optimized code was 67% in execution time; again, we were mostly concerned with allocations, and you can see we've gone from about 100 KB to about 16 KB of allocation overhead. But that's the failure response, the response we don't expect very often; the one we do expect most of the time is the success, and that's what we wanted to optimize for. In that scenario, because of the short-circuiting code that basically reads the first property of the JSON that comes back, then breaks out and stops streaming, we're down 99.9% in execution time, because we short-circuit out of this processing code in about 200 ns, and the allocation overhead is now zero: we've already streamed the data into the buffer, and the actual processing is just reading through those tokens using the low-level API. So that was a pretty good gain we could get there by dropping down to those APIs. I've got about a minute or so left, so quickly: how do you get business buy-in to do this? Maybe you like what you've seen today, so when you go back to your office, start thinking about where you've got quick wins. If you've got an app where people are always complaining about how slow it is, or how slow a particular part of it is, maybe that's somewhere to look at optimizing.
Be scientific about how you do it: get the benchmarks and the profiling to validate where you should improve and how long the existing code takes to run. If you're going to business management, though, don't just say "I can save 87% of the byte allocations on this method", because most managers above engineering will probably go "yeah, I don't care". Show them some money they can save; give them a cost-to-benefit ratio. A very basic example of that: for the service we were looking at, where we could use a lot of the techniques I've shown you today, we worked out we could reduce about 50% of our allocations and roughly double the per-instance throughput of those container services when they were doing their processing. What that ultimately let us say was that we could drop one VM a year from the cluster underneath those containers, saving about $1,700. That might not be enough to justify the engineering time that went into it, so do be aware of that. But if this can be scaled out in our microservice environment, where we had hundreds of services doing very similar things, then we worked out we could maybe save 10x or 100x that across those services, and that gain might be more compelling. And my daughter's coming to the stage (hello, yes, Daddy's nearly finished), so, a quick summary. Bye-bye. Sorry about that. The important thing, when you're writing your code (do you want to come here? yes, okay), is to measure your code, don't assume; making assumptions instead of benchmarking is dangerous.
Be scientific about how you analyze the results, so use tools like BenchmarkDotNet on the hot paths (I'm glad she's agreeing). Focus on the hot paths; don't spend your time on methods that are called once or twice in your application's lifetime, because those hot paths are where your time is best spent. Don't copy memory, slice it, and that will ensure you avoid a lot of the overhead and the time spent working with that data. Array pools are really useful and an easy switch to avoid short-lived array allocations (yes). Pipelines can be quite good if you're in a niche scenario where you do a lot of data streaming, and consider the System.Text.Json APIs for high-performance JSON handling. The last thing I want to show you is this book; I learned a lot of the techniques about memory management from it. It's about that thick, around 1,100 pages in English, and it's written by Konrad. It's a really good book to have; I have the physical copy because it's also good as a weapon if someone breaks in at night. It tells you a lot about how memory works in .NET, but it also talks about Span<T> and how the implementation works. Yes, you can go. So thank you for listening (I'm going to pop you down, there goes Mummy), and if you want to follow me or ask questions afterwards, that's where you can find me, and that link has all the slides. Thank you very much. I don't have time for questions on stage, but grab me, I'll hang around if you have questions. Thank you very much for coming.