AWS re:Invent 2021 | Designing With Simplicity At Amazon
Hi! I’m Colm MacCárthaigh, I’m a senior principal engineer here at Amazon Web Services, and today I’m going to talk about how we design with simplicity at Amazon. This is part of our Amazon Builders’ Library series. If you’re not aware of it already, we launched it at re:Invent last year, in 2019, and we’ve been building this repository, our library really, of interesting articles from firsthand experts here at Amazon: their experiences building, designing, operating, and running things, and the lessons we’ve learned over the years.
There are also cool articles there, go check them out. Across all those articles you may notice the theme of simplicity; it’s something we try to foster and encourage a lot at Amazon. In fact, it’s one of our Amazon Leadership Principles. We have a series of Leadership Principles; if you’re curious about what they all are, you can just search for Amazon Leadership Principles and you’ll find the full list. One of them is called Invent and Simplify, and we encourage ourselves and each other to constantly be inventing things for customers, constantly innovating, and coming up with new solutions for customers’ needs, and new cool things that we can offer them and delight customers with.
But at the same time, we want to lean on simplicity. We want to make sure that as we build and engineer and create new things, we don’t add a huge burden of complexity that’s going to weigh systems down, whether that’s our technical systems, making them very hard to use and operate, or our business systems, making it hard to run a business for customers. So, we pair those themes together: as we’re inventing, as we’re innovating, as we’re engineering, we always strive to simplify. But simplicity is a tough thing to really get into. It really raises the question: what do we mean by simplicity, what is simple? A lot of experience-hardened folks who have built a number of things do get to know, and get a good feel for, simplicity, in a kind of you-know-it-when-you-see-it way.
It can be pretty hard to have a very straightforward set of rules where, if you apply these tests and meet those conditions, then it’s simple, and otherwise it’s complex. People do try, and there are some common definitions and some common ways of thinking about it. Probably the two most common ways to approach this are these. One definition of simplicity is that things are simple when they have fewer moving parts, right? That’s a very engineering kind of approach, a very mechanical way of thinking: you’re looking at the system, you’re literally counting moving parts, and you’re saying, “Well, this system that has two gears is simpler than that system that has 10 gears.” That can feel intuitive: fewer components means fewer things that could break, fewer things to understand, and so on. Another very common definition of simplicity is just that simple means something’s very easy to understand and easy to use, right? Something somebody can quickly adapt to and get a feel for. I really want to emphasize that second definition, that kind of humanistic definition of simplicity, as we go through what simplicity means here at Amazon. I’m going to cover some examples of how we’ve built some systems that I think are simple.
Because I think a great way to really get a feel for simplicity is to look at examples, to learn from what others have done before, and to see how those systems have done over time. The definition I’m using in my head is that second one: how easy is it to understand and use these systems over time? I think a really good illustration of this, and of why that definition suits me better, and suits us better here at Amazon, and is what we mean by simplicity, is to look at two designs for locomotion: a unicycle and a bicycle.
A unicycle absolutely has fewer moving parts than a bicycle, for sure. You can count them: it’s got one wheel instead of two, it usually does not have gears, just pedals directly attached to the wheel, and a stick and a seat. It’s an incredibly simple machine if you’re just counting moving parts. But there’s no way that it’s really a simpler solution for locomotion, for getting from A to B. A unicycle is incredibly hard to use: you’ve got to have pretty good balance, and you’ve got to have put in quite a decent number of hours to really master it and get to the point where you’re not falling over and you can get from A to B. Meanwhile, a bicycle has two wheels and gears, there are more pieces of aluminum or steel involved, and any way you count it, it’s got more moving parts.
But it’s a much, much more successful design, right? It does take a bit to learn to ride a bicycle, but even most kids can learn how to ride a bike in a day, and you can ease into it with training wheels and so on. After that it’s a very usable, very practical machine that has stood a very long test of time, and a bicycle is definitely simpler, and better, than a unicycle if you measure it that way. You can extend that argument again and say, “Well, a tricycle with three wheels, that’s even simpler than a bicycle, right? Because you can get right on the tricycle and get going; you don’t need to learn any of the balance it takes to ride a bike.”
But then you start to see that the tricycle is probably a more complex system, and that adding that extra wheel creates all sorts of problems: it’s harder to store somewhere, it’s harder to maintain, all that kind of stuff. So, simple often tends to be some nice sweet spot in between a set of tradeoffs.
If you look at these designs just through the lens of mechanical elegance, the engineer, the mechanically minded, might look at the unicycle and say, “Well, isn’t that a beautiful, elegant machine? It’s got very, very few moving parts, and you can get going with it as long as you’re willing to put the time in.” But I think this example shows that elegance can be really deceptive; it’s better to have that more humanistic take. Let’s look at how it succeeds as a device actually being used in the real world, and let that be our measure of simplicity, because that’s ultimately what matters. And why does that matter to us?
Why do we value simplicity so much? I think our experience is that simplicity is highly correlated with some very desirable properties. Security is our top priority here at AWS, and simple systems tend to be easier to keep secure than complex systems. There are just fewer points of entry, a smaller attack surface, if you want to think about it that way. There are just fewer places in the design where there can be mismatches between one coder’s expectations and another coder’s expectations, and where you might have leaky abstractions and so on, where security issues can arise.
You can help minimize all of that by keeping systems simple. We also value durability a lot; we never want to lose data. So, we want to have simple, robust mechanisms that make sure we’re not going to lose any data. Reliability: we value that incredibly highly, and what I said earlier is still true: the fewer moving parts you have, the less, generally, that can go wrong.
My example covered how too few moving parts can be a bad thing, but you still want to have only as many parts as you really need, right? Because you want your system to be reliable, and you don’t want a huge number of components where, if any of them goes wrong, you’re in trouble. Simple also tends to keep costs down. Simple systems are usually easier, faster, and cheaper to build, so we can pass on all those savings and have cheaper offerings, which everybody likes.
Simple things are just easier to manage; I mean, that’s almost a tautology. The way I’m defining simple is by saying it is really measured by how easy things are to use, operate, and manage, and that’s just always a desirable thing in any line of business.
Then, the other reason why I think we want to emphasize simplicity so much is that we think it’s still the early days of the internet, and of online systems and highly available systems, and we always want to be first to find those really great solutions and components that are going to really stand the test of time. Things that are going to be reused over and over and over for decades, and power the industry, and make sure that we can have incredibly highly available and very highly secure systems that can be used in all sorts of walks of life and business. Being able to get to that quicker is just a huge saver for everybody; it advances the whole industry much, much more quickly. It’s a big key to our long-term success. And just a fact of life, at least working here at AWS, is that we’ve been growing quite quickly as a business for as long as I’ve worked here; I joined in 2008.
So, you’re just constantly hiring folks, and expanding, and growing teams, and the set of folks who are on the service team three or four years from now might be very different from the set of folks now. So, we’re just constantly trying to build systems that are going to be sustainable and maintainable, and that work across those boundaries.
I’m going to cover five examples of simple, like I said.
I’ve picked each of these examples from our history. They are a diverse set of examples that cover different things, but what they each have in common, I hope, is that they’ve actually stood the test of time and have helped customers solve problems.
The first example I want to cover is Availability Zones. Before Availability Zones came along, customers, and also we ourselves, would struggle to really think about, “How do I build meaningful resilience and redundancy into my system?”
I know that I need to have some redundancy or failover capability, but I also need to make sure that if I’m going to have a problem, it can’t happen to both my primary and my secondary at the same time. That would be bad; I would have an outage, and no one wants that. But I also might have other layers and other systems that have the same requirements, and how do I reason about all of these at the same time? I need to think hard about shared fate, and about avoiding it. There are things that don’t happen very often, and we are incredibly good at managing them, but there can be power issues or networking issues that can take out an entire building at the same time, or there are extreme weather events or earthquakes and so on, things that are completely out of anybody’s control.
They can also impact things at the level of a physical location. So, builders just need to be able to reason about all of those. Availability Zones are just an amazing simplifier. Instead of having to think about this gigantic sea of infrastructure and checking every single little node that you pick (Well, is this one really redundant as compared to this other one? Do they possibly share fate?), we just have Availability Zones, and we tell customers: “Hey, if you’ve got two things in this same Availability Zone, they could have a problem at the same time, in theory, someday, so don’t make one redundant for the other; that’s not going to work.” That’s very easy to understand and use, in my experience. We use this concept over and over. We invented it because of the core low-level infrastructure risks, like power and networking as I mentioned, but we figured that since we have it, we might as well also use it for things like deployment risks. When we deploy new software, we do it to just one Availability Zone at a time in a given region, and so if there’s any problem with that software, it can only impact that one Availability Zone, and customers still get the benefits of Availability Zones.
We’ve done other things over time too. Occasionally we’ve aligned hardware types with different Availability Zones: you might use a component made in a certain factory, or by a certain manufacturer, in this Availability Zone but not the others. If there’s any kind of correlated issue or problem with the hardware type, that also gets the benefit of the Availability Zone design, which is really neat.
Over time we’ve actually even learned to simplify this concept further. When we first made Availability Zones available, your us-east-1a, as one customer, might be a certain physical building in a certain physical data center, but my us-east-1a might be a different physical building in a different physical data center. The reason for that was we were a little worried that everyone might just choose A, because it comes first in the console, and we’d get all this infrastructure just landing in the same spot and have this terrible imbalance of resources.
So, we randomized the mapping across accounts. It turns out that over time customers understood Availability Zones and got really good at placing infrastructure diversely, and their tools, and other tools we built, kind of naturally spread things out, so we didn’t need that anymore. We were able to change to a model where, for new accounts at least, everybody’s A is the same and everybody’s B is the same. So, you can even reason about these Availability Zones across customer accounts, which is another nice simplifier.
My next example is in a totally different area; that was all about physical infrastructure, how we model redundancy and keep things separate. My next example is related to security, and it’s the SIGv4 authentication protocol.
That is what powers authentication and authorization for almost every API request that is made to an AWS service. As you can tell from the name, SIGv4 is not our first attempt at an authentication protocol. We had some others before; really only SIGv2 got significant usage outside of Amazon.
But we’ve made refinements and simplifications across time, and landed on SIGv4 quite a long time ago, I think over 10 years ago now. It’s the core way all API requests are authenticated, but it’s also available to customers. So, if you want to use SIGv4 to authenticate your own requests, you can; it’s part of API Gateway. What I find simple about SIGv4, and why I think it’s stood a really good, long test of time, is this: it uses cryptography, which is normally a very complicated, complex thing, and you see a lot of changes in cryptographic systems over time. Think about SSL or TLS, which have gone through many versions, with people choosing different algorithms and so on. SIGv4 is actually implemented in an incredibly small amount of code; it’s a very simple algorithm. You could write an implementation of SIGv4 in a few hours if you’re a programmer. We have the specification published if you ever wanted to replicate it. It’s all built on top of a cryptographic algorithm called HMAC-SHA256. I’m not going to explain what that is or how it works under the hood, but you can go through our diagrams of how SIGv4 works on the website if you’re curious.
But at a high level, we take a request when it comes in, we canonicalize the request headers into a canonical order, then we derive some signing keys from things like the current time, the region, and the AWS access key, and with all of these it uses that HMAC algorithm that I talked about and authenticates the request. For something like authentication, it is stunningly simple. There’s very little going on, and it’s very robust, and that also means there’s just very little surface area exposed for any kind of potential security issue before a request is authenticated. It’s been really cool; I just wanted to call it out as an example of a really simple design that’s survived a good, long test of time.
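To give a feel for how little code is involved, here is a minimal Python sketch of the key-derivation and signing steps described above. It follows the chain of HMAC-SHA256 calls in the published Signature Version 4 specification (date, then region, then service, then a fixed terminator string); the service name is part of that specification rather than something mentioned in the talk. The function names and the simplified `canonical_headers` helper are illustrative assumptions, and the sketch leaves out building the full canonical request and string to sign, so it is not a complete or authoritative implementation.

```python
import hashlib
import hmac
from typing import Dict


def hmac_sha256(key: bytes, message: str) -> bytes:
    """One HMAC-SHA256 step; the whole scheme is built from this primitive."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def canonical_headers(headers: Dict[str, str]) -> str:
    """Simplified canonicalization: lowercase names, collapse whitespace, sort by name."""
    normalized = sorted((name.lower(), " ".join(value.split())) for name, value in headers.items())
    return "".join(f"{name}:{value}\n" for name, value in normalized)


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive a key scoped to one day, one region, and one service."""
    k_date = hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = hmac_sha256(k_date, region)
    k_service = hmac_sha256(k_region, service)
    return hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Final signature: hex-encoded HMAC-SHA256 over the string to sign."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the derived key is scoped to a date, region, and service, a key derived for one scope produces different signatures from a key derived for another, which is part of why so little else is needed around it.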
The third example brings us back to traditional infrastructure resilience and redundancy, and that’s DNS health checks. A long time ago we decided that, to make our own services extremely highly available and reliable, we would use DNS health checks as this kind of core primitive for ensuring things are highly available. We’ve actually built DNS health checks directly into Amazon Route 53. It’s a feature you can use as a customer, and many do, and it’s very successful, but we also use it ourselves a lot.
Amazon Elastic Load Balancing, Relational Database Service, and other services are actually creating and managing Route 53 health checks on customers’ behalf. That’s how they ensure that if ELB nodes have an issue, it’s DNS health checks that step in and route around that issue, or if an RDS primary database has an issue, for example, it’s DNS health checks that kick in and make sure that the secondary takes over. These health checks are happening all the time: Route 53 is constantly health checking these targets, whether they’re your own EC2 instances or ELB nodes or RDS instances. Because this is always happening, it’s a very highly reliable system. We have even seen an entire Availability Zone suffer a power issue, so we lose power and those instances become unavailable.
These DNS health checks notice right away and stop routing traffic to those instances right away. There’s no need for us to make any API calls or reconfigure Route 53 in any way; it just does it without intervention.
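The idea above, that answers are filtered by health checks that run all the time so no reconfiguration is needed during an event, can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Route 53 is implemented; the `Endpoint` type, the `answer_query` function, the zone names, and the simulated `probe` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Endpoint:
    address: str  # documentation-range IP addresses, for illustration only
    zone: str     # hypothetical zone label


def answer_query(endpoints: List[Endpoint], is_healthy: Callable[[Endpoint], bool]) -> List[str]:
    """Answer with only the addresses whose health check currently passes."""
    return [endpoint.address for endpoint in endpoints if is_healthy(endpoint)]


ENDPOINTS = [Endpoint("192.0.2.10", "zone-a"), Endpoint("192.0.2.20", "zone-b")]

# Simulated state of the world; a real checker would probe the endpoints continuously.
failed_zones = set()


def probe(endpoint: Endpoint) -> bool:
    """Stand-in health check: an endpoint fails if its whole zone has lost power."""
    return endpoint.zone not in failed_zones
```

If `"zone-a"` is added to `failed_zones`, the next call to `answer_query(ENDPOINTS, probe)` simply stops returning that zone's address; nothing has to be reconfigured, because the same filtering path runs on every query.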
That’s been something very simple that we’ve been able to build on top of over and over, and it strongly contributes towards the high levels of availability that we’ve been able to achieve.
The next simple pattern that I love, and see a lot at Amazon, is Rollback. By that I mean: when you are writing code at Amazon, obviously you’re expected to write good code, or make your best attempt to write something that’s bug free. None of us are perfect, especially me, so we all make mistakes, and then the next process that steps in is code review, and testing, and as a team making sure that we’re catching problems before they ever get to production.
But as hard as you try to get to perfection there, there’s just always some risk that there’s something in your code that you’re only going to notice when you finally do push it to production, because it’s triggered by maybe a very rare set of circumstances that were hard to cover in testing. It might be something only a few customers occasionally trigger, or some difference in the infrastructure or nature of production itself, or something triggered by extreme amounts of load, and so on.
So, we invest a lot in deployment safety here at Amazon, and we have Builders’ Library articles about that. We like to start our deployments small and increase the deployment size as we get more and more confidence in it. But what we’ve also learned is that fast Rollback at every stage is a really good key to success.
If we ever hit operational problems, probably the first thing we’re always saying is, “Okay, let’s roll it back!” We’re not going to waste time investigating why it’s failing; we’re just going to observe, “Hey, there is a recent change in this area, so let’s roll it back.” We’ve baked that way of thinking into our team culture pretty deeply. Teams and developers know that when they’re building a change or getting a new version of software out there, it always has to be ready to roll back, and they have to think about that and be prepared for that.
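The combination described above, starting small, widening the deployment as confidence grows, and rolling back right away on any sign of trouble, can be sketched as follows. This is a hedged illustration of the pattern only; `staged_deploy`, its callback parameters, and the stage names used in any example are hypothetical and do not describe Amazon's deployment tooling.

```python
from typing import Callable, List


def staged_deploy(
    stages: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy stage by stage; on the first unhealthy stage, roll back everything deployed so far.

    Returns True if every stage was deployed and stayed healthy, False if a rollback happened.
    """
    deployed: List[str] = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage)
        if not is_healthy(stage):
            # No investigation first: undo the recent change, newest stage first.
            for done in reversed(deployed):
                rollback(done)
            return False
    return True
```

Called with stages such as a single host followed by one zone at a time, a failure in an early stage means the later stages are never touched.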
For example, say I’m doing something like making a schema change, which is sometimes a breaking change if you do it very naively. You can’t really do that at Amazon; you have to split it up into a phased change, where you add a new column, or whatever you’re doing to the schema, and then you write in both the old format and the new format for a while, so that Rollback will always work. Then you push another version that stops writing to the old one, but you could still roll forward and roll back even for that deployment if you wanted to.
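A small Python sketch of that phased approach, using an in-memory dict as a stand-in for a table. The column names (`full_name`, `first_name`, `last_name`) and the function names are invented for illustration. The transitional writer fills in both the old and the new format, so a reader from the previous version keeps working after a rollback, and the new reader tolerates rows written before the change.

```python
from typing import Dict

Store = Dict[str, Dict[str, str]]


def write_old(store: Store, user_id: str, full_name: str) -> None:
    """Previous version: writes only the old column."""
    store[user_id] = {"full_name": full_name}


def write_transitional(store: Store, user_id: str, first: str, last: str) -> None:
    """Transitional version: writes the new columns and keeps the old one populated."""
    store[user_id] = {
        "full_name": f"{first} {last}",
        "first_name": first,
        "last_name": last,
    }


def read_old(store: Store, user_id: str) -> str:
    """Previous version's reader: only knows the old column."""
    return store[user_id]["full_name"]


def read_new(store: Store, user_id: str) -> str:
    """New reader: prefers the new columns, falls back to the old one for older rows."""
    row = store[user_id]
    if "first_name" in row and "last_name" in row:
        return f"{row['first_name']} {row['last_name']}"
    return row["full_name"]
```

Only once every reader understands the new columns would a later deployment stop writing `full_name`, which is the extra step the talk refers to.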
So, it takes a little bit more effort to get your full change out there; it takes a few more steps. We value Rollback so much that we go to that effort; we insist on it.
The last example I want to cover is something we call Static Stability, which you can find in the Builders’ Library articles. The concept of Static Stability hopefully will sound very simple. It’s essentially this: if the system does lose power or networking, it should come back more or less to the state it was in before that power event or networking event.
There are a few exceptions to that. For example, a primary database may have flipped away to a secondary, and if power comes back you don’t want it to become primary again right away; that’s not appropriate. But we do want the machine to boot and work, and the application to start, and health checks to start passing, and all of those kinds of things. This is harder than it looks, especially when you’re building highly distributed systems, and especially when there are all sorts of dependencies you could be using and depending on. You’ve got to think carefully about, “Well, what if they’re not quite available yet as I’m booting, and what if they also were impacted by this same event?” There’s a really tough case of this in encrypting durable data. Let’s say you want to encrypt all the data on a machine. Well, that means you need the encryption key on that machine, right?
You can’t store the encryption key in the encrypted volume; that doesn’t make sense, because then there’s a circular dependency. When you first boot the machine, how do you decrypt the volume?
You don’t have the key yet, right? So, you have to keep the key somewhere else. A traditional solution to this is, “Well, we’re just going to hide the key. We’re just going to stash it somewhere on a plain-text volume that isn’t encrypted, and we’re just going to give it a name that’s not easy to figure out, or whatever.” But then security engineers don’t like that; security engineers are like, “That’s terrible, you can’t do that, that’s not appropriate.” The other solution then is, “Okay, well, I can’t keep it on the box, so I’ll use it in memory for decrypting the disk, because I need it, but otherwise it is just not going to be on the box at all. When the system boots, we’re going to fetch that key from some key management service, right, and the key management service will authenticate me somehow and give me the key, and then I can decrypt the disk.”
But that’s not a particularly statically stable pattern. What if you boot but the key management service isn’t ready yet? That gets pretty hard. We have invested to the point where our AWS Nitro security chips, which are baked into our EC2 platform, have this secure enclave baked in. They can actually keep these keys in a way where they’re not accessible, but they can be used, and so you can manage this tradeoff between security and availability. You can have that static stability: when the system boots, it can actually get the key, because it has a copy of it in this enclave, and do the disk encryption and decryption that it needs without needing to worry about it, which is nice, a tremendous simplifier.
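The general static-stability idea, that a system should be able to boot back into its previous state even when a dependency is not ready yet, can be sketched with a last-known-good local copy of state. This Python sketch does not model the Nitro enclave or any key handling; `load_boot_state`, `cache_path`, and `fetch_remote` are hypothetical names, and the state is an ordinary JSON-serializable dict.

```python
import json
import os
from typing import Callable


def load_boot_state(cache_path: str, fetch_remote: Callable[[], dict]) -> dict:
    """Return boot state, falling back to the last known-good local copy.

    If the remote dependency answers, its state is persisted locally for next time.
    If it is unreachable, the previously persisted state is used instead, so the
    boot does not block on a dependency that may be hit by the same event.
    """
    try:
        state = fetch_remote()
    except (ConnectionError, TimeoutError):
        # Dependency not ready: come back in the state we were in before.
        # (If no local copy exists yet, this raises FileNotFoundError.)
        with open(cache_path, "r", encoding="utf-8") as cached:
            return json.load(cached)

    # Write to a temporary file and rename, so a crash mid-write cannot corrupt the cache.
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as out:
        json.dump(state, out)
    os.replace(tmp_path, cache_path)
    return state
```

The design choice worth noticing is that the fallback path is plain local I/O with no network calls, which is what lets it work during the very event that took the dependency away.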
That’s my last example. Hopefully, with those five, I have covered some inspiration for what simplicity really means to us and how we think about it: just keeping things nice, plain, and usable. And hopefully you’ve taken away that humanistic meaning that I’m emphasizing, and really want to focus on how these things work for people.
We want to build systems that use as few moving parts as necessary. That’ll be filtered through a survival-of-the-fittest process: if something doesn’t have enough components and can’t cope with stressful events, we’ll eventually have to add a component to make sure it can. Like I say, study more examples, and find them anywhere you can: other technology providers, other folks out there who’ve shared their own. Study those too, because I think it’s the best way to learn, and share them with me too; I’d always love to find more examples.
Thanks for following along, and thanks for watching everything in this talk. It really helps me, it really helps us, if you can also fill out the session survey. It lets us know what everybody is interested in, and how we can improve and refine all these talks for future versions.