
Chapter 1. Reliable, Scalable, and Maintainable Applications

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?

Alan Kay, in interview with Dr Dobb’s Journal (2012)


Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

Store data so that they, or another application, can find it again later (databases)

Remember the result of an expensive operation, to speed up reads (caches)

Allow users to search data by keyword or filter it in various ways (search indexes)

Send a message to another process, to be handled asynchronously (stream processing)

Periodically crunch a large amount of accumulated data (batch processing)

If that sounds painfully obvious, that’s just because these data systems are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.

But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.

This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.

In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We’ll clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.

Thinking about Data Systems

We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity—both store data for some time—they have very different access patterns, which means different performance characteristics, and thus very different implementations.

So why should we lump them all together under an umbrella term like data systems?

Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories <1>. For example, there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.

Secondly, an increasing number of applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of their data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.

For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 gives a glimpse of what this may look like (we will go into detail in later chapters).

Figure 1-1. One possible architecture for a data system that combines several components.

When you combine several tools in order to provide a service, the service’s interface or application programming interface (API) usually hides those implementation details from clients. Now you have essentially created a new, special-purpose data system from smaller, general-purpose components. Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients see consistent results. You are now not only an application developer, but also a data system designer.

If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?

There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization’s tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.

In this book, we focus on three concerns that are important in most software systems:


Reliability

The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See “Reliability”.

Scalability

As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth. See “Scalability”.

Maintainability

Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively. See “Maintainability”.

These words are often cast around without a clear understanding of what they mean. In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability, and maintainability. Then, in the following chapters, we will look at various techniques, architectures, and algorithms that are used in order to achieve those goals.


Reliability

Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:

The application performs the function that the user expected.

It can tolerate the user making mistakes or using the software in unexpected ways.

Its performance is good enough for the required use case, under the expected load and data volume.

The system prevents any unauthorized access and abuse.

If all those things together mean “working correctly,” then we can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.”

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space—good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.

Note that a fault is not the same as a failure <2>. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.

Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. Many critical bugs are actually due to poor error handling <3>; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. The Netflix Chaos Monkey <4> is an example of this approach.
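As a toy illustration of this idea (the task names, failure rate, and retry policy below are invented for the sketch, not taken from the text), the following injects a fault into roughly 30% of task attempts on every run, so the retry machinery is exercised constantly rather than only during a real incident:

```python
import random

def flaky_task(task_id, fail_rate, rng):
    """A unit of work with a deliberately injected chance of failing."""
    if rng.random() < fail_rate:
        raise RuntimeError(f"injected fault in task {task_id}")
    return f"result-{task_id}"

def run_with_retries(task_ids, fail_rate=0.3, max_attempts=50, seed=42):
    """Supervisor loop: retry each task until it succeeds.

    Because faults are injected on every run, this retry path is
    exercised all the time instead of only during a real outage.
    """
    rng = random.Random(seed)
    results = {}
    for task_id in task_ids:
        for _ in range(max_attempts):
            try:
                results[task_id] = flaky_task(task_id, fail_rate, rng)
                break
            except RuntimeError:
                continue  # the fault was tolerated; try again
    return results

completed = run_with_retries(range(100))
print(len(completed))  # 100: every task completed despite ~30% injected faults
```

If the retry loop were broken, the injected faults would surface the bug immediately in testing, rather than waiting for a naturally occurring fault in production.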

Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.

Hardware Faults

When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines.

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years <5, 6>. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
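A quick back-of-the-envelope check of that claim (an illustrative calculation, assuming failures are independent and spread uniformly over the MTTF):

```python
# Expected disk failures per day in a 10,000-disk cluster, for the
# MTTF range quoted in the text (10-50 years, midpoint ~27).
DAYS_PER_YEAR = 365

def failures_per_day(num_disks, mttf_years):
    # Each disk fails on average once per MTTF, so the fleet-wide
    # failure rate is num_disks / MTTF, here expressed per day.
    return num_disks / (mttf_years * DAYS_PER_YEAR)

for mttf in (10, 27, 50):
    print(f"MTTF {mttf:2d} years -> {failures_per_day(10_000, mttf):.2f} disks/day")
# MTTF 10 years -> 2.74 disks/day
# MTTF 27 years -> 1.01 disks/day
# MTTF 50 years -> 0.55 disks/day
```

At the midpoint of the quoted MTTF range, the rate comes out to almost exactly one dead disk per day, as the text says.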

Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.

Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability was absolutely essential.

However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services (AWS) it is fairly common for virtual machine instances to become unavailable without warning <7>, as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.

Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade; see Chapter 4).

Software Errors

We usually think of hardware faults as being random and independent from each other: one machine’s disk failing does not imply that another machine’s disk is going to fail. There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.

Another class of fault is a systematic error within the system <8>. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults <5>. Examples include:

A runaway process that uses up some shared resource—CPU time, memory, disk space, or network bandwidth.

A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.

The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason <11>.

There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found <12>.
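A minimal sketch of such a self-checking guarantee, using an invented in-memory queue (not any particular real system): the queue audits its own message counts on every operation and raises an alert the moment the invariant is violated.

```python
from collections import deque

class AuditedQueue:
    """An in-memory queue that continually checks the invariant
    'messages in == messages out + messages still queued' and
    raises an alert if it is ever violated."""

    def __init__(self):
        self._items = deque()
        self._enqueued = 0
        self._dequeued = 0

    def put(self, msg):
        self._items.append(msg)
        self._enqueued += 1
        self._audit()

    def get(self):
        msg = self._items.popleft()
        self._dequeued += 1
        self._audit()
        return msg

    def _audit(self):
        # The self-check: a discrepancy means a bug somewhere (e.g.,
        # lost or duplicated messages), so surface it immediately
        # rather than letting it silently corrupt downstream state.
        if self._enqueued != self._dequeued + len(self._items):
            raise AssertionError("message count discrepancy detected")

q = AuditedQueue()
for i in range(5):
    q.put(i)
print([q.get() for _ in range(5)])  # [0, 1, 2, 3, 4]
```

In a real system the audit would typically run periodically and emit an alert rather than an exception, but the principle is the same: the guarantee is verified continuously in production, not just assumed.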

Human Errors

Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages <13>.

How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:

Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.

Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).

Implement good management practices and training—a complex and important aspect, and beyond the scope of this book.

How Important Is Reliability?

Reliability is not just for nuclear power stations and air traffic control software—more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of ecommerce sites can have huge costs in terms of lost revenue and damage to reputation.

Even in “noncritical” applications we have a responsibility to our users. Consider a parent who stores all their pictures and videos of their children in your photo application <15>. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?

There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin)—but we should be very conscious of when we are cutting corners.


Scalability

Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.

Scalability is the term we use to describe a system’s ability to cope with increased load. Note, however, that it is not a one-dimensional label that we can attach to a system: it is meaningless to say “X is scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”

Describing Load

First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.

To make this idea more concrete, let’s consider Twitter as an example, using data published in November 2012 <16>. Two of Twitter’s main operations are:

Post tweet

A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).

Home timeline

A user can view tweets posted by the people they follow (300k requests/sec).

Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter’s scaling challenge is not primarily due to tweet volume, but due to fan-out—each user follows many people, and each user is followed by many people. There are broadly two ways of implementing these two operations:

1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time).

2. Maintain a cache for each user’s home timeline—like a mailbox of tweets for each recipient user. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.

Figure 1-2. Simple relational schema for implementing a Twitter home timeline.
Figure 1-3. Twitter’s data pipeline for delivering tweets to followers, with load parameters as of November 2012 <16>.

The first version of Twitter used approach 1, but the system struggled to keep up with the load of home timeline queries, so the company switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it’s preferable to do more work at write time and less at read time.

However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers. This means that a single tweet may result in over 30 million writes to home timelines! Doing this in a timely manner—Twitter tries to deliver tweets to followers within five seconds—is a significant challenge.
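The arithmetic in this paragraph can be reproduced directly (the final comparison is an extra illustration derived from the quoted numbers, not a figure from the text):

```python
# Fan-out write amplification for the Twitter example.
avg_tweets_per_sec = 4_600
avg_followers = 75

timeline_writes_per_sec = avg_tweets_per_sec * avg_followers
print(timeline_writes_per_sec)  # 345000 writes/sec to home timeline caches

# A single tweet from a user with 30 million followers generates as
# many timeline writes as the whole cluster's average workload does
# in about a minute and a half:
celebrity_followers = 30_000_000
seconds_at_avg_rate = celebrity_followers / timeline_writes_per_sec
print(round(seconds_at_avg_rate))  # 87
```

That ~87-second burst from one write is why delivering celebrity tweets within the five-second target is so hard under pure fan-out.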

In the example of Twitter, the distribution of followers per user (perhaps weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load. Your application may have very different characteristics, but you can apply similar principles to reasoning about its load.

The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and combined with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance. We will revisit this example in Chapter 12 after we have covered some more technical ground.

Describing Performance

Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:

When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?

When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Both questions require performance numbers, so let’s look briefly at describing the performance of a system.

In a batch processing system such as Hadoop, we usually care about throughput—the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. In online systems, what’s usually more important is the service’s response time—that is, the time between a client sending a request and receiving a response.

Latency and response time

Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service <17>.

Even if you only make the same request over and over again, you’ll get a slightly different response time on every try. In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.

In Figure 1-4, each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, e.g., because they process more data. But even in a scenario where you’d think all requests should take the same time, you get variation: random additional latency can be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack <18>, or many other causes.

Figure 1-4. Illustrating mean and percentiles: response times for a sample of 100 requests to a service.

It’s common to see the average response time of a service reported. (Strictly speaking, the term “average” doesn’t refer to any particular formula, but in practice it is usually understood as the arithmetic mean: given n values, add up all the values, and divide by n.) However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay.

Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that.

This makes the median a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer than the median. The median is also known as the 50th percentile, and sometimes abbreviated as p50. Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.

In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common (abbreviated p95, p99, and p999). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is illustrated in Figure 1-4.
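These percentiles can be computed directly from the sort-and-count definition just given (the sample of 100 response times below is invented for illustration—mostly fast, with a handful of slow outliers):

```python
def percentile(sorted_times, p):
    """Return the threshold that p% of the sorted response times fall below."""
    idx = min(int(len(sorted_times) * p / 100), len(sorted_times) - 1)
    return sorted_times[idx]

# 100 made-up response times in ms: mostly fast, with a few outliers.
times = sorted([20] * 50 + [45] * 40 + [150] * 8 + [1500, 2400])

print(percentile(times, 50))   # 45   (median: half the requests are faster)
print(percentile(times, 95))   # 150  (95 of 100 requests are faster)
print(percentile(times, 99))   # 2400 (99 of 100 requests are faster)
```

Note how the mean of this sample (about 72 ms) sits between the median and the tail and describes almost no request a user actually experienced, which is why percentiles are the more useful summary.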

High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases—that is, they’re the most valuable customers <19>. It’s important to keep those customers happy by ensuring the website is fast for them: Amazon has also observed that a 100 ms increase in response time reduces sales by 1% <20>, and others report that a 1-second slowdown reduces a customer satisfaction metric by 16% <21, 22>.

On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon’s purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing.

For example, percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service. An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.

Queueing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect sometimes known as head-of-line blocking. Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.

When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements <23>.

Percentiles in Practice

High percentiles become especially important in backend services that are called multiple times as part of serving a single end-user request. Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete. It takes just one slow call to make the entire end-user request slow, as illustrated in Figure 1-5. Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow (an effect known as tail latency amplification <24>).
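The amplification effect is easy to quantify: if each backend call independently has probability p of being slow, a request touching n backends hits at least one slow call with probability 1 − (1 − p)ⁿ. (The independence assumption is mine for the sketch; real backends often fail in correlated ways, which can make things worse.)

```python
def prob_slow_request(p_slow_call, num_backends):
    # Probability that at least one of the backend calls is slow.
    return 1 - (1 - p_slow_call) ** num_backends

p = 0.01  # 1% of backend calls are slower than some threshold (the p99)
for n in (1, 10, 100):
    print(f"{n:3d} backend calls -> {prob_slow_request(p, n):.1%} of requests slow")
#   1 backend calls ->  1.0% of requests slow
#  10 backend calls ->  9.6% of requests slow
# 100 backend calls -> 63.4% of requests slow
```

So a backend whose p99 looks perfectly acceptable in isolation can still make the majority of end-user requests slow once enough calls are composed, which is why the text emphasizes the high percentiles.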

If you want to add response time percentiles to the monitoring dashboards for your services, you need to efficiently calculate them on an ongoing basis. For example, you may want to keep a rolling window of response times of requests in the last 10 minutes. Every minute, you calculate the median and various percentiles over the values in that window and plot those metrics on a graph.

The naïve implementation is to keep a list of response times for all requests within the time window and to sort that list every minute. If that is too inefficient for you, there are algorithms that can calculate a good approximation of percentiles at minimal CPU and memory cost, such as forward decay <25>, t-digest <26>, or HdrHistogram <27>. Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless—the right way of aggregating response time data is to add the histograms <28>.
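The naïve rolling-window approach described here can be sketched in a few lines (an illustrative in-process tracker with an explicit clock for testability, not production monitoring code):

```python
import time
from collections import deque

class RollingPercentiles:
    """Naïve percentile tracker: keep every response time from the
    last `window_secs` seconds and sort on demand. Fine for modest
    request rates; switch to t-digest or HdrHistogram when the
    per-snapshot sort becomes too expensive."""

    def __init__(self, window_secs=600):
        self.window_secs = window_secs
        self.samples = deque()  # (timestamp, response_time_ms)

    def record(self, response_time_ms, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, response_time_ms))

    def snapshot(self, percentiles=(50, 95, 99), now=None):
        now = time.monotonic() if now is None else now
        # Drop samples that have aged out of the rolling window.
        while self.samples and self.samples[0][0] < now - self.window_secs:
            self.samples.popleft()
        values = sorted(t for _, t in self.samples)
        if not values:
            return {}
        return {p: values[min(len(values) * p // 100, len(values) - 1)]
                for p in percentiles}

tracker = RollingPercentiles()
for i, ms in enumerate([12, 15, 11, 300, 14, 16, 13, 900, 12, 15]):
    tracker.record(ms, now=i)
print(tracker.snapshot(now=10))  # {50: 15, 95: 900, 99: 900}
```

Note that per the warning above, you would aggregate several such trackers by merging their raw samples (or histograms), never by averaging their percentile outputs.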

Figure 1-5. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.

Approaches for Coping with Load

Now that we have discussed the parameters for describing load and metrics for measuring performance, we can start discussing scalability in earnest: how do we maintain good performance even when our load parameters increase by some amount?

An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase—or perhaps even more often than that.

People often talk of a dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture. A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can’t avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.

Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises (see “Rebalancing Partitions”).
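As a minimal illustration of the elastic approach, here is a hypothetical threshold-based scaling rule. The function name and thresholds are invented for this sketch; real autoscalers also add cooldown periods and smoothing so that a brief spike doesn’t cause the machine count to flap up and down:

```python
def desired_machines(current, avg_cpu, scale_up_at=0.8, scale_down_at=0.3,
                     min_machines=1, max_machines=100):
    """Hypothetical autoscaling rule: add a machine when average CPU
    utilization is high, remove one when it is low, otherwise hold steady."""
    if avg_cpu > scale_up_at and current < max_machines:
        return current + 1
    if avg_cpu < scale_down_at and current > min_machines:
        return current - 1
    return current

print(desired_machines(4, 0.9))  # 5: load spike, scale out
print(desired_machines(4, 0.1))  # 3: load dropped, scale in
print(desired_machines(4, 0.5))  # 4: within the comfortable band
```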

While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed.

As the tools and abstractions for distributed systems get better, this common wisdom may change, at least for some kinds of applications. It is conceivable that distributed data systems will become the default in the future, even for use cases that don’t handle large volumes of data or traffic. Over the course of the rest of this book we will cover many kinds of distributed data systems, and discuss how they fare not just in terms of scalability, but also ease of use and maintainability.

The architecture of systems that operate at large scale is usually highly specific to the application: there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.

For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size, even though the two systems have the same data throughput.
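The claim that these two workloads have the same data throughput is easy to check with back-of-the-envelope arithmetic (using decimal units, 1 kB = 1,000 bytes and 1 GB = 10⁹ bytes):

```python
# System A: 100,000 requests/second, each 1 kB
bytes_per_sec_a = 100_000 * 1_000

# System B: 3 requests/minute, each 2 GB
bytes_per_sec_b = 3 * 2 * 10**9 / 60

print(bytes_per_sec_a)  # 100000000 -> 100 MB/s
print(bytes_per_sec_b)  # 100000000.0 -> also 100 MB/s
```

Both systems move 100 MB/s, yet one is dominated by per-request overhead and the other by bulk data transfer, so their architectures differ completely.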

An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare: the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.

Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.


Maintainability

It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance: fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.

Yet, unfortunately, many people working on software systems dislike maintenance of so-called legacy systems: perhaps it involves fixing other people’s mistakes, or working with platforms that are now outdated, or systems that were forced to do things they were never intended for. Every legacy system is unpleasant in its own way, and so it is difficult to give general recommendations for dealing with them.

However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:


Make it easy for operations teams to keep the system running smoothly.


Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)


Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.

As previously with reliability and scalability, there are no easy solutions for achieving these goals. Rather, we will try to think about systems with operability, simplicity, and evolvability in mind.

Operability: Making Life Easy for Operations

It has been said that “good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations” <12>. While some aspects of operations can and should be automated, it is still up to humans to set up that automation in the first place and to make sure it’s working correctly.

Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more <29>:

Tracking down the cause of problems, such as system failures or degraded performance

Keeping software and platforms up to date, including security patches

Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage

Anticipating future problems and solving them before they occur (e.g., capacity planning)

Establishing good practices and tools for deployment, configuration management, and more

Performing complex maintenance tasks, such as moving an application from one platform to another

Maintaining the security of the system as configuration changes are made

Defining processes that make operations predictable and help keep the production environment stable

Preserving the organization’s knowledge about the system, even as individual people come and go

Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:

Providing visibility into the runtime behavior and internals of the system, with good monitoring

Providing good support for automation and integration with standard tools

Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)

Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)

Providing good default behavior, but also giving administrators the freedom to override defaults when needed

Self-healing where appropriate, but also giving administrators manual control over the system state when needed

Exhibiting predictable behavior, minimizing surprises

Simplicity: Managing Complexity

Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a big ball of mud <30>.

There are various possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more. Much has been said on this topic already <31, 32, 33>.

When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.

Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks <32> define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.

One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.

For example, high-level programming languages are abstractions that hide machine code, CPU registers, and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. Of course, when programming in a high-level language, we are still using machine code; we are just not using it directly, because the programming language abstraction saves us from having to think about it.
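To make the SQL example concrete, here is a short sketch using Python’s built-in sqlite3 module (the table and data are invented for illustration). The declarative query states what result we want, while the database engine decides how to scan, group, and sort:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (endpoint TEXT, ms REAL)")
conn.executemany("INSERT INTO requests VALUES (?, ?)",
                 [("/home", 12.0), ("/home", 250.0), ("/search", 90.0)])

# We describe the result we want; the on-disk layout, access path,
# and aggregation algorithm are hidden behind the abstraction.
rows = conn.execute(
    "SELECT endpoint, AVG(ms) FROM requests GROUP BY endpoint ORDER BY endpoint"
).fetchall()
print(rows)  # [('/home', 131.0), ('/search', 90.0)]
```

If SQLite later changes its storage format or query planner, this code keeps working unchanged: that is the abstraction doing its job.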

However, finding good abstractions is very hard. In the field of distributed systems, although there are many good algorithms, it is much less clear how we should be packaging them into abstractions that help us keep the complexity of the system at a manageable level.

Throughout this book, we will keep our eyes open for good abstractions that allow us to extract parts of a large system into well-defined, reusable components.

Evolvability: Making Change Easy

It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc.

In terms of organizational processes, Agile working patterns provide a framework for adapting to change. The Agile community has also developed technical tools and patterns that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring.

Most discussions of these Agile techniques focus on a fairly small, local scale (a couple of source code files within the same application). In this book, we search for ways of increasing agility on the level of a larger data system, perhaps consisting of several different applications or services with different characteristics. For example, how would you “refactor” Twitter’s architecture for assembling home timelines (“Describing Load”) from approach 1 to approach 2?

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability <34>.


Summary

In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail.

An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail.

Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.

Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.

Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.

There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.

Later in the book, in Part III, we will look at patterns for systems that consist of several components working together, such as the one in Figure 1-1.


i Defined in “Approaches for Coping with Load”.

ii A term borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate’s output. The output needs to supply enough current to drive all the attached inputs. In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.

iii In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput. In practice, the running time is often longer, due to skew (data not being spread evenly across worker processes) and needing to wait for the slowest task to complete.


<1> Michael Stonebraker and Uğur Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” at 21st International Conference on Data Engineering (ICDE), April 2005.

<2> Walter L. Heimerdinger and Charles B. Weinstock: “A Conceptual Framework for System Fault Tolerance,” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.

<3> Ding Yuan, Yu Luo, Xin Zhuang, et al.: “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.

<4> Yury Izrailevsky and Ariel Tseitlin: “The Netflix Simian Army,” techblog.netflix.com, July 19, 2011.

<5> Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “Availability in Globally Distributed Storage Systems,” at 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2010.

<6> Brian Beach: “Hard Drive Reliability Update – Sep 2014,” backblaze.com, September 23, 2014.

<7> Laurie Voss: “AWS: The Good, the Bad and the Ugly,” blog.awe.sm, December 18, 2012.

<9> Nelson Minar: “Leap Second Crashes Half the Internet,” somebits.com, July 3, 2012.

<10> Amazon Web Services: “Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region,” aws.amazon.com, April 29, 2011.

<11> Richard I. Cook: “How Complex Systems Fail,” Cognitive Technologies Laboratory, April 2000.

<12> Jay Kreps: “Getting Real About Distributed System Reliability,” blog.empathybox.com, March 19, 2012.

<13> David Oppenheimer, Archana Ganapathi, and David A. Patterson: “Why Do Internet Services Fail, and What Can Be Done About It?,” at 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.

<14> Nathan Marz: “Principles of Software Engineering, Part 1,” nathanmarz.com, April 2, 2013.

<15> Michael Jurewitz: “The Human Impact of Bugs,” jury.me, March 15, 2013.

<16> Raffi Krikorian: “Timelines at Scale,” at QCon San Francisco, November 2012.

<18> Kelly Sommers: “After all that runaround, what caused 500ms disk latency even when we replaced physical server?” twitter.com, November 13, 2014.

<19> Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “Dynamo: Amazon’s Highly Available Key-Value Store,” at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.

<20> Greg Linden: “Make Data Useful,” slides from presentation at Stanford University Data Mining class (CS345), December 2006.

<21> Tammy Everts: “The Real Cost of Slow Time vs Downtime,” webperformancetoday.com, November 12, 2014.

<22> Jake Brutlag: “Speed Matters for Google Web Search,” googleresearch.blogspot.co.uk, June 22, 2009.

<23> Tyler Treat: “Everything You Know About Latency Is Wrong,” bravenewgeek.com, December 12, 2015.

<25> Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “Forward Decay: A Practical Time Decay Model for Streaming Systems,” at 25th IEEE International Conference on Data Engineering (ICDE), March 2009.

<26> Ted Dunning and Otmar Ertl: “Computing Extremely Accurate Quantiles Using t-Digests,” github.com, March 2014.

<27> Gil Tene: “HdrHistogram,” hdrhistogram.org.

<28> Baron Schwartz: “Why Percentiles Don’t Work the Way You Think,” vividcortex.com, December 7, 2015.

<29> James Hamilton: “On Designing and Deploying Internet-Scale Services,” at 21st Large Installation System Administration Conference (LISA), November 2007.

<30> Brian Foote and Joseph Yoder: “Big Ball of Mud,” at 4th Conference on Pattern Languages of Programs (PLoP), September 1997.

<32> Ben Moseley and Peter Marks: “Out of the Tar Pit,” at BCS Software Practice Advancement (SPA), 2006.


<33> Rich Hickey: “Simple Made Easy,” at Strange Loop, September 2011.

<34> Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “Analyzing Software Evolvability,” at 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC), July 2008. doi:10.1109/COMPSAC.2008.50