Many applications today are data-intensive, as opposed to compute-intensive: CPU power is rarely the limiting factor. The bigger problems are usually the amount of data, the complexity of the data, and the speed at which it is changing.
A data-intensive application is typically built from standard building blocks that provide commonly needed functionality.
Three concerns are important in most software systems:
- Reliability: The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
- Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
- Maintainability: Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
Reliability
A fault is not the same as a failure.
A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
By deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally.
Hardware Faults
Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years.
Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
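As a quick back-of-the-envelope check of that claim (a minimal sketch; the 10,000-disk cluster size and the MTTF range come from the text above, and failures are assumed to be independent):

```python
# Back-of-the-envelope: expected daily disk failures on a cluster,
# assuming independent failures at a rate of roughly disks / MTTF.

def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected number of disk failures per day for a given fleet size and MTTF."""
    return num_disks / (mttf_years * 365)

if __name__ == "__main__":
    for mttf in (10, 50):
        rate = expected_failures_per_day(10_000, mttf)
        print(f"MTTF {mttf} years: ~{rate:.1f} failures/day")
    # Roughly 2.7 failures/day at 10 years and 0.5/day at 50 years,
    # i.e. on the order of one disk death per day, as stated above.
```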
Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.
Until recently, redundancy of hardware components was sufficient for most applications, so multi-machine redundancy was only required by a few applications for which high availability was absolutely essential.
Software Errors
Another class of fault is a systematic error within the system.
Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults.
The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances.
There is no quick solution to the problem of systematic faults in software.
Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found.
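A minimal sketch of that kind of runtime self-check (the queue wrapper and alert callback here are hypothetical, not from the text):

```python
# A queue wrapper that counts incoming and outgoing messages and raises an
# alert if the totals stop matching the number of messages still queued.
from collections import deque


class AuditedQueue:
    def __init__(self, alert):
        self._queue = deque()
        self._enqueued = 0
        self._dequeued = 0
        self._alert = alert  # callback invoked on a detected discrepancy

    def put(self, message):
        self._queue.append(message)
        self._enqueued += 1
        self._check()

    def get(self):
        message = self._queue.popleft()
        self._dequeued += 1
        self._check()
        return message

    def _check(self):
        # Invariant: messages in = messages out + messages still queued.
        if self._enqueued != self._dequeued + len(self._queue):
            self._alert(f"count mismatch: in={self._enqueued}, "
                        f"out={self._dequeued}, queued={len(self._queue)}")


if __name__ == "__main__":
    q = AuditedQueue(alert=print)
    q.put("hello")
    assert q.get() == "hello"
```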
Human Errors
Humans design and build software systems, and the operators who keep the systems running are also human.
Even when they have the best intentions, humans are known to be unreliable.
Approaches for preventing or mitigating human errors:
- Design systems in a way that minimizes opportunities for error.
- Decouple the places where people make the most mistakes from the places where they can cause failures.
- Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests.
- Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure.
- Set up detailed and clear monitoring, such as performance metrics and error rates.
- Implement good management practices and training (a complex and important aspect, and beyond the scope of this book).
Scalability
Scalability is the term we use to describe a system’s ability to cope with increased load.
Discussing scalability means considering questions like “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”
Describing Performance
Once you have described the load on your system, you can investigate what happens when the load increases.
In batch processing systems, we usually care about throughput: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. In online systems, what usually matters more is the service’s response time.
Note
Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service.
Reducing response times at very high percentiles is difficult because they are easily affected by random events outside your control, and the benefits are diminishing.
Queueing delays often account for a large part of the response time at high percentiles.
As a server can only process a few things in parallel (limited, for example, by its number of CPU cores), it only takes a few slow requests to hold up the processing of subsequent requests (an effect sometimes known as head-of-line blocking).
Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.
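A minimal sketch of measuring response time percentiles from client-side measurements (the sample data below is made up for illustration):

```python
# Compute response time percentiles from client-side measurements.
# The median (p50) describes the typical case; p95/p99/p99.9 describe the tail.
import random


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already-sorted list (0 < p <= 100)."""
    k = max(0, int(len(sorted_values) * p / 100) - 1)
    return sorted_values[k]


if __name__ == "__main__":
    # Fake measurements: mostly fast requests with occasional slow outliers.
    random.seed(0)
    samples = sorted(random.expovariate(1 / 50) for _ in range(10_000))  # ms
    for p in (50, 95, 99, 99.9):
        print(f"p{p}: {percentile(samples, p):.0f} ms")
```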
The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce).
The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare: the load parameters.
If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive.
In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
Maintainability
Every legacy system is unpleasant in its own way, and so it is difficult to give general recommendations for dealing with them.
However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:
Operability
Make it easy for operations teams to keep the system running smoothly.
Simplicity
Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)
Evolvability
Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.
Operability: Making Life Easy for Operations
Good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations.
A good operations team typically is responsible for the following, and more:
- Monitoring the health of the system and quickly restoring service if it goes into a bad state
- Tracking down the cause of problems, such as system failures or degraded performance
- Keeping software and platforms up to date, including security patches
- Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
- Anticipating future problems and solving them before they occur (e.g., capacity planning)
- Establishing good practices and tools for deployment, configuration management, and more
- Performing complex maintenance tasks, such as moving an application from one platform to another
- Maintaining the security of the system as configuration changes are made
- Defining processes that make operations predictable and help keep the production environment stable
- Preserving the organization’s knowledge about the system, even as individual people come and go
Simplicity: Managing Complexity
A software project mired in complexity is sometimes described as a big ball of mud.
There are various possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more.
Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.
One of the best tools we have for removing accidental complexity is abstraction.
A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand facade.
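As a small illustration of the idea (a hypothetical sketch, not an example from the text), a caller can work against a narrow facade without knowing how the data is actually stored:

```python
# A simple facade hiding implementation detail behind a narrow interface.
# Callers only see put/get; the encoding and file handling stay hidden and
# can change without touching callers.
import json
from pathlib import Path


class KeyValueStore:
    """Clean interface: set and get string keys. Storage details are hidden."""

    def __init__(self, path: str):
        self._path = Path(path)
        self._data = json.loads(self._path.read_text()) if self._path.exists() else {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._path.write_text(json.dumps(self._data))  # naive full rewrite

    def get(self, key: str):
        return self._data.get(key)


if __name__ == "__main__":
    store = KeyValueStore("store.json")
    store.put("greeting", "hello")
    print(store.get("greeting"))
```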
Evolvability: Making Change Easy
In terms of organizational processes, Agile working patterns provide a framework for adapting to change.
The Agile community has also developed technical tools and patterns that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring.
Summary
Reliability
means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.
Scalability
means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.
Maintainability
has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.
There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.
CHAPTER 2: Data Models and Query Languages
Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.
Because a data model has such a profound effect on what the software above it can and can’t do, it’s important to choose one that is appropriate to the application.
The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).
The goal of the relational model was to hide that implementation detail behind a cleaner interface.
As computers became vastly more powerful and networked, they started being used for increasingly diverse purposes.
And remarkably, relational databases turned out to generalize very well, beyond their original scope of business data processing, to a broad variety of use cases.
Much of what you see on the web today is still powered by relational databases, be it online publishing, discussion, social networking, ecommerce, games, software-as-a-service productivity applications, or much more.
Different applications have different requirements, and the best choice of technology for one use case may well be different from the best choice for another use case. It therefore seems likely that in the foreseeable future, relational databases will continue to be used alongside a broad variety of nonrelational datastores, an idea that is sometimes called polyglot persistence.
Most application development today is done in object-oriented languages, which creates an awkward translation layer between the objects in the application code and the database model of tables, rows, and columns (the impedance mismatch). Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they can’t completely hide the differences between the two models.
The JSON representation has better locality than the multi-table schema. If you want to fetch a profile in the relational example, you need to either perform multiple queries (query each table by user_id) or perform a messy multi-way join between the users table and its subordinate tables.
In the JSON representation, all the relevant information is in one place, and one query is sufficient.
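A minimal sketch of the difference (the schema, the user_id value, and the field names below are made-up placeholders for illustration):

```python
# Fetching a profile: multiple queries over a normalized schema versus a
# single lookup of a self-contained document.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE positions (user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (251, 'Alice');
    INSERT INTO positions VALUES (251, 'Engineer'), (251, 'Manager');
""")

# Relational: one query per table (or a multi-way join).
user = conn.execute("SELECT id, name FROM users WHERE id = ?", (251,)).fetchone()
positions = conn.execute("SELECT title FROM positions WHERE user_id = ?", (251,)).fetchall()

# Document: the whole profile is one self-contained record; one lookup suffices.
documents = {251: {"name": "Alice", "positions": ["Engineer", "Manager"]}}
profile = documents[251]

print(user, positions)
print(profile)
```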
Document databases reverted back to the hierarchical model in one aspect: storing nested records (one-to-many relationships, like positions, education, and contact_info in Figure 2-1) within their parent record rather than in a separate table.
What the relational model did, by contrast, was to lay out all the data in the open: a relation (table) is simply a collection of tuples (rows), and that’s it.
However, when it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model. That identifier is resolved at read time by using a join or follow-up queries.
Relational Versus Document Databases
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
Which data model leads to simpler application code?
If the data in your application has a document-like structure (i.e., a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to use a document model. The relational technique of shredding (splitting a document-like structure into multiple tables) can lead to cumbersome schemas and unnecessarily complicated application code.
The document model has limitations: for example, you cannot refer directly to a nested item within a document, but instead you need to say something like “the second item in the list of positions for user 251” (much like an access path in the hierarchical model). However, as long as documents are not too deeply nested, that is not usually a problem.
The poor support for joins in document databases may or may not be a problem, depending on the application.
However, if your application does use many-to-many relationships, the document model becomes less appealing. It’s possible to reduce the need for joins by denormalizing, but then the application code needs to do additional work to keep the denormalized data consistent.
Joins can be emulated in application code by making multiple requests to the database, but that also moves complexity into the application and is usually slower than a join performed by specialized code inside the database.
In such cases, using a document model can lead to significantly more complex application code and worse performance.
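A minimal sketch of emulating such a join in application code (the comment/user data and ids here are made up for illustration):

```python
# Emulating a join in application code: first fetch the documents, then issue
# follow-up requests to resolve the references they contain.
comments = [  # documents referencing users by id (a many-to-one relationship)
    {"text": "Great post!", "author_id": 1},
    {"text": "Thanks for sharing.", "author_id": 2},
]
users = {1: {"name": "Alice"}, 2: {"name": "Bob"}}  # stands in for a users table


def lookup_user(user_id):
    # In a real system this would be another database request per reference
    # (or one batched request), rather than a join performed inside the database.
    return users[user_id]


for comment in comments:
    author = lookup_user(comment["author_id"])
    print(f'{author["name"]}: {comment["text"]}')
```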
NoSQL
There are several driving forces behind the adoption of NoSQL databases, including:
- A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
- A widespread preference for free and open source software over commercial database products
- Specialized query operations that are not well supported by the relational model
- Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model
Schema flexibility in the document model
No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.
A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write.
Schema changes have a bad reputation of being slow and requiring downtime. This reputation is not entirely deserved: most relational database systems execute the ALTER TABLE statement in a few milliseconds. MySQL is a notable exception: it copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime when altering a large table.
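A minimal sketch of the schema-on-read approach (the name/first_name split is a hypothetical migration, handled in application code when old documents are read):

```python
# Schema-on-read: old documents stored only a full "name"; newer code wants
# "first_name". Instead of migrating all records up front (schema-on-write,
# e.g. ALTER TABLE + UPDATE in a relational database), the application fills
# in the missing field whenever an old document is read.

def read_user(document: dict) -> dict:
    if "first_name" not in document and "name" in document:
        document["first_name"] = document["name"].split(" ")[0]
    return document


if __name__ == "__main__":
    old_doc = {"user_id": 251, "name": "Ada Lovelace"}                      # written before the change
    new_doc = {"user_id": 252, "name": "Alan Turing", "first_name": "Alan"}  # already has the new field
    print(read_user(old_doc))   # first_name derived at read time
    print(read_user(new_doc))   # returned unchanged
```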
In situations where the items in a collection don’t all have the same structure, a schema may hurt more than it helps, and schemaless documents can be a much more natural data model. But in cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure.
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB’s BSON).
If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality.
If data is split across multiple tables, multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.
The locality advantage only applies if you need large parts of the document at the same time. The database typically needs to load the entire document, even if you access only a small portion of it, which can be wasteful on large documents.
On updates to a document, the entire document usually needs to be rewritten—only modifications that don’t change the encoded size of a document can easily be performed in place.
It’s worth pointing out that the idea of grouping related data together for locality is not limited to the document model.
Relational and document databases seem to be becoming more similar over time, and that is a good thing: the data models complement each other. If a database is able to handle document-like data and also perform relational queries on it, applications can use the combination of features that best fits their needs.
Query Languages for Data
A declarative query language is attractive because it is typically more concise and easier to work with than an imperative API.
But more importantly, it also hides implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
Declarative languages have a better chance of getting faster in parallel execution because they specify only the pattern of the results, not the algorithm that is used to determine the results. The database is free to use a parallel implementation of the query language.
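A minimal sketch of the contrast (the animals/family data is made up; the SQL is standard, but the schema is a placeholder for illustration):

```python
# Declarative versus imperative: the SQL query states only *what* data is
# wanted; the loop below spells out *how* to find it, fixing an algorithm
# and an ordering that the database could otherwise choose freely.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)",
                 [("Great white", "Sharks"), ("Hammerhead", "Sharks"), ("Orca", "Dolphins")])

# Declarative: the query optimizer decides how to execute this (it could use
# an index or a parallel scan without any change to the query text).
sharks_sql = conn.execute("SELECT name FROM animals WHERE family = 'Sharks'").fetchall()

# Imperative: we dictate the exact algorithm, one row at a time, in order.
sharks_loop = []
for name, family in conn.execute("SELECT name, family FROM animals"):
    if family == "Sharks":
        sharks_loop.append(name)

print(sharks_sql, sharks_loop)
```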
MapReduce Querying
MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google.
A limited form of MapReduce is supported by some NoSQL datastores, including MongoDB and CouchDB, as a mechanism for performing read-only queries across many documents.
MapReduce in general is described in more detail in Chapter 10. For now, we’ll just briefly discuss MongoDB’s use of the model.
MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between.
The map and reduce functions are somewhat restricted in what they are allowed to do.
They must be pure functions, which means they only use the data that is passed to them as input, they cannot perform additional database queries, and they must not have any side effects.
These restrictions allow the database to run the functions anywhere, in any order, and rerun them on failure.
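A minimal sketch of the programming model itself (a generic in-memory illustration in Python, not MongoDB's actual mapReduce API; the observation records are made up):

```python
# The MapReduce pattern: a pure map function emits (key, value) pairs for each
# record, the framework groups values by key, and a pure reduce function folds
# each group into a result. Neither function touches anything but its inputs,
# so the framework can run them anywhere, in any order, and retry on failure.
from collections import defaultdict

observations = [  # made-up records: shark sightings by date
    {"date": "1995-12-25", "family": "Sharks", "numAnimals": 3},
    {"date": "1995-12-12", "family": "Sharks", "numAnimals": 4},
    {"date": "1995-11-10", "family": "Sharks", "numAnimals": 1},
]


def map_fn(record):
    # Emit (year-month, count) for each observation.
    yield record["date"][:7], record["numAnimals"]


def reduce_fn(key, values):
    # Sum all counts emitted for the same month.
    return key, sum(values)


def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]


print(map_reduce(observations, map_fn, reduce_fn))
# [('1995-12', 7), ('1995-11', 1)]
```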
Summary
Historically, data started out being represented as one big tree (the hierarchical model), but that wasn’t good for representing many-to-many relationships, so the relational model was invented to solve that problem.
More recently, developers found that some applications don’t fit well in the relational model either. New nonrelational “NoSQL” datastores have diverged in two main directions:
- Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
- Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.
All three models (document, relational, and graph) are widely used today, and each is good in its respective domain. One model can be emulated in terms of another model (for example, graph data can be represented in a relational database), but the result is often awkward.
That’s why we have different systems for different purposes, not a single one-size-fits-all solution.
One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements.
However, your application most likely still assumes that data has a certain structure; it’s just a question of whether the schema is explicit (enforced on write) or implicit (handled on read).
Each data model comes with its own query language or framework, and we discussed several examples: SQL, MapReduce, MongoDB’s aggregation pipeline, Cypher, SPARQL, and Datalog.