- Formats for Encoding Data
- Modes of Dataflow
Applications inevitably change over time. Features are added or modified as new products are launched, user requirements become better understood, or business circumstances change.
When a data format or schema changes, a corresponding change to application code often needs to happen. However, in a large application, code changes often cannot happen instantaneously.
Backward compatibility: Newer code can read data that was written by older code.
Forward compatibility: Older code can read data that was written by newer code.
Backward compatibility is normally not hard to achieve: as author of the newer code, you know the format of data written by older code, and so you can explicitly handle it (if necessary by simply keeping the old code to read the old data).
Forward compatibility can be trickier, because it requires older code to ignore additions made by a newer version of the code.
Formats for Encoding Data
Programs usually work with data in (at least) two different representations:
In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).
When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t make sense to any other process, this sequence-of-bytes representation looks quite different from the data structures that are normally used in memory.
These encoding libraries are very convenient, because they allow in-memory objects to be saved and restored with minimal additional code.
However, they also have a number of deep problems:
The encoding is often tied to a particular programming language, and reading the data in another language is very difficult.
In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes.
Versioning data is often an afterthought in these libraries: as they are intended for quick and easy encoding of data, they often neglect the inconvenient problems of forward and backward compatibility.
Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought. For example, Java’s built-in serialization is notorious for its bad performance and bloated encoding.
JSON, XML, and Binary Variants
They are widely known, widely supported, and almost as widely disliked.
XML is often criticized for being too verbose and unnecessarily complicated.
CSV is another popular language-independent format, albeit less powerful.
JSON, XML, and CSV are textual formats, and thus somewhat human-readable (although the syntax is a popular topic of debate).
Besides the superficial syntactic issues, they also have some subtle problems:
There is a lot of ambiguity around the encoding of numbers. JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision.
JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings.
Binary strings are a useful feature, so people get around this limitation by encoding the binary data as text using Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded. This works, but it’s somewhat hacky and increases the data size by 33%.
CSV is also a quite vague format (what happens if a value contains a comma or a newline character?).
For data that is used only internally within your organization, there is less pressure to use a lowest-common-denominator encoding format.
For a small dataset, the gains are negligible, but once you get into the terabytes, the choice of data format can have a big impact.
JSON is less verbose than XML, but both still use a lot of space compared to binary formats.
Modes of Dataflow
Dataflow Through Databases
a value in the database may be written by a newer version of the code, and subsequently read by an older version of the code that is still running. Thus, forward compatibility is also often required for databases.
When an older version of the application updates data previously written by a newer version of the application, data may be lost if you’re not careful.
Subsequently, an older version of the code (which doesn’t yet know about the new field) reads the record, updates it, and writes it back.
In this situation, the desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t be interpreted.
Dataflow Through Services: REST and RPC
Although HTTP may be used as the transport protocol, the API implemented on top is application-specific, and the client and server need to agree on the details of that API.
In some ways, services are similar to databases: they typically allow clients to submit and query data.
However, while databases allow arbitrary queries using the query languages, services expose an application-specific API that only allows inputs and outputs that are predetermined by the business logic (application code) of the service.
This restriction provides a degree of encapsulation: services can impose fine-grained restrictions on what clients can and cannot do.
A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable.
For example, each service should be owned by one team, and that team should be able to release new versions of the service frequently, without having to coordinate with other teams.
In other words, we should expect old and new versions of servers and clients to be running at the same time, and so the data encoding used by servers and clients must be compatible across versions of the service API.
There are two popular approaches to web services: REST and SOAP.
They are almost diametrically opposed in terms of philosophy, and often the subject of heated debate among their respective proponents.
REST and SOAP
REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP.
It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation.
REST has been gaining popularity compared to SOAP, at least in the context of cross-organizational service integration, and is often associated with microservices.
An API designed according to the principles of REST is called RESTful.
By contrast, SOAP is an XML-based protocol for making network API requests.
Although it is most commonly used over HTTP, it aims to be independent from HTTP and avoids using most HTTP features.
Instead, it comes with a sprawling and complex multitude of related standards (the web service framework, known as WS-*) that add various features.
although SOAP is still used in many large enterprises, it has fallen out of favor in most smaller companies.
RESTful APIs tend to favor simpler approaches, typically involving less code generation and automated tooling.
A definition format such as OpenAPI, also known as Swagger, can be used to describe RESTful APIs and produce documentation.
the idea of a remote procedure call (RPC), which has been around since the 1970s.
The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process (this abstraction is called location transparency).
Although RPC seems convenient at first, the approach is fundamentally flawed. A network request is very different from a local function call:
A local function call is predictable and either succeeds or fails, depending only on parameters that are under your control.
A network request is unpredictable: the request or response may be lost due to a network problem, or the remote machine may be slow or unavailable, and such problems are entirely outside of your control.
Network problems are common, so you have to anticipate them, for example by retrying a failed request.
A local function call either returns a result, or throws an exception, or never returns (because it goes into an infinite loop or the process crashes).
A network request has another possible outcome: it may return without a result, due to a timeout. In that case, you simply don’t know what happened
if you don’t get a response from the remote service, you have no way of knowing whether the request got through or not. (We discuss this issue in more detail in Chapter 8.)
If you retry a failed network request, it could happen that the requests are actually getting through, and only the responses are getting lost.
In that case, retrying will cause the action to be performed multiple times, unless you build a mechanism for deduplication (idempotence) into the protocol.
Local function calls don’t have this problem.
Every time you call a local function, it normally takes about the same time to execute.
A network request is much slower than a function call, and its latency is also wildly variable: at good times it may complete in less than a millisecond, but when the network is congested or the remote service is overloaded it may take many seconds to do exactly the same thing.
When you call a local function, you can efficiently pass it references (pointers) to objects in local memory. When you make a network request, all those parameters need to be encoded into a sequence of bytes that can be sent over the network.
That’s okay if the parameters are primitives like numbers or strings, but quickly becomes problematic with larger objects.
The client and the service may be implemented in different programming lan‐ guages, so the RPC framework must translate datatypes from one language into another. This can end up ugly, since not all languages have the same types.
Despite all these problems, RPC isn’t going away.
Various RPC frameworks have been built on top of all the encodings mentioned in this chapter: for example, Thrift and Avro come with RPC support included, gRPC is an RPC implementation using Protocol Buffers, Finagle also uses Thrift, and Rest.li uses JSON over HTTP.
This new generation of RPC frameworks is more explicit about the fact that a remote request is different from a local function call. Encapsulate asynchronous actions that may fail. Futures also simplify situations where you need to make requests to multiple services in parallel and combine their results.
Some of these frameworks also provide service discovery—that is, allowing a client to find out at which IP address and port number it can find a particular service.
Custom RPC protocols with a binary encoding format can achieve better performance than something generic like JSON over REST.
it is good for experimentation and debugging.
it is supported by all main‐stream programming languages and platforms, and there is a vast ecosystem of tools available.
For these reasons, REST seems to be the predominant style for public APIs. The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.
message broker (also called a message queue or message-oriented middleware), which stores the message temporarily.
Using a message broker has several advantages compared to direct RPC:
• It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
• It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
• It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).
• It allows one message to be sent to several recipients.
• It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).
Message brokers typically don’t enforce any particular data model—a message is just a sequence of bytes with some metadata, so you can use any encoding format.
During rolling upgrades, or for various other reasons, we must assume that different nodes are running the different versions of our application’s code.
Thus, it is important that all data flowing around the system is encoded in a way that provides backward compatibility (new code can read old data) and forward compatibility (old code can read new data).
Programming language–specific encodings are restricted to a single programming language and often fail to provide forward and backward compatibility.
Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you use them. They have optional schema languages, which are sometimes helpful and sometimes a hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things like numbers and binary strings.
Binary schema–driven formats like Thrift, Protocol Buffers, and Avro allow compact, efficient encoding with clearly defined forward and backward compatibility semantics. The schemas can be useful for documentation and code generation in statically typed languages. However, they have the downside that data needs to be decoded before it is human-readable.
several modes of dataflow, illustrating different scenarios in which data encodings are important:
• Databases, where the process writing to the database encodes the data and the process reading from the database decodes it
• RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response
• Asynchronous message passing (using message brokers or actors), where nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient