Designing Data-Intensive Applications (4) - Data Encoding and Evolution

These are my personal learning notes from reading Designing Data-Intensive Applications, Chapter 4.

Why do we need data encoding?

Every application is built to work on some kind of data: the input, the output, and the intermediate state during processing are all data. Depending on usage, format, and storage type, that data can take different representations.

For a system, there are usually two different representations of data:

  • Data in memory: usually optimized for efficient access and computation by the CPU (objects, structs, lists, hash tables, and so on).
  • Data in persistent storage or on the network: usually a self-contained byte sequence, where efficiency and security during transfer matter.

You will find that a system needs to translate between these two representations frequently. For example, suppose you create a Person object to represent information about a person: you may need to send it over HTTP to a client, store it in a database, save it in a cache, and so on. Can you still use the same in-memory object/class representation? Usually not. This is where data encoding comes in.

Encoding (also called serialization or marshalling) is the translation from the in-memory representation to a byte sequence.

The reverse process is called decoding (also parsing, deserialization, or unmarshalling).
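
As a quick illustration, here is a minimal sketch in Python (using JSON only because it is built in):

```python
import json

person = {"name": "Martin", "age": 42}         # in-memory representation
encoded = json.dumps(person).encode("utf-8")   # encoding: object -> byte sequence
decoded = json.loads(encoded.decode("utf-8"))  # decoding: byte sequence -> object
assert decoded == person
```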

Language-Specific Formats

Since data encoding is such a common requirement, many languages provide built-in support:

  • Java: java.io.Serializable
  • Ruby: Marshal
  • Python: pickle
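
For instance, a minimal sketch with Python's built-in pickle module (the Person class here is just illustrative):

```python
import pickle

class Person:
    def __init__(self, name: str) -> None:
        self.name = name

blob = pickle.dumps(Person("Martin"))  # in-memory object -> Python-specific bytes
restored = pickle.loads(blob)          # bytes -> object; unpickling untrusted data is unsafe
print(restored.name)                   # -> Martin
```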

Advantages

  • Convenient: in-memory objects can be saved and restored with minimal additional code.

Disadvantages

  • Tied to one language: other languages usually cannot read the data, which makes it a poor fit for data interchange.
  • Often inefficient: CPU time and encoded size tend to be an afterthought (Java's built-in serialization is notorious for this).
  • Lacking features: little or no support for schemas, versioning, or forward/backward compatibility.
  • Security concerns: with no schema, the decoder must be able to instantiate arbitrary classes, which attackers can exploit to execute arbitrary code during decoding.

JSON, XML, CSV

Advantages

  • Human-readable
  • Widely known and supported across languages

Disadvantages

  • Ambiguity around data types: JSON distinguishes strings from numbers but not integers from floats, and large numbers can lose precision; CSV cannot even distinguish a number from a string that happens to consist of digits (see the snippet after this list).
  • Weak schema support: the optional schema languages for XML and JSON are complex, and CSV has no schema at all.
  • Somewhat verbose and complex, XML especially.
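
A small illustration of the number-precision pitfall (a sketch in Python; Python itself keeps integers exact, but a parser that maps every JSON number to an IEEE 754 double, as JavaScript does, would not):

```python
import json

n = 2**53 + 1                          # larger than a double can represent exactly
assert json.loads(json.dumps(n)) == n  # Python round-trips it losslessly...
assert float(n) != n                   # ...but a double-based JSON parser would
                                       # silently decode it as 2**53
```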

Despite these flaws, JSON, XML, and CSV are still widely used in applications today, mostly because they are easy to use and human-readable.

Binary encoding

Data-intensive applications often need to transfer large amounts of data between processes, and that data does not always need to be human-readable. In such cases JSON/XML/CSV can become an efficiency bottleneck, and a binary encoding can help.

A binary encoding transforms data into a byte sequence in a compact way.

  • MessagePack: a binary encoding of the JSON data model. It saves some space compared to textual JSON, but not much, because all the field names still have to be included in the encoded bytes (see the first sketch after this list).

  • Thrift: requires a defined schema, so it can save more space than MessagePack by replacing field names with small numeric field tags. It relies on code generation from the schema, which enables static type checking at compile time.

  • Protocol Buffers: very similar to Thrift at its core; it also uses a schema, field tags, and code generation (see the second sketch after this list).

  • Avro: also schema-based, but it goes further than Thrift by packing the values together with no field tags or type annotations at all, so it is even more compact. Because the encoding is nothing but values, the reader needs the writer's schema to decode the data; in exchange, Avro is friendlier to dynamically generated schemas, since no tag numbers have to be assigned by hand (see the third sketch after this list).
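
To make the field-name overhead concrete, here is a sketch using the third-party msgpack package (pip install msgpack); the record is borrowed from the book's running example:

```python
import json
import msgpack  # third-party: pip install msgpack

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)
# Both encodings still contain the strings "userName", "favoriteNumber",
# "interests", so the saving is modest.
print(len(as_json), len(as_msgpack))
```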
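And a hand-rolled sketch of what a field tag buys you in Thrift/Protocol Buffers. This follows the Protocol Buffers wire format for a single string field (assuming the tag and length each fit in one varint byte); real code would of course use the generated classes instead:

```python
def encode_string_field(field_number: int, value: str) -> bytes:
    # Wire format: key byte = (field_number << 3) | wire_type,
    # where wire type 2 means "length-delimited".
    key = (field_number << 3) | 2
    payload = value.encode("utf-8")
    return bytes([key, len(payload)]) + payload

# Tag 1 plus one length byte replaces the whole field-name string:
print(encode_string_field(1, "Martin").hex())  # -> 0a064d617274696e
```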
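Finally, a sketch of Avro's tag-less encoding using the third-party fastavro package (pip install fastavro); note that the output contains only the values:

```python
import io
import fastavro  # third-party: pip install fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": "long"},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, {"userName": "Martin", "favoriteNumber": 1337})
# Just a varint-prefixed string and a varint long -- no names, no tags:
print(buf.getvalue().hex())
```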

Schema evolution

This topic is about how to change a schema over time while maintaining forward and backward compatibility. Backward compatibility means newer code can read data written by older code; forward compatibility means older code can read data written by newer code.

For Thrift and Protocol Buffers:

  • When adding a new field, assign it a new tag number and make it optional (or give it a default value), so that old data without the field can still be read.
  • When removing a field, you can only remove an optional field, and you must never reuse its tag number.
  • Changing the data type of an existing field may be possible, depending on the types the specific technology supports, but check the precision and truncation rules (e.g., an old 32-bit reader will truncate a value written as a 64-bit integer).

For Avro:

  • The key idea with Avro is that the writer's schema and the reader's schema don't have to be the same; they only need to be compatible. The Avro library resolves the two by matching fields by name, ignoring fields that appear only in the writer's schema, and filling in declared defaults for fields that appear only in the reader's schema. A sketch follows this list.
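
A sketch of that resolution with fastavro (assumed package, as above): data written with an old schema is read with a newer schema that adds a field, which therefore must carry a default:

```python
import io
import fastavro  # third-party: pip install fastavro

writer_schema = fastavro.parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
})
reader_schema = fastavro.parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        # New field: needs a default so old records can still be read.
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"userName": "Martin"})
buf.seek(0)
print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))
# -> {'userName': 'Martin', 'favoriteNumber': None}
```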

Scenarios where data encoding and schema evolution are involved

  • Designing a database: data outlives code, so records written long ago must stay readable.
  • Designing an API contract (REST, RPC): API versioning; clients and servers are upgraded at different times.
  • Message passing: messages published and consumed through Kafka, Pulsar, RabbitMQ; the actor model (Akka).