Designing Data-Intensive Applications (4) - Data Encoding and Evolution

These are my personal learning notes from reading Designing Data-Intensive Applications, Chapter 4.

Why do we need data encoding?

Every application is built to work on some kind of data: the input, the output, and the intermediate state during processing are all data. Depending on usage, format, and storage type, that data can take different representations.

For a system, there are usually two different representations of data:

  • Data in memory: usually optimized for efficient access and computation by the CPU (objects, structs, lists, hash tables, and so on).
  • Data in persistent storage or on the network: usually a self-contained byte sequence, where efficiency and security during transfer matter.

You will find that a system needs to translate between these two representations frequently. For example, suppose you create a Person object to represent information about a person: you may need to send it over HTTP to a client, store it in a database, save it in a cache, and so on. Can you still use the same in-memory object/class representation? Usually not. This is where data encoding comes in.

Encoding (also called serialization or marshalling) is the translation from the in-memory representation to a byte sequence.

The reverse process is called decoding (also parsing, deserialization, or unmarshalling).
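
As a quick illustration, here is a minimal sketch in Python (using JSON only because it is built in):

```python
import json

person = {"name": "Martin", "age": 42}         # in-memory representation
encoded = json.dumps(person).encode("utf-8")   # encoding: object -> byte sequence
decoded = json.loads(encoded.decode("utf-8"))  # decoding: byte sequence -> object
assert decoded == person
```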

Language-Specific Formats

Since data encoding is such a common requirement, many languages provide built-in support:

  • Java: java.io.Serializable
  • Ruby: Marshal
  • Python: pickle
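
For instance, a minimal sketch with Python's built-in pickle module (the Person class here is just illustrative):

```python
import pickle

class Person:
    def __init__(self, name: str) -> None:
        self.name = name

blob = pickle.dumps(Person("Martin"))  # in-memory object -> Python-specific bytes
restored = pickle.loads(blob)          # bytes -> object; unpickling untrusted data is unsafe
print(restored.name)                   # -> Martin
```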

Advantages

  • Convenient: in-memory objects can be saved and restored with minimal additional code.

Disadvantages

  • Tied to one language: other languages usually cannot read the data, which makes it a poor fit for data interchange.
  • Often inefficient: CPU time and encoded size tend to be an afterthought (Java's built-in serialization is notorious for this).
  • Lacking features: little or no support for schemas, versioning, or forward/backward compatibility.
  • Security concerns: with no schema, the decoder must be able to instantiate arbitrary classes, which attackers can exploit to execute arbitrary code during decoding.

JSON, XML, CSV

Advantages

  • Human-readable
  • Widely known and supported across languages

Disadvantages

  • Ambiguity around data types: JSON distinguishes strings from numbers but not integers from floats, and large numbers can lose precision; CSV cannot even distinguish a number from a string that happens to consist of digits (see the snippet after this list).
  • Weak schema support: the optional schema languages for XML and JSON are complex, and CSV has no schema at all.
  • Somewhat verbose and complex, XML especially.
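
A small illustration of the number-precision pitfall (a sketch in Python; Python itself keeps integers exact, but a parser that maps every JSON number to an IEEE 754 double, as JavaScript does, would not):

```python
import json

n = 2**53 + 1                          # larger than a double can represent exactly
assert json.loads(json.dumps(n)) == n  # Python round-trips it losslessly...
assert float(n) != n                   # ...but a double-based JSON parser would
                                       # silently decode it as 2**53
```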

Despite these flaws, JSON, XML, and CSV are still widely used in applications today, mostly because they are easy to use and human-readable.

Binary encoding

Data-intensive applications often need to transfer large amounts of data between processes, and that data does not always need to be human-readable. In such cases JSON/XML/CSV can become an efficiency bottleneck, and a binary encoding can help.

A binary encoding transforms data into a byte sequence in a compact way.

  • MessagePack: a binary encoding of the JSON data model. It saves some space compared to textual JSON, but not much, because all the field names still have to be included in the encoded bytes (see the first sketch after this list).

  • Thrift: requires a defined schema, so it can save more space than MessagePack by replacing field names with small numeric field tags. It relies on code generation from the schema, which enables static type checking at compile time.

  • Protocol Buffers: very similar to Thrift at its core; it also uses a schema, field tags, and code generation (see the second sketch after this list).

  • Avro: also schema-based, but it goes further than Thrift by packing the values together with no field tags or type annotations at all, so it is even more compact. Because the encoding is nothing but values, the reader needs the writer's schema to decode the data; in exchange, Avro is friendlier to dynamically generated schemas, since no tag numbers have to be assigned by hand (see the third sketch after this list).
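
To make the field-name overhead concrete, here is a sketch using the third-party msgpack package (pip install msgpack); the record is borrowed from the book's running example:

```python
import json
import msgpack  # third-party: pip install msgpack

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)
# Both encodings still contain the strings "userName", "favoriteNumber",
# "interests", so the saving is modest.
print(len(as_json), len(as_msgpack))
```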
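And a hand-rolled sketch of what a field tag buys you in Thrift/Protocol Buffers. This follows the Protocol Buffers wire format for a single string field (assuming the tag and length each fit in one varint byte); real code would of course use the generated classes instead:

```python
def encode_string_field(field_number: int, value: str) -> bytes:
    # Wire format: key byte = (field_number << 3) | wire_type,
    # where wire type 2 means "length-delimited".
    key = (field_number << 3) | 2
    payload = value.encode("utf-8")
    return bytes([key, len(payload)]) + payload

# Tag 1 plus one length byte replaces the whole field-name string:
print(encode_string_field(1, "Martin").hex())  # -> 0a064d617274696e
```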
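Finally, a sketch of Avro's tag-less encoding using the third-party fastavro package (pip install fastavro); note that the output contains only the values:

```python
import io
import fastavro  # third-party: pip install fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": "long"},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, {"userName": "Martin", "favoriteNumber": 1337})
# Just a varint-prefixed string and a varint long -- no names, no tags:
print(buf.getvalue().hex())
```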

Schema evolution

This topic is about how to change a schema over time while maintaining forward and backward compatibility. Backward compatibility means newer code can read data written by older code; forward compatibility means older code can read data written by newer code.

For Thrift and Protocol Buffers:

  • When adding a new field, assign it a new tag number and make it optional (or give it a default value), so that old data without the field can still be read.
  • When removing a field, you can only remove an optional field, and you must never reuse its tag number.
  • Changing the data type of an existing field may be possible, depending on the types the specific technology supports, but check the precision and truncation rules (e.g., an old 32-bit reader will truncate a value written as a 64-bit integer).

For Avro:

  • The key idea with Avro is that the writer's schema and the reader's schema don't have to be the same; they only need to be compatible. The Avro library resolves the two by matching fields by name, ignoring fields that appear only in the writer's schema, and filling in declared defaults for fields that appear only in the reader's schema. A sketch follows this list.
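
A sketch of that resolution with fastavro (assumed package, as above): data written with an old schema is read with a newer schema that adds a field, which therefore must carry a default:

```python
import io
import fastavro  # third-party: pip install fastavro

writer_schema = fastavro.parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
})
reader_schema = fastavro.parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        # New field: needs a default so old records can still be read.
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"userName": "Martin"})
buf.seek(0)
print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))
# -> {'userName': 'Martin', 'favoriteNumber': None}
```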

Scenarios where data encoding and schema evolution are involved

  • Designing a database: data outlives code, so records written long ago must stay readable.
  • Designing an API contract (REST, RPC): API versioning; clients and servers are upgraded at different times.
  • Message passing: messages published and consumed through Kafka, Pulsar, RabbitMQ; the actor model (Akka).