During the first days of learning distributed system design, we hear a lot of buzzwords and technologies, and we are busy learning them one after another.

Microservices architecture
CDN, caching, load balancers
Event Sourcing
Messaging
CAP
…

Needless to say, each of these topics is complex enough to take years to learn. What we expect is that when we are tasked with a system design problem, we will be able to come up with a robust overall architecture.

Well, this is important, but even after you have seen tons of architectures for different use cases, there is still no one-size-fits-all solution. Tough challenges always come up, and usually the issue that stops you is not the high-level design but some small problem instead.

Sticky Session

Now we have a lot of machines behind the load balancer that can serve the traffic. That means requests from the same user can be served by different nodes. Hmm, do I need to sign in again every time a node doesn’t have my session info?

Absolutely not. You may think of sticky sessions, which route the requests for a particular session to the same physical machine that serviced the first request for that session. But the problem is not totally solved yet: this method may still cause uneven load across the nodes, which is exactly what we wanted the distributed system to fix.

http://www.chaosincomputing.com/2012/05/sticky-sessions-are-evil/
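
To make the trade-off concrete, here is a minimal sketch of sticky routing, assuming the load balancer picks a node by hashing the session ID. The node names and the helpers `pick_node` and `load_per_node` are made up for illustration; real load balancers more often use cookies or consistent hashing.

```python
import hashlib
from collections import Counter

# Hypothetical backend nodes sitting behind the load balancer.
NODES = ["node-a", "node-b", "node-c"]


def pick_node(session_id: str, nodes=NODES) -> str:
    """Map a session ID to the same node every time (sticky routing)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def load_per_node(requests_per_session: dict) -> Counter:
    """Count how many requests each node receives under sticky routing."""
    load = Counter()
    for session_id, count in requests_per_session.items():
        load[pick_node(session_id)] += count
    return load


if __name__ == "__main__":
    # One very busy session and two quiet ones (invented numbers).
    traffic = {"alice": 100, "bob": 1, "carol": 1}
    for node, count in sorted(load_per_node(traffic).items()):
        print(node, count)
```

Whichever node the busy session lands on takes at least 100 of the 102 requests, and the balancer cannot spread them out without breaking stickiness. That is the uneven load described above.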

Distributed transactions

When we present a nice distributed architecture, we are confident that it supports high throughput with decent performance, and you may also add that it can handle traffic bursts by adding more machines, since it is distributed. This looks nice, but how do you handle failures in payment transactions?

I would be stuck here for a while. Yes, this is not easy. In a single-machine environment we can lock and wait, but if the system is distributed, how can we make sure the transaction is ACID?

I am not an innovator, but curiosity drives me to look for answers in two-phase commit and Paxos. There are tons of materials about those; here is one I read:
https://shekhargulati.com/2018/09/05/two-phase-commit-protocol/
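
As a rough illustration of the two-phase commit idea, here is a toy in-memory sketch. The `Participant` class and the `two_phase_commit` function are hypothetical names. It assumes the coordinator never crashes and every participant answers, which is exactly what real implementations cannot assume; they also need timeouts and durable logs.

```python
class Participant:
    """A toy resource manager, e.g. a payment or an order service."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        # Phase 2 (success): make the prepared work permanent.
        self.state = "committed"

    def rollback(self) -> None:
        # Phase 2 (failure): undo any prepared work.
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Commit everywhere only if every participant votes yes."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


if __name__ == "__main__":
    services = [Participant("payments"), Participant("orders", can_commit=False)]
    print(two_phase_commit(services))          # False: one "no" vote aborts all
    print([(p.name, p.state) for p in services])
```

The point of the sketch is the all-or-nothing rule: a single "no" vote in the first phase rolls back every participant, so the payment is never committed while the order is not.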

Distributed Tracing

It’s common for a single user request to trigger cascading service calls in the backend. For data analysis or debugging, how can we stitch those calls back to the same origin?

Using the same request ID across all the calls is a common strategy, but usually you will need another service or library dedicated to distributed tracing. There are some popular frameworks for it, like OpenTracing.
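
Before reaching for a framework, the request ID strategy can be sketched in a few lines. This is only an illustration: the header name `X-Request-ID` is a common convention rather than a standard, the two "services" are plain functions standing in for network calls, and all helper names are made up.

```python
import uuid

# Assumed header name; a widespread convention, not a formal standard.
REQUEST_ID_HEADER = "X-Request-ID"


def ensure_request_id(headers: dict) -> dict:
    """Return a copy of the headers, reusing an existing request ID if present."""
    out = dict(headers)
    if not out.get(REQUEST_ID_HEADER):
        out[REQUEST_ID_HEADER] = uuid.uuid4().hex
    return out


def log_line(service: str, headers: dict, message: str) -> str:
    """Format a log line that carries the request ID for later stitching."""
    return f"request_id={headers[REQUEST_ID_HEADER]} service={service} {message}"


def inventory_service(headers: dict) -> list:
    headers = ensure_request_id(headers)
    return [log_line("inventory", headers, "reserved item")]


def order_service(headers: dict) -> list:
    headers = ensure_request_id(headers)
    lines = [log_line("orders", headers, "order received")]
    # Forward the same headers so the downstream call shares the ID.
    lines += inventory_service(headers)
    return lines


if __name__ == "__main__":
    for line in order_service({}):
        print(line)
```

Both log lines carry the same `request_id`, so searching the logs for that value brings back the whole chain. A tracing library adds what this sketch lacks, such as timing and parent/child relationships between calls.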

Summary

A distributed system is good, but it also introduces more problems you need to solve.
Generating a fancy system architecture is easy; what’s more important is how you solve the problems it brings.