In the autumn of 2018 we started an epic project: a next-generation SSP system that would turn our hardcoded, fixed-order ad request model into a fully configurable one. The project promised a large boost in revenue and required a total revision of our code base.
Background
- We were in a heavily solution-oriented business. In this project, near-zero downtime was mandatory.
- We had a small dev team of two senior devs and one junior, who were maintaining several other system components at the same time.
- This project was the first priority until the end of the year. However, the roadmap for the next year was unpredictable.
Given this situation, we decided to follow the strangler fig pattern, which means gradually replacing the parts of a system, step by step, until the revision goal is reached.
Pros
This strategy gave us the following benefits:
- If carefully planned, we could do the revision with almost zero downtime.
- We could test and verify the system in small pieces with limited dev resources.
- We could potentially realize part of the project's value while development was still in progress.
- We could pause the project at certain points and pick it back up later.
Cons
There were also drawbacks:
- We needed to spend more energy on careful deployment and data migration.
- More temporary code meant a higher development time cost.
The Practice of Strangler Fig
People often talk about this pattern in the context of migrating a legacy system to a microservice architecture, but the concept can be generalized. In our case, it was more like an internal revision of a system made up of a few monoliths. It’s also helpful to be aware of the anti-corruption layer pattern.
The key idea is to start renovating the system at one node in the data flow, and then expand the territory piece by piece while maintaining a few anti-corruption layers on the boundary between the new and legacy parts. Note that since we do not limit this pattern to the context of microservices, each renovation step does not need to cover a whole component or service; it can be as small as a software layer inside an application. The same idea applies to anti-corruption layers: they can be as thin as a function call, rather than a whole standalone service.
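To make the "as thin as a function call" point concrete, here is a minimal sketch of an anti-corruption layer written as a single function. Everything in it is hypothetical and only loosely inspired by the fixed-order versus configurable idea from the introduction: the names `PlacementV1`, `PlacementV2`, `LEGACY_FIXED_ORDER`, and the network names are invented for illustration and are not our actual data model.

```python
from dataclasses import dataclass

# Hypothetical: the request order that the legacy code hardcodes.
LEGACY_FIXED_ORDER = ("network_a", "network_b", "network_c")


@dataclass(frozen=True)
class PlacementV1:
    """Legacy shape: the request order is implicit (hardcoded elsewhere)."""
    placement_id: str


@dataclass(frozen=True)
class PlacementV2:
    """New shape: the request order is explicit and configurable."""
    placement_id: str
    request_order: tuple[str, ...]


def to_v2(old: PlacementV1) -> PlacementV2:
    """Anti-corruption layer: the only code that knows both shapes."""
    return PlacementV2(old.placement_id, LEGACY_FIXED_ORDER)


def plan_requests_v2(placement: PlacementV2) -> list[str]:
    """Renovated logic; it never sees a PlacementV1."""
    return [f"{placement.placement_id}->{n}" for n in placement.request_order]


def plan_requests(old: PlacementV1) -> list[str]:
    """Boundary between legacy callers and the renovated layer."""
    return plan_requests_v2(to_v2(old))
```

Legacy callers keep calling `plan_requests` with the old shape. Once the upstream part is renovated and produces `PlacementV2` directly, `to_v2` and `plan_requests` can be deleted without touching `plan_requests_v2`.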
We followed these steps to establish the development plan:
- Lay out the data flow diagram of the system. It’s up to us to split the system into workable chunks of a proper size.
- Analyze each data flow dependency to identify the possible directions of expansion. In most cases, these are determined by how the data schemas change.
- Determine where to start, which mostly depends on the business value, business logic, and the data flow analysis.
- Determine steps to expand the territory and estimate the delivery schedule.
- Work on detailed application deployment and data migration plans for each step.
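As a rough illustration of the planning steps above, the data flow can be written down as a small graph and the expansion order derived from a chosen starting point. This is only a sketch under assumptions: the part names and edges in `DATA_FLOW` are made up, and `expansion_order` is a hypothetical helper that always walks upstream, which is just one possible expansion direction.

```python
from collections import deque

# Hypothetical data flow: each part sends data to the parts listed for it.
DATA_FLOW = {
    "editor": ["exporter"],
    "exporter": ["tables"],
    "tables": ["reader"],
    "reader": [],
}


def expansion_order(flow: dict[str, list[str]], start: str) -> list[str]:
    """List parts in renovation order, walking upstream from `start`."""
    # Invert the edges so we can look up who feeds each part.
    upstream: dict[str, list[str]] = {part: [] for part in flow}
    for source, targets in flow.items():
        for target in targets:
            upstream[target].append(source)

    order: list[str] = []
    seen = {start}
    queue = deque([start])
    while queue:
        part = queue.popleft()
        order.append(part)
        for parent in upstream[part]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


def not_covered(flow: dict[str, list[str]], start: str) -> list[str]:
    """Parts that the upstream walk never reaches and need a separate plan."""
    covered = set(expansion_order(flow, start))
    return [part for part in flow if part not in covered]
```

With this made-up graph, `expansion_order(DATA_FLOW, "tables")` walks from `tables` to `exporter` to `editor`, and `not_covered` reports `reader`, which sits downstream and has to be scheduled separately. The real decision also depends on business value and schema changes, which a graph walk cannot capture.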
Example
Take our case as an example. We started with an (overly simplified) data flow diagram:
[Figure: data flow diagram with the nodes Management, Service, Pipeline, DB Tables, and Service]
In this project, the new data model sat inside a superspace of the old one. That means the new version of each part was capable of accepting the data output of its old upstream, given a proper data transformer, which would be implemented inside the associated anti-corruption layer. It might not have been so easy if we had proceeded from the other direction. Thus, we decided to start from the database and then march upstream. (Note that each v2 part was a software layer sitting in the same application or database as its v1 counterpart.)
[Figure: a series of diagrams of the same data flow (Management, Service, Pipeline, DB Tables, Service), labeled v1 and v2, showing the migration stages]
Useful Tactics
Feature Flag
When we want extra safety on a task that changes critical system behavior, we can split the work into several commits so that the one commit that actually modifies the behavior contains as little code change as possible. Ideally, the behavior-changing commit is as small as a one-line switch of a feature flag. This method gives us a safer way to retreat. A typical commit sequence would look like:
- Refactoring (create the feature flag mechanism)
- Switch feature flag
- Verification (we can stay in this stage for a while, until we feel secure)
- Refactoring (eliminate the feature flag mechanism)
I often illustrate this idea with the “three points of contact” principle in rock climbing.
Extra Verification with Production Data
Sometimes we don’t have the confidence to just roll out the change. In such cases, we can let both the new and legacy parts coexist for a while, verify and monitor their outcomes, and discard the legacy part when we think the new one is ready. A typical workflow would be:
- Build the new part and connect it to the upstream data flow.
- The current anti-corruption layer takes the outcomes from both parts, then verifies and monitors them.
- When the new part is thoroughly verified, switch the data path to use the new version. We can keep the monitoring facility in place for a while, just in case.
- Remove the legacy part and clean up.
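The workflow above can be sketched as a small wrapper that runs both parts on the same input, compares their outcomes, and keeps serving the legacy result until we switch. This is an illustrative sketch only: `ShadowRunner` and its attributes are hypothetical names, and it assumes both parts are deterministic and free of side effects, so that calling them twice on production data is safe.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("migration")


class ShadowRunner:
    """Runs the legacy and new parts side by side and counts mismatches."""

    def __init__(self, legacy: Callable[[Any], Any], new: Callable[[Any], Any],
                 serve_new: bool = False) -> None:
        self.legacy = legacy
        self.new = new
        self.serve_new = serve_new  # flip to True once the new part is verified
        self.calls = 0
        self.mismatches = 0

    def __call__(self, request: Any) -> Any:
        self.calls += 1
        legacy_out = self.legacy(request)
        try:
            new_out = self.new(request)
        except Exception:
            # A failure in the new part must never break production traffic.
            logger.exception("new part failed for %r", request)
            self.mismatches += 1
            return legacy_out
        if new_out != legacy_out:
            self.mismatches += 1
            logger.warning("mismatch for %r: legacy=%r new=%r",
                           request, legacy_out, new_out)
        return new_out if self.serve_new else legacy_out
```

While `serve_new` is `False`, callers see only legacy results and the mismatch counter tells us how far the new part is from ready. Setting `serve_new` to `True` switches the data path while the comparison keeps running, and the whole wrapper is deleted in the final cleanup step.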