Strangler Fig

In the autumn of 2018 we started an epic project for the next generation of our SSP system: turning our hardcoded, fixed-order ad request model into a fully configurable one. This practically guaranteed a large boost in revenue, and it also demanded a total revision of our code base.

Background

  • We were in a heavily solution-oriented business, so near-zero downtime was mandatory for this project.
  • We had a small dev team of 2 senior devs and 1 junior, who were also maintaining several other system components at the same time.
  • This project was the top priority until the end of the year; however, the roadmap for the following year was unpredictable.

Given the situation, we decided to proceed with the strangler fig pattern, which means gradually replacing parts of the system step by step until the revision goal is reached.

Pros

This strategy gave us the following benefits:

  • With careful planning, we could do the revision with almost zero downtime.
  • We could test and verify the system in small pieces with our limited dev resources.
  • We might receive partial value from the project along the way.
  • We could pause the project at certain points and pick it back up later on.

Cons

There were also drawbacks:

  • We needed to spend more energy on careful deployment and data migration.
  • More temporary code meant a higher development time cost.

The Practice of Strangler Fig

People often talk about this pattern in the context of migrating a legacy system to a microservice architecture, but the concept can be generalized. In our case, it was more of an internal revision of a system composed of a few monoliths. It’s also helpful to be aware of the anti-corruption layer pattern.

The key idea is to start renovating the system at one node in the data flow, and then expand the territory piece by piece while maintaining a few anti-corruption layers on the boundary between the new and legacy parts. Note that since we do not limit this pattern to the microservice context, each renovation step does not need to be a whole component/service; it can instead be as small as a software layer inside an application. The same idea applies to anti-corruption layers: they can be as thin as a function call, rather than a whole standalone service.
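
For instance, an in-process anti-corruption layer can be nothing more than a translation function that the new layer calls whenever it touches legacy data. The sketch below is only an illustration with hypothetical names (ReportRowV2, from_legacy_row, and the field names are all made up), not our actual schema:

    from dataclasses import dataclass


    @dataclass
    class ReportRowV2:
        placement_id: str
        impressions: int
        revenue_micros: int  # the hypothetical v2 schema stores money in micros


    def from_legacy_row(row: dict) -> ReportRowV2:
        """Anti-corruption layer as a single function: translate a legacy (v1)
        report row into the v2 schema so new code never sees old field names."""
        return ReportRowV2(
            placement_id=str(row["tag_id"]),  # renamed field in this example
            impressions=int(row["imps"]),
            revenue_micros=round(float(row["rev"]) * 1_000_000),
        )


    def monthly_report_v2(legacy_rows: list[dict]) -> list[ReportRowV2]:
        # The only place where v1 and v2 meet is the from_legacy_row() call.
        return [from_legacy_row(r) for r in legacy_rows]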

We followed these steps to establish the development plan:

  1. Lay out the data flow diagram of the system. It’s up to us to split the system into workable chunks of appropriate size.
  2. Analyze each data flow dependency to identify possible directions of expansion. In most cases, this is determined by how the data schemas change.
  3. Determine where to start, which mostly depends on business value, business logic, and the data flow analysis.
  4. Determine steps to expand the territory and estimate the delivery schedule.
  5. Work on detailed application deployment and data migration plans for each step.

Example

Take our case for example: we started with an (overly simplified) data flow diagram:

[Figure: data flow diagram covering Inventory Management, Ad Tag, Ad Serving Service, Data Pipeline, Report DB Tables, and Report Service]

In this project, the new data model was a superset of the old one. That means the new version of each part was capable of accepting the data output from its old upstream, given a proper data transformer implemented inside its associated anti-corruption layer. It would not have been so easy had we proceeded in the other direction. Thus, we decided to start from the database and then march upstream. (Note that each v2 part was a software layer sitting in the same application/database as its v1 counterpart.)
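
To make the superset argument concrete, a transformer of that kind could look like the sketch below. The names (AdRequestV2, RequestStepV2, FIXED_ORDER_V1 and its values) are purely illustrative assumptions; the point is that every v1 fixed-order request maps to a valid v2 configuration, with the extra v2 knobs filled in by defaults that mirror the old behavior:

    from dataclasses import dataclass, field

    # Hypothetical fixed order hardcoded in the v1 model.
    FIXED_ORDER_V1 = ["direct", "preferred_deal", "open_auction"]


    @dataclass
    class RequestStepV2:
        source: str
        timeout_ms: int = 300        # new v2 knob; default mirrors v1 behavior
        floor_price_micros: int = 0  # new v2 knob; absent in v1


    @dataclass
    class AdRequestV2:
        placement_id: str
        steps: list[RequestStepV2] = field(default_factory=list)


    def v1_to_v2(v1_request: dict) -> AdRequestV2:
        """Embed a v1 fixed-order ad request into the v2 configurable model."""
        return AdRequestV2(
            placement_id=str(v1_request["tag_id"]),
            steps=[RequestStepV2(source=s) for s in FIXED_ORDER_V1],
        )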

[Figure: snapshots of the data flow diagram at successive stages, showing the v2 territory expanding step by step from the report side (Report DB Tables and Report Service) upstream through the Data Pipeline, Ad Serving Service, Ad Tag, and Inventory Management, while the remaining parts stay on v1]

  • We started by migrating report data.
  • The correctness of data in the v2 tables was constantly verified and monitored.
  • We used the same transform function on the output of the data pipeline to write new report data into the v2 tables (see the dual-write sketch after this list).
  • The new report service was implemented as an experiment, accessible only to developers.
  • We moved the verification target from the DB tables to the report service.
  • We advanced the frontline to the data pipeline.
  • Once we had enough confidence, we formally switched to report service v2, but still kept the verification and monitoring for a while.
  • As the new downstream was thoroughly verified, its old counterpart was removed.
  • We started to work on the new ad serving service.
  • Mock traffic was sent to generate ad events so we could verify the report outcome.
  • Now ads were served under the v2 data model, translated from the v1 model inside the ad serving service.
  • A new client-side ad script module was made to run more experiments with the new route.
  • We switched to the new ad script module, deprecating the old one.
  • We implemented the new inventory management service, completing the last mile of the project.
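
A minimal sketch of the dual-write and verification steps above, under stated assumptions: it reuses the hypothetical from_legacy_row transform from the earlier sketch, and db stands for any database handle exposing insert_many and query_scalar methods (an assumed interface, not a real client library), with report_v1/report_v2 as illustrative table names:

    def write_report_rows(db, pipeline_rows: list[dict]) -> None:
        # The same transform used by the anti-corruption layer, so both table
        # generations are derived from one pipeline output.
        v2_rows = [from_legacy_row(r) for r in pipeline_rows]

        db.insert_many("report_v1", pipeline_rows)               # legacy path, untouched
        db.insert_many("report_v2", [vars(r) for r in v2_rows])  # new path, not served yet


    def verify_report_tables(db) -> bool:
        """Cheap consistency check run on a schedule: aggregated totals must match."""
        v1_total = db.query_scalar("SELECT SUM(imps) FROM report_v1")
        v2_total = db.query_scalar("SELECT SUM(impressions) FROM report_v2")
        return v1_total == v2_total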

Useful Tactics

Feature Flag

When we want to be more confident about a task that changes some critical system behavior, we can split the work into commits so as to minimize the code change in the one commit that actually modifies the behavior. Optimally, the behavior-changing commit can be as small as a one-line switch of a feature flag. This method gives us a safer way to retreat. A typical commit sequence would look like:

  1. Refactoring (create the feature flag mechanism)
  2. Switch feature flag
  3. Verification (we can stay in this stage for a while, until we feel secure)
  4. Refactoring (eliminate the feature flag mechanism)
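
As a minimal sketch of this sequence (the function and flag names below are hypothetical, not our real identifiers), the entire behavior change in commit 2 is the single line that flips the flag:

    # Feature flag mechanism introduced by the refactoring commit (step 1).
    USE_REPORT_SERVICE_V2 = False  # step 2 is a one-line change: False -> True


    def report_service_v1(placement_id: str) -> dict:
        return {"placement": placement_id, "source": "v1"}  # stand-in for the legacy path


    def report_service_v2(placement_id: str) -> dict:
        return {"placement": placement_id, "source": "v2"}  # stand-in for the new path


    def get_report(placement_id: str) -> dict:
        # The branch point created in step 1; step 4 deletes it together with
        # the flag once the v2 path has been verified (step 3).
        if USE_REPORT_SERVICE_V2:
            return report_service_v2(placement_id)
        return report_service_v1(placement_id)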

I often illustrate this idea with the “three points of contact” principle in rock climbing.

Extra Verification with Production Data

Sometimes we don’t have the confidence to simply roll out the change. In such cases, we can let both the new and legacy parts coexist for a while, verify and monitor their outcomes, and discard the legacy part when we think the new one is ready. A typical workflow would be:

  1. Build the new part and connect it to the upstream data flow.
  2. The current anti-corruption layer takes outcomes from both parts and verifies and monitors them (see the sketch after this list).
  3. When the new part is thoroughly verified, switch the data path to the new version. We can keep the monitoring facility around for a while just in case.
  4. Remove the legacy part and clean up.
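
A minimal sketch of step 2, reusing the hypothetical report_service_v1/report_service_v2 stubs from the feature flag example: the anti-corruption layer calls both implementations, still serves the legacy result, and records any mismatch so it shows up in monitoring.

    import logging

    logger = logging.getLogger("migration.verify")


    def get_report_verified(placement_id: str) -> dict:
        """Coexistence period (step 2): both parts run, the legacy result is
        still the one served, and any disagreement is logged for monitoring."""
        legacy = report_service_v1(placement_id)
        candidate = report_service_v2(placement_id)

        if candidate != legacy:
            logger.warning("v1/v2 mismatch for %s: %r vs %r",
                           placement_id, legacy, candidate)

        return legacy  # step 3 switches this to `candidate`; step 4 removes v1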