Steve: Developing on the Edge - The Costs of Continuous Deployment
Thoughts on development, Web-services, technology and mountains.
Thu 12 Feb 2009
The Costs of Continuous Deployment

The "new" phrase being discussed this week is Continuous Deployment, where people are proud of pushing out updates 50 times a day.

I don't personally view this as a new idea, and will point to my 2002 paper Making Web Services that work as one publication where I use the exact phrase; I think I used it in Java Development with Ant too. So I am pleased that other people have discovered what I've been doing, and talking about, for what, 8+ years?

Which is why I understand the limitations, and think 50x a day is pretty steep, even though they are using some really good stats to tune the rollout.

The first problems are technical:

  1. Even once cost-of-deploy is reduced to $0, there is cost-of-rollback to consider, which is a function of the time to retrieve lost data or revert converted databases, the cost of lost transactions, and any downtime.
  2. If you are providing a public API (and all RESTy or WS-* sites are) there's the risk of API change; that's not something you do lightly.
  3. Switching part of a cluster over makes for fun state coherence if you are using serialized objects over a tuplespace.
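
To make item 1 concrete, here is a minimal sketch of cost-of-rollback as a function of the three factors listed. Every name, field and rate in it is a made-up assumption for illustration; the real numbers depend entirely on your own system.

```python
from dataclasses import dataclass


@dataclass
class RollbackEstimate:
    """Hypothetical inputs; none of these names come from a real tool."""
    data_recovery_hours: float     # time to retrieve lost data / revert converted databases
    recovery_cost_per_hour: float  # assumed cost of the people doing that recovery
    lost_transactions: int         # transactions lost between deploy and rollback
    value_per_transaction: float   # assumed average value of one lost transaction
    downtime_hours: float          # time the site is unavailable
    downtime_cost_per_hour: float  # assumed cost of one hour of downtime


def rollback_cost(e: RollbackEstimate) -> float:
    """Sum the three cost components named in the list above."""
    recovery = e.data_recovery_hours * e.recovery_cost_per_hour
    transactions = e.lost_transactions * e.value_per_transaction
    downtime = e.downtime_hours * e.downtime_cost_per_hour
    return recovery + transactions + downtime


def expected_daily_rollback_cost(e: RollbackEstimate,
                                 deploys_per_day: int,
                                 rollback_probability: float) -> float:
    """Expected cost per day if each deploy independently needs a rollback
    with the given probability (a simplifying assumption)."""
    return deploys_per_day * rollback_probability * rollback_cost(e)
```

The point of the second function is that even with a free deploy, the expected rollback cost scales with the number of deploys per day, which is one reason 50x a day looks steep.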

Those are all technical. Here is my other objection, which is organisational. If you can push out an update in 15 minutes, management will expect you to have every bug fixed 16 minutes after it is reported to you. They will start checking the web site at T+16:00, phoning you at T+17:00, getting really concerned at T+30:00, and viewing T+1:00:00 as unacceptable. If your customers ever discover that you can push out a minor fix ("change the return xsd:type") in half an hour, they will expect a change that seems equally minor to them ("stop losing my files") to be out in an hour, even though the first is a trivial constant change and the second means tracking down a consistency bug which may be in software, or may be a RAID controller playing up.

What then to do?

  1. Staging site can be on CI, hooked in with your continuous integration tool.
  2. Deploy to real or virtual clusters. VMs have funny clock drift problems, but are good for simulating partitions.
  3. Have functional tests that not only verify correct behaviour when the infrastructure is live, but also try mocking classic system failures (no DNS, clock inconsistencies across the cluster) to see what happens.
  4. If you do partial rollout across a cluster, test in that state.
  5. Test rollback.
  6. Only push out system updates once a day.
  7. Make that a time of day when the developers are in the office, ready to field problems, rather than at home where only the ops team are left to field the calls.
  8. Publish the update schedule on an RSS feed that is on a different site/router from the test system.
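
As an illustration of item 3, here is a minimal sketch of a functional test that simulates "no DNS" by making name resolution fail. The `backend_status` probe and the hostname are hypothetical stand-ins for whatever your application really does; only `socket.getaddrinfo`, `socket.gaierror` and `unittest.mock` are real standard-library pieces.

```python
import socket
import unittest
from unittest import mock


def backend_status(hostname: str) -> str:
    """Hypothetical health probe: resolve a backend host and report a status string."""
    try:
        socket.getaddrinfo(hostname, 80)
    except socket.gaierror:
        # DNS is unavailable or the name does not resolve: degrade, don't crash.
        return "degraded: cannot resolve " + hostname
    return "ok"


class NoDnsTest(unittest.TestCase):
    HOST = "backend.example.internal"  # made-up hostname for the test

    def test_reports_degraded_when_dns_is_down(self):
        # Simulate a DNS outage by making every lookup raise gaierror.
        outage = socket.gaierror("simulated DNS outage")
        with mock.patch("socket.getaddrinfo", side_effect=outage):
            self.assertEqual(backend_status(self.HOST),
                             "degraded: cannot resolve " + self.HOST)

    def test_reports_ok_when_resolution_succeeds(self):
        # Simulate a working resolver without touching the network.
        with mock.patch("socket.getaddrinfo", return_value=[]):
            self.assertEqual(backend_status(self.HOST), "ok")
```

Saved as a test module, this runs under `python -m unittest`. Mocking only covers the code path; it is no substitute for pulling the DNS server out from under a real or virtual cluster, which is what items 2 and 4 are for.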

CI is like a chainsaw: very powerful, can achieve great things, but if handled badly you can cut your own legs off and that hurts.
