The "new" phrase being discussed this week is Continuous Deployment,
where people are proud of pushing out updates 50 times a day.
I don't personally view this as a new idea, and will point to my
2002 paper Making Web Services that work as one publication where I
use the exact phrase; I think I did it in Java Development with Ant
too. So I am pleased that other people have discovered what I've
been doing - and talking about - for what, 8+ years?
Which is why I understand the limitations, and think 50x a day
is pretty steep, even though they are using some really good stats
to tune the rollout.
The first problems are technical:
- Even once cost-of-deploy is reduced to $0, there is
cost-of-rollback to consider, which is a function of time to
retrieve lost data/converted databases, cost of lost transactions,
and any downtime.
- If you are providing a public API - and all RESTy or WS-* sites
are - there's the risk of API change; that's not something you do
lightly.
- Switching part of a cluster over makes for fun state coherence
if you are using serialized objects over a tuplespace.
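To make the cost-of-rollback point concrete, here is a minimal sketch of how you might put a number on it. The additive model, the field names, and the idea of multiplying by a failure rate are all my assumptions for illustration, not a measured formula; plug in your own figures.

```python
from dataclasses import dataclass


@dataclass
class RollbackEstimate:
    """Hypothetical inputs; every field name is an illustrative assumption."""
    data_recovery_hours: float   # time to retrieve lost data / revert converted databases
    hourly_staff_cost: float     # cost of the people doing that recovery
    lost_transactions: int       # transactions lost between deploy and rollback
    cost_per_transaction: float  # average value of one lost transaction
    downtime_hours: float        # time the service is unavailable
    hourly_downtime_cost: float  # revenue or penalty cost per hour of downtime

    def total(self) -> float:
        """Cost of a single rollback, assuming the three terms simply add up."""
        recovery = self.data_recovery_hours * self.hourly_staff_cost
        transactions = self.lost_transactions * self.cost_per_transaction
        downtime = self.downtime_hours * self.hourly_downtime_cost
        return recovery + transactions + downtime


def expected_daily_rollback_cost(estimate: RollbackEstimate,
                                 deploys_per_day: int,
                                 failure_rate: float) -> float:
    """Expected rollback cost per day if each deploy fails independently
    with probability failure_rate (an assumed simplification)."""
    return deploys_per_day * failure_rate * estimate.total()
```

Even with cost-of-deploy at $0, this number grows linearly with the number of deploys per day, which is the reason 50x a day deserves a second look.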
Those are all technical. Here is my other objection, which is
organisational. If you can push out an update in 15 minutes,
management will expect you to have every bug fixed 16 minutes after
it is reported to you. They will start checking the web site
at T+16:00, phoning you at T+17:00, getting really concerned at
T+30:00, and viewing T+1:00:00 as unacceptable. If your customers
ever discover that you can push out a minor fix ("change the return
xsd:type") in half an hour, they will expect a change that seems
equally minor to them ("stop losing my files") out in an hour, even
though one is a trivial constant change and the other means tracking
down a consistency bug which may be in software, or may be a RAID
controller playing up.
What then to do?
- The staging site can be updated continuously, hooked in with your
continuous integration tool.
- Deploy to real or virtual clusters. VMs have funny clock drift
problems, but are good for simulating partitions.
- Have functional tests that not only verify correct behaviour
when the infrastructure is live, but also try mocking classic system
failures (no DNS, clock inconsistencies across the cluster) to see
what happens.
- If you do partial rollout across a cluster, test in that
state.
- Test rollback.
- Only push out system updates once a day.
- Make that a time of day when the developers are in the office,
ready to field problems, rather than at home where only the ops
team are left to field the calls.
- Publish the update schedule on an RSS feed that is on a
different site/router from the test system.
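As an example of the "mock classic system failures" item above, here is a small sketch of a functional test that simulates a DNS outage. The check_backend function and the host name backend.example.test are hypothetical stand-ins for whatever probe your own system has; the mocking itself uses only the standard library's unittest.mock.

```python
import socket
import unittest
from unittest import mock


def check_backend(host: str, port: int = 80) -> str:
    """Hypothetical health probe: reports a status string instead of raising."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        # Name resolution failed; report it rather than crash the caller.
        return "dns-failure"
    return "ok"


class DnsOutageTest(unittest.TestCase):
    def test_reports_dns_failure(self):
        # Simulate "no DNS" by making every lookup raise gaierror.
        with mock.patch("socket.getaddrinfo",
                        side_effect=socket.gaierror("simulated DNS outage")):
            self.assertEqual(check_backend("backend.example.test"),
                             "dns-failure")

    def test_reports_ok_when_dns_answers(self):
        # Simulate a healthy resolver without touching the real network.
        with mock.patch("socket.getaddrinfo", return_value=[object()]):
            self.assertEqual(check_backend("backend.example.test"), "ok")
```

Saved in a test module, this runs with python -m unittest. The same pattern should extend to clock inconsistencies by patching whatever time source your code reads, though that depends on how your system gets its timestamps.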
CI is like a chainsaw: very powerful, can achieve great things,
but if handled badly you can cut your own legs off and that
hurts.