Seamless Service Migration
Hooroo has done a lot of growing in the company’s short five-year lifespan. We’ve already got a “legacy” application, which is naturally our bread-and-butter, moneymaker app. We’ve also got services that support or utilise that legacy application; they’re nicely separated and interact only via HTTP APIs and other such good things. And finally we’ve got the quintessential new platform rebuild, from scratch, with new decisions around architecture, domain design, and so on. Along with all of this, we have three or four different infrastructure setups, generally tracking along with these stages of application – and that is uncomfortable. Today we’ll be discussing moving one of our applications from an “old” infrastructure to our current infrastructure.
So what are we looking at?
The application under the limelight today is one of our biggest drivers of traffic. It integrates with Qantas and Jetstar’s flight-booking platforms to generate an email of hotel offers to customers who have recently made a flight booking. It is thus named, potentially confusingly, “Flightbookings”. The infrastructure it was originally deployed on was built with AWS’s OpsWorks service.
Specifically, Flightbookings was deployed into OpsWorks with a stock-standard Ubuntu 14.xx image, and was managed server-side with Chef cookbooks that were partially provided by OpsWorks. The cookbooks proved difficult to work with when execution order needed to change, and upgrading Chef cookbook versions was another “fun” exercise. This alone meant that spinning up new hosts was slow, and since cookbooks were also used in the deployment step, deployment was slow too. We’re talking eight-minutes slow, for a reasonably small application. We were also running our applications on EC2 instances outside of a VPC.
Our new infrastructure trades OpsWorks for CloudFormation, and Chef for Puppet. Puppet alone has been a big improvement over Chef in our case, though this may also have something to do with having complete control over the layout and execution order of our Puppet manifests. Our deployment now starts as a tarball pushed to S3, which CodeDeploy then grabs and drops onto the server. We deploy into our own AMIs derived from Amazon Linux, and we run our tests and build in a CentOS 6.5 Docker container, allowing us to vendor gems at tar-time, as we have found the two to be binary compatible. This all adds up to speed improvements both in host spin-up and app deployment. Oh, and everything is deployed into a VPC, Security Groups are restrictive by default, and everything is private by default.
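The CodeDeploy side of that pipeline is driven by an appspec.yml that ships in the tarball. A minimal sketch of what one might look like – the destination path and hook scripts here are illustrative, not our actual configuration:

```yaml
version: 0.0
os: linux
files:
  - source: /
    destination: /srv/flightbookings   # illustrative install path
hooks:
  AfterInstall:
    - location: deploy/link_vendored_gems.sh   # gems were vendored at tar-time
      timeout: 120
  ApplicationStart:
    - location: deploy/start_workers.sh        # web and SQS workers
      timeout: 60
```

Because the gems are already vendored into the tarball, the hooks stay short – there’s no bundle install on the box, which is a big part of the deploy-time win.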
It could be said that this migration started sometime early last year. Originally, Flightbookings would receive a payload from either of the upstream services and process it more or less immediately. This is great in terms of development simplicity, but is really no good in terms of maintenance or any other case in which the application is removed from service, where we may lose or “miss out” on the data being sent to us. So the first thing that really happened for this migration was that we created a completely new service solely to collect the payloads from the upstream services and place them on a queue for subsequent processing by Flightbookings. The new service was thus named “Flightbooking-Events”; it lives in our new infrastructure and is deployed with CodeDeploy. So from the point of view of Qantas or Jetstar, we are still happily receiving their payloads – no downtime whatsoever.
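The decoupling can be sketched in a few lines. This is a toy stand-in using an in-memory queue rather than SQS, and the function and field names are illustrative, not Flightbooking-Events’ actual code:

```python
import json
import queue

# In-memory stand-in for the SQS queue sitting between
# Flightbooking-Events and Flightbookings.
event_queue = queue.Queue()

def receive_payload(payload):
    """Flightbooking-Events: accept an upstream payload and enqueue it.

    Returns immediately, so upstream sees no downtime even while the
    processing app is out of service."""
    event_queue.put(json.dumps(payload))
    return 202  # accepted

def process_next():
    """Flightbookings worker: pull one payload off the queue and handle it."""
    raw = event_queue.get_nowait()
    booking = json.loads(raw)
    # ... generate the hotel-offers email for this booking ...
    return booking

receive_payload({"booking_ref": "ABC123", "destination": "SYD"})
# A worker can drain the queue later, even after a maintenance window.
print(process_next()["booking_ref"])
```

The point is simply that acceptance and processing have independent lifecycles: the queue absorbs payloads while the consumer is being migrated.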
A few things needed in preparation
Fast-forward to this year. We start by setting up the CloudFormation stack as we want it. In our case, we have two SQS workers and two web workers – yup, despite no longer being hit directly by Qantas or Jetstar, we still have a web frontend, mostly providing endpoints for third-party event notifications and the A/B testing UI. We set up the database separately from the main stack, as we do not want the lifecycle of the apps to affect the lifecycle of our data – i.e., if we need to rebuild the stack, we want to be able to do so without dropping the database unnecessarily. All of this allows us to test our environment before performing a data migration and going live on the new infrastructure. We actually did this twice – once for staging and once again for production – in a way making us doubly safe, or at least more confident in the work succeeding.
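The database-lifecycle separation can be expressed directly in CloudFormation by giving the data layer its own template. A hedged sketch – resource names, sizing, and parameters here are illustrative:

```yaml
# db-stack.yml – deployed on its own, so app-stack rebuilds never touch it
Parameters:
  DBPassword:
    Type: String
    NoEcho: true
Resources:
  FlightbookingsDB:
    Type: AWS::RDS::DBInstance
    DeletionPolicy: Snapshot        # deleting the stack snapshots, never drops, the data
    Properties:
      Engine: postgres
      DBInstanceClass: db.m3.large  # illustrative sizing
      AllocatedStorage: 150
      MasterUsername: flightbookings
      MasterUserPassword: !Ref DBPassword
Outputs:
  DBEndpoint:
    Description: Endpoint address, fed to the app stack as a parameter
    Value: !GetAtt FlightbookingsDB.Endpoint.Address
```

The app stack then takes the endpoint as an input, so it can be torn down and rebuilt freely while the database carries on.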
We also set up an instance running Bucardo, with access to the databases at both ends of the migration. Bucardo is a trigger-based replication system for PostgreSQL, the database layer of preference here at Hooroo. Bucardo can track changes on a host even when there is not yet another host to replicate to, which lets us keep the old system up and running and processing events while we pull its data over to the new infrastructure. Note, however, that it is not the only mechanism we use to migrate the database – we still use a database dump (actually an RDS snapshot) to do the majority of the heavy lifting. Bucardo just lets us catch up on the changes made between dump and restore.
As it came down to the time for migrating, a plan was crafted. It would progress roughly like this:
- Start Bucardo recording deltas.
- Create a snapshot/dump of the database.
- Restore the snapshot/dump into the new database instance.
- On the new database instance, drop the tables and sequences added by Bucardo. These were created when Bucardo was added to the old database, and will confuse Bucardo in the next step if left behind.
- Add the new database instance to Bucardo as a target in the dbgroup.
- Stop the old web workers and SQS workers.
- Start a Bucardo sync to replicate changes to the new database.
- Once the sync is complete, start the new web workers and SQS workers.
- Change the DNS entry for the web endpoint(s).
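The snapshot-plus-catch-up idea behind those steps can be sketched as a toy simulation – this is not Bucardo itself, just the pattern, with dicts standing in for tables:

```python
import copy

# Toy simulation of snapshot + delta catch-up: the bulk of the data moves
# via a dump/restore, while changes made in the meantime are recorded as
# deltas (Bucardo's trigger log) and replayed afterwards.

old_db = {1: "booking-a", 2: "booking-b"}
deltas = []  # step 1: start recording deltas before the snapshot

def write_old(key, value):
    """A write hitting the old database while the migration is in flight."""
    old_db[key] = value
    deltas.append((key, value))

snapshot = copy.deepcopy(old_db)   # step 2: dump / RDS snapshot
write_old(3, "booking-c")          # traffic keeps flowing during the copy
write_old(2, "booking-b-updated")

new_db = copy.deepcopy(snapshot)   # step 3: restore into the new instance
for key, value in deltas:          # stop workers, then sync the deltas
    new_db[key] = value

assert new_db == old_db            # new instance is now fully caught up
print(len(deltas), "deltas replayed")
```

The outage only needs to cover the delta replay, not the full copy – which is why a 150 GB database can move with just minutes of downtime.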
Total time of outage to migrate a 150 GB DB … only a couple of minutes.
Note that we do actually have an outage on our web endpoints. This isn’t such a problem for us, as our third parties are able to queue up messages while we’re not available. So one might say this is not entirely seamless, but the affected functionality is peripheral to the domain, and our customers do not see a difference.
And without too much fanfare, this is how the migration went. We migrated the staging app over, tested it and made sure it was all working correctly, then proceeded to move production over. No troubles, just one day (well, less) of work.