Downtime 3/30 Explained
March 30 2012
On March 30th, 2012, from 12:07 PM until 12:59 PM PDT, the entire CloudApp service was down.
The reason for the downtime was an oversight when adding a column to the database. The column was added with a default value, which requires each row of the table to be updated. This change made it through to production because the test database wasn’t large enough to expose the issue ahead of time. When the migration took much longer than expected, we realized that we didn’t have sufficient privileges to stop the query. We got in touch with Heroku and they were kind enough to help us out.
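For readers wondering what a gentler version of this kind of migration can look like, here is a minimal sketch of one common pattern: add the column without a default, then backfill existing rows in small batches so no single statement has to touch the whole table. This is an illustration only, not the migration we ran. It uses Python's built-in sqlite3 module as a stand-in database, and the `items` table, the `views` column, and the `add_column_in_batches` helper are all hypothetical names.

```python
import sqlite3


def add_column_in_batches(conn, batch_size=1000):
    """Add a hypothetical `views` column to a hypothetical `items` table,
    then backfill it in small batches. Returns the number of rows updated."""
    # Step 1: add the column with no default, so existing rows are not
    # rewritten as part of the schema change itself.
    conn.execute("ALTER TABLE items ADD COLUMN views INTEGER")
    conn.commit()

    # Step 2: backfill existing rows a batch at a time. Each batch is its
    # own short transaction, so the work can be paused between batches.
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE items SET views = 0 WHERE id IN ("
            "SELECT id FROM items WHERE views IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount

    # Step 3 (not shown): set the default for newly inserted rows. The
    # syntax for that depends on the database in use.
    return total


if __name__ == "__main__":
    # Build a small throwaway table to run the sketch against.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items (name) VALUES (?)",
        [(f"item-{i}",) for i in range(2500)],
    )
    conn.commit()
    print("rows backfilled:", add_column_in_batches(conn, batch_size=1000))
```

The batch size is an arbitrary choice here. On a real table it would need to be tuned against a copy of production data, since a small test table hides exactly the kind of slowness described above.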
The first change we’re making to our deployment process is to run database migrations on a clone of the production database first. This will catch similar mistakes before they reach production. In order to be able to stop long-running queries in the future, we’ll need to migrate to a new database. We’ll run that migration over the weekend and it’ll only require a minute or two of downtime. We’ll keep everyone posted at @cloudapp when we’re ready to make this change.