We'd like to keep Spreaker up 100% of the time. When that doesn't happen, we write about it here.

Website: So far so good
API: So far so good
Streaming: So far so good
Mobile apps: So far so good

29 January 2015

15:44 CET

Post-mortem analysis

Spreaker runs mainly on PostgreSQL. We currently have two shards, each in a master-slave streaming replication setup. Each database instance runs on AWS EC2 with four EBS Provisioned IOPS SSD volumes in a RAID 0 (stripe) configuration.
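
For context, an EBS RAID 0 setup like this can be provisioned through the AWS API. The following is a minimal sketch using boto3 (the AWS SDK for Python); the instance ID, device names, volume size and IOPS are placeholders for illustration, not our actual configuration.

```python
import boto3

# Hypothetical values -- not our actual instance, sizes or IOPS.
INSTANCE_ID = "i-0123456789abcdef0"
DEVICES = ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi"]

ec2 = boto3.client("ec2", region_name="us-east-1")

volume_ids = []
for device in DEVICES:
    # One Provisioned IOPS SSD (io1) volume per RAID member.
    vol = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=200,          # GiB, placeholder
        VolumeType="io1",
        Iops=2000,         # placeholder
    )
    volume_ids.append(vol["VolumeId"])

# Wait until the volumes are available, then attach them to the DB instance.
ec2.get_waiter("volume_available").wait(VolumeIds=volume_ids)
for device, volume_id in zip(DEVICES, volume_ids):
    ec2.attach_volume(VolumeId=volume_id, InstanceId=INSTANCE_ID, Device=device)

# The four attached volumes are then assembled into a single RAID 0 array on
# the host (e.g. with mdadm), and PostgreSQL's data directory lives on it.
```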

What happened

Last night, at about 01:05 UTC, we noticed a slowdown on two of the EBS volumes attached to our master #1 database. The slowdown was intermittent and performance was still acceptable, so we decided to keep an eye on it and wait. Unfortunately, at 02:00 UTC, those volumes suddenly stopped working and the master #1 database went down.
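
Slowdowns like this show up in the per-volume EBS metrics. As a rough sketch of how such a check could be scripted with boto3 and CloudWatch, here is a query for a volume's average queue length; the volume ID is a placeholder, not one of our actual volumes.

```python
import boto3
from datetime import datetime, timedelta

# Placeholder volume ID -- any EBS volume attached to the database host.
VOLUME_ID = "vol-0123456789abcdef0"

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Average queue length over the last hour, in 5-minute buckets. A queue
# length that stays well above the volume's provisioned capacity usually
# means the volume is not keeping up with the I/O it is being asked to do.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",
    Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
```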

We immediately promoted the slave database to master, redirecting both read and write queries to a single database instance (instead of splitting the load between two instances). Despite the successful slave-to-master switch, the single instance was unable to process all requests and hit a hardware limit (the 500 Mb/s EBS bandwidth cap), which led to another slowdown. We then started creating a new replica, which took longer than expected: once it was ready, at 03:20 UTC, the workload was split across master and slave again and the slowdown disappeared.
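
At the application level, splitting the load means routing writes to the master and reads to the slave. Below is a minimal sketch of that routing with psycopg2; hostnames, credentials and table names are made up for illustration, and real code would also handle connection pooling and failover.

```python
import psycopg2

# Placeholder DSNs -- in a setup like ours these would point at the master
# and the streaming-replication slave of a given shard.
MASTER_DSN = "host=db1-master.internal dbname=spreaker user=app password=secret"
SLAVE_DSN = "host=db1-slave.internal dbname=spreaker user=app password=secret"

master = psycopg2.connect(MASTER_DSN)
slave = psycopg2.connect(SLAVE_DSN)

def execute(query, params=(), readonly=False):
    """Send writes to the master and reads to the slave.

    When the slave is unavailable (as during the incident), pointing
    SLAVE_DSN at the master funnels the whole workload onto one instance.
    """
    conn = slave if readonly else master
    with conn, conn.cursor() as cur:
        cur.execute(query, params)
        return cur.fetchall() if readonly else None

# Example usage (table and column names are hypothetical):
# execute("UPDATE episodes SET plays = plays + 1 WHERE id = %s", (42,))
# rows = execute("SELECT id, title FROM episodes WHERE show_id = %s", (7,),
#                readonly=True)
```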

What’s next

Tomorrow morning, at 08:00 UTC, we'll put Spreaker in maintenance mode for about 10 minutes in order to upgrade our database instances. We'll double the RAM of each instance and migrate to an instance type with a 1 Gb/s EBS bandwidth cap.
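
The upgrade itself is the standard EC2 resize cycle (stop the instance, change its type, start it again), which is why a short maintenance window is needed. A rough sketch with boto3 follows; the instance ID and target instance type are placeholders, not our actual choices.

```python
import boto3

# Placeholder values -- not our actual instance or target type.
INSTANCE_ID = "i-0123456789abcdef0"
NEW_INSTANCE_TYPE = "r3.2xlarge"  # example of a larger, EBS-optimized type

ec2 = boto3.client("ec2", region_name="us-east-1")

# EBS-backed instances must be stopped before their type can be changed.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

# Switch to the larger instance type and make sure it is EBS-optimized,
# which is what raises the dedicated EBS bandwidth cap.
ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": NEW_INSTANCE_TYPE},
)
ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    EbsOptimized={"Value": True},
)

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```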

Looking for help?

If you need any assistance, please contact us via our customer support service or drop us an email.