We'd like to keep Spreaker up 100% of the time. When that doesn't happen, we write about it here.
|Website||So far so good|
|Api||So far so good|
|Streaming||So far so good|
|Mobile apps||So far so good|
Spreaker mainly runs on a PostgreSQL database. We currently have two shards, each in a master-slave streaming replication setup. Each database instance runs on AWS EC2 with four Provisioned IOPS SSD EBS volumes in a RAID 0 (striped) configuration.
Last night, at about 01:05 UTC, we noticed a slowdown on two of the EBS volumes attached to our master #1 database. The slowdown was intermittent and still acceptable, so we decided to keep an eye on it and wait. Unfortunately, at 02:00 UTC, those volumes suddenly stopped working and the master #1 database went down.
We immediately promoted the slave database to master, redirecting both read and write queries to a single database instance (instead of splitting the load between two instances). The slave-to-master switch succeeded, but the single instance was unable to process all requests: we hit a hardware limit (the 500 Mb/s EBS bandwidth cap) that led to another slowdown. We started creating a new replica, which took longer than expected. Once it was ready, at 03:20 UTC, the workload was split across master and slave again and the slowdown disappeared.
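For readers curious about what the failover meant for query routing, here is a minimal sketch. It is not our production code: the `Router` class, its methods, and the host names are hypothetical and only illustrate the idea of sending writes to the master and reads to a replica, falling back to the master when no replica exists.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List


@dataclass
class Router:
    """Hypothetical read/write router for one shard (illustration only)."""
    master: str
    replicas: List[str]

    def __post_init__(self) -> None:
        self._rebuild()

    def _rebuild(self) -> None:
        # Reads rotate across replicas; with no replica they fall back to the master.
        self._reads = cycle(self.replicas or [self.master])

    def target(self, is_write: bool) -> str:
        # Writes always go to the master; reads go to the next host in rotation.
        return self.master if is_write else next(self._reads)

    def promote(self, replica: str) -> None:
        # Failover: the chosen replica becomes the new master.
        self.replicas.remove(replica)
        self.master = replica
        self._rebuild()

    def add_replica(self, replica: str) -> None:
        # A freshly built replica starts taking read traffic again.
        self.replicas.append(replica)
        self._rebuild()


router = Router(master="db1-master", replicas=["db1-slave"])  # hypothetical hosts
router.promote("db1-slave")
after_failover = (router.target(is_write=True), router.target(is_write=False))
router.add_replica("db1-new-replica")
after_rebuild = (router.target(is_write=True), router.target(is_write=False))
```

In this sketch, `after_failover` has both reads and writes landing on the same host, which mirrors the single-instance period described above. `after_rebuild` shows reads moving to the new replica once it is available.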
Tomorrow morning, at 08:00 UTC, we’ll put Spreaker in maintenance mode for about 10 minutes in order to upgrade our database instances. We’ll double the RAM of each instance and migrate to an instance type with a 1 Gb/s EBS bandwidth cap.
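To put the two bandwidth caps in more familiar terms, the short calculation below converts them to megabytes per second. It assumes that "Mb" and "Gb" here mean decimal megabits and gigabits (1 Gb = 1000 Mb, 8 bits per byte). The function name is ours and purely illustrative.

```python
def megabits_to_megabytes_per_s(megabits_per_s: float) -> float:
    """Convert a rate in megabits per second to megabytes per second."""
    return megabits_per_s / 8  # 8 bits per byte


# Assumption: caps are expressed in decimal megabits per second.
current_cap = megabits_to_megabytes_per_s(500)    # cap we hit during the incident
planned_cap = megabits_to_megabytes_per_s(1000)   # cap after the planned upgrade
```

Under those assumptions the cap goes from 62.5 MB/s to 125 MB/s, so a single instance would have roughly twice the EBS throughput headroom if it had to carry the whole load again.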