We'd like to keep Spreaker up 100% of the time. When that doesn't happen, we write about it here.
|Service|Status|
|---|---|
|Website|So far so good|
|API|So far so good|
|Streaming|So far so good|
|Mobile apps|So far so good|
Since yesterday, Twitter login has not been working on our iOS applications. We’re working hard to fix it, and a new release will be uploaded in a few hours - though unfortunately, it could take a few days before it becomes available for download on your devices, due to the Apple review process.
Now that your Twitter and Spreaker accounts are connected, as soon as the new release is out in the App Store you will be able to use the 1-click login feature to access your account.
As announced yesterday, Spreaker will be under maintenance for 15 minutes, from 8:00 to 8:15 UTC. We’re going to upgrade our database servers as a countermeasure to the issues we experienced yesterday.
UPDATE at 8:10 UTC: database servers have been upgraded successfully. Spreaker is back. Thanks for your patience.
Spreaker mainly runs on a PostgreSQL database. We currently have two shards, each one in a master-slave streaming replication setup. Each database instance runs on AWS EC2 with four EBS Provisioned IOPS SSD volumes in a RAID 0 (striping) configuration.
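For reference, a setup like this is typically built along the following lines. This is a minimal sketch only: device names, paths, hostnames, and the `replicator` user are illustrative, not our actual configuration.

```shell
# Stripe four EBS volumes into a single RAID 0 array (illustrative device names)
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi

# On the master, postgresql.conf enables streaming replication:
#   wal_level = hot_standby
#   max_wal_senders = 3

# The slave is seeded from a base backup of the master...
pg_basebackup -h master.example.internal -U replicator \
    -D /var/lib/postgresql/main -X stream

# ...and then follows it via recovery.conf:
#   standby_mode = 'on'
#   primary_conninfo = 'host=master.example.internal user=replicator'
```

Note that RAID 0 trades redundancy for throughput: a single failed volume takes down the whole array, which is why the replication setup matters.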
Last night, at about 01:05 UTC, we noticed a slowdown of two EBS volumes attached to our master #1 database. The slowdown was intermittent and still acceptable, so we decided to keep an eye on it and wait. Unfortunately, at 02:00 UTC, those volumes suddenly stopped working and the master #1 database went down.
We immediately promoted the slave database to master, redirecting both read and write queries to a single database instance (instead of splitting the load between two instances). Despite the successful slave-to-master switch, the single instance was unable to process all requests, and we hit a hardware limit (500 Mb/s of EBS bandwidth) that led to another slowdown. We started the process of creating a new replica, which took more time than expected: once it was ready, at 03:20 UTC, the workload was split across master and slave and the slowdown disappeared.
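For context, a slave-to-master switch of this kind typically boils down to promoting the standby and then rebuilding a fresh replica from the new master. This is a sketch only; paths and hostnames are illustrative:

```shell
# Promote the streaming-replication standby to master
pg_ctl promote -D /var/lib/postgresql/main

# After pointing write (and, temporarily, all read) traffic at the new master,
# seed a replacement replica from it -- the step that took longer than expected,
# since it copies the full dataset over the network:
pg_basebackup -h new-master.example.internal -U replicator \
    -D /var/lib/postgresql/main -X stream
```

The promotion itself is fast; the recovery time is dominated by the base backup needed to restore redundancy and split the read load again.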
Tomorrow morning, at 8:00 UTC, we’ll put Spreaker in maintenance mode for about 10 minutes, in order to upgrade our database instances. We’ll double the RAM of each instance and migrate to an instance type with a 1 Gb/s EBS bandwidth cap.
We’re currently switching the primary database to another server, in order to recover from slow performance issues. During this timeframe, Spreaker is unavailable. We’re really sorry for the inconvenience.
UPDATE at 02:50 UTC: Spreaker is currently available, but still very slow. The primary database has been successfully migrated, but it’s slow to reply due to missing read replicas. We’re currently creating database replicas, which is taking more time than expected.
UPDATE at 03:33 UTC: Spreaker is now working. We’re really sorry for the inconvenience. Tomorrow, we’ll do a deep post-mortem analysis and post a plan to improve the database recovery process, reducing downtime in case it happens again.
We’re currently experiencing slow response times, due to a performance issue on our primary database server. We’re investigating it.
Streaming servers are currently under heavy load and you may not be able to listen to Spreaker audio tracks. We’re turning on more servers: it should be fixed in a few minutes.
From 18:40 till 19:05 UTC we had a failure on one web server: during this timeframe, all requests on that server failed with a 503 error. The issue is now fixed.
From 01:33 till 09:04 UTC we had a failure on some European servers hosted in Topix, Italy. European users may have experienced a connection drop, but they were redirected to other locations within a few minutes, thanks to our highly available and fault-tolerant infrastructure.
Our service provider AWS is experiencing serious networking issues with its CDN service (CloudFront), and you may run into problems while using Spreaker. This outage is affecting Spreaker and thousands of other web applications around the globe. We’re really sorry for the inconvenience and we hope AWS will fix it soon.
UPDATE at 01:00 UTC: AWS is currently investigating increased error rates for DNS queries for CloudFront distributions.
UPDATE at 01:38 UTC: the service is gradually recovering. You should now be able to access most Spreaker services (including the website), though some random failures may still occur.
We’re experiencing intermittent networking issues on Spreaker. We’ve alerted Amazon Web Services and we hope they will fix the issue soon.
UPDATE at 15:26 UTC: AWS is investigating Internet provider connectivity issues in the EU-WEST-1 Region.
FIXED at 15:58 UTC: according to AWS, “we experienced impaired Internet connectivity affecting some instances in the EU-WEST-1 Region. The issue has been resolved and the service is operating normally.”