We'd like to keep Spreaker up 100% of the time. When that doesn't happen, we write about it here.
|Component|Status|
|---|---|
|Website|So far so good|
|API|So far so good|
|Streaming|So far so good|
|Mobile apps|So far so good|
Today, between 05:19 and 05:21 UTC (3 minutes) and between 12:19 and 12:29 UTC (10 minutes), the Spreaker website was down due to an issue with one of our slave databases.
The host running our primary slave database experienced a failure that made the database server unreachable. Although a slave server going down should not affect our service, we found an edge case that caused database connections to get stuck, and with them Spreaker web server requests as well.
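As a general illustration of this failure mode (not the specific bug, whose details aren't covered here): if connections to a replica are opened without a timeout, a web request that tries to read from an unreachable host can hang until the TCP layer gives up, tying up a web server worker. A minimal sketch of a fail-fast safeguard, assuming a psycopg2-style client; the host names and helper below are hypothetical:

```python
import psycopg2
from psycopg2 import OperationalError

def connect_for_reads():
    """Open a read connection, falling back to the master if the slave is down.

    Hypothetical helper: the important part is connect_timeout, so a dead
    replica makes requests fail fast instead of hanging the web server worker.
    """
    params = dict(dbname="spreaker", user="app", password="secret",
                  connect_timeout=3)  # give up after 3 seconds
    try:
        return psycopg2.connect(host="db-slave-1.internal", **params)   # hypothetical slave
    except OperationalError:
        return psycopg2.connect(host="db-master-1.internal", **params)  # hypothetical master
```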
Tomorrow, April 27th at 6:00 UTC (8 AM CEST, April 26th at 11 PM Pacific Time), Spreaker will be in maintenance mode for up to 15 minutes to upgrade our database servers.
We’ll increase both the server CPU and the SSD drive throughput in order to better handle increasing traffic.
UPDATE at 06:00 UTC: Spreaker is going to be under maintenance for up to 15 minutes.
UPDATE at 06:15 UTC: maintenance has been completed and Spreaker is fully working again. The maintenance lasted 8 minutes. We’re now monitoring the system and running some secondary operations that will last a few more hours without affecting the service.
UPDATE at 06:47 UTC: most of the secondary operations have been completed. We’ll keep monitoring system health and performance. No further updates will be published as long as the system works as expected. Thank you for your patience.
We’re currently experiencing a huge load on the platform that’s affecting many services. We’re investigating it and will keep you posted here.
UPDATE at 20:25 UTC: Spreaker has received an unexpected, high volume of traffic from Egypt. Although our system is designed to automatically scale and absorb such traffic, it actually led to a performance issue in our primary database cluster. As a temporary solution, we had to block some of that traffic in order to restore the service in other countries. The service is now operating normally in the US and Europe, while we are still investigating the root cause of the performance issue on our databases.
UPDATE at 21:06 UTC: we’ve found the root cause of the performance issue on the primary database cluster and we’ve already rolled out a hot patch to avoid the same issue in the near future. We’ll continue our investigation and will likely schedule a maintenance window to upgrade our database servers to faster machines, in order to better absorb high-load peaks.
Due to a mistake in the release process, this morning we released a broken version of our Android Radio app (4.0.4), which was prone to crashing at startup.
We worked around the clock to fix the issue as soon as we noticed it, and we’ve already published an updated version (4.0.5) on the Play Store.
This version is already available for automatic update. If you experienced this issue and your application has not been automatically updated yet, please open the “Play Store” app on your Android device and visit the “My Apps” section to update it manually.
We’re very sorry about what happened, and we’re already working on improving our continuous integration pipeline in order to prevent similar issues from happening again in the future.
Thanks for your patience.
We’re currently having some networking issues in our primary datacenter run by AWS.
UPDATE at 15:00 UTC: networking issues are still ongoing, but they should affect only a small number of users. We’re monitoring network connectivity from multiple locations and the issue’s impact is decreasing over time.
UPDATE at 15:15 UTC: AWS just reported that it’s “investigating elevated packet loss between some Internet destinations and the EU-WEST-1 Region”.
UPDATE at 15:36 UTC: An external facility providing some connectivity to the AWS EU-WEST-1 Region has experienced power loss. AWS is currently working with the service provider to mitigate impact and restore power.
UPDATE at 16:23 UTC: AWS recovered power in the impacted facility and is continuing to investigate and resolve intermittent packet loss and latency between some Internet destinations and the EU-WEST-1 Region.
RESOLVED at 16:30 UTC: AWS confirmed the issue has been solved.
Playback currently doesn’t work on the latest Firefox when you browse the Spreaker website over HTTPS, due to stricter security policies. We’re working on it and plan to get it fixed very soon.
In the meantime, we suggest using a different browser (e.g. Google Chrome) or temporarily browsing Spreaker over HTTP.
Thanks for your patience.
UPDATE at 11:00 UTC: the issue has now been fixed. We’re monitoring the infrastructure to make sure everything runs smoothly. We’re really sorry for the inconvenience.
Since yesterday, YouTube sharing has not been working. We’re currently fixing it and re-uploading all failed videos to YouTube. This could take some time, due to the heavy workload.
Thanks for your patience.
UPDATE at 15:20 UTC: all failed videos have been reprocessed and successfully uploaded to YouTube.
Since yesterday, Twitter login has not been working on our iOS applications. We’re working hard to fix it, and a new release will be uploaded in a few hours - though unfortunately, it could take a few days before it is available for download on your devices, due to the Apple review process.
Now that your Twitter and Spreaker accounts are connected, as soon as the new release is out in the App Store, you will be able to use the 1-click login feature to access your account.
As announced yesterday, Spreaker will be under maintenance for 15 minutes from 8:00 to 8:15 UTC. We’re going to upgrade our database servers as a countermeasure following yesterday’s issues.
UPDATE at 8:10 UTC: database servers have been upgraded successfully. Spreaker is back. Thanks for your patience.
Spreaker mainly runs on PostgreSQL. We currently have two shards, each in a master-slave streaming replication setup. Each database instance runs on AWS EC2 with four EBS Provisioned IOPS SSD volumes in a RAID 0 (stripe) configuration.
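To make the setup above a bit more concrete, here is a minimal sketch of how read/write routing across two such shards could look, assuming a psycopg2-style client; the shard map, host names and helpers are hypothetical rather than our actual code:

```python
import psycopg2

# Hypothetical shard map: each shard has a master (writes) and a slave (reads).
SHARDS = {
    0: {"master": "db-master-1.internal", "slave": "db-slave-1.internal"},
    1: {"master": "db-master-2.internal", "slave": "db-slave-2.internal"},
}

def shard_for(user_id: int) -> dict:
    """Pick a shard from the sharding key (illustrative: modulo on the user id)."""
    return SHARDS[user_id % len(SHARDS)]

def get_connection(user_id: int, readonly: bool = False):
    """Route reads to the shard's slave and writes to its master."""
    shard = shard_for(user_id)
    host = shard["slave"] if readonly else shard["master"]
    return psycopg2.connect(
        host=host,
        dbname="spreaker",
        user="app",
        password="secret",
        connect_timeout=3,  # fail fast if the host is unreachable
    )
```

In a failover like last night’s, both entries of the affected shard would point at the same host, which is why a single instance ends up serving the full read and write load.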
Last night, at about 01:05 UTC, we noticed a slowdown of two of the EBS volumes attached to our master #1 database. The slowdown was intermittent and still acceptable, so we decided to keep an eye on it and wait. Unfortunately, at 02:00 UTC, those volumes suddenly stopped working and the master #1 database went down.
We immediately promoted the slave database to master, redirecting both read and write queries to a single database instance (instead of splitting the load between two instances). Despite the successful slave-to-master switch, the single instance was unable to process all requests: we hit a hardware limit (500 Mb/s of EBS bandwidth) that led to another slowdown. We started creating a new replica, which took more time than expected; once it was ready, at 03:20 UTC, the workload was split across master and slave again, and the slowdown disappeared.
Tomorrow morning, at 8:00 UTC, we’ll put Spreaker in maintenance mode for about 10 minutes in order to upgrade our database instances. We’ll double the RAM of each instance and migrate to instances with a 1 Gb/s EBS bandwidth cap.