Between approximately 15:20 and 16:40 UTC on Tuesday, a malfunction of the main database caused repeated partial failures across multiple services.
Among other symptoms, WordPress backends were at times unable to register or start new events, certain ongoing live events were interrupted or cut off, and streaming bandwidth consumption was not recorded. Overall, an estimated 25% of customers were affected.
The issue was acknowledged immediately, but efforts to restore full functionality took more than an hour. During this time some services were restarted or rerouted, leading to intermittent downtime of the ‘my-account’ area and, in some regions, the entire wpstream.net website.
The issue has now been isolated and full functionality is restored. Investigation into the root cause is still ongoing, and we are closely monitoring the services until a definitive fix is in place.
We will update this thread with details as soon as they are available.
The root of the issue has been pinpointed to a glitch in the OAuth plugin. Since we have been running a customized older version of it, we will need time to customize its latest version and roll it out to production. We expect this to take up to 10 days; in the meantime, we will closely monitor the infrastructure for any recurrence of the problem.
Much like the first occurrence, what appear to have been faulty or inefficient routines in the OAuth plugin overloaded the database, causing the range of symptoms described above. We realize that having been reassured by the plugin's creators that this would not happen again once we upgraded (and yet it did) is no excuse. So far, we have applied the following countermeasures:
drastically lightened the DB; many of the OAuth-specific entries were old (yet not properly recycled), redundant, or unneeded
increased overall DB capacity and scalability
improved the alerting system so that we are notified of a similar failure minutes before it actually happens
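The alerting improvement above can be sketched roughly as a threshold check over database health metrics. This is a hypothetical illustration only; the metric names and threshold values are assumptions, not WPStream's actual monitoring configuration:

```python
# Hypothetical sketch of a DB-overload early-warning check.
# Metric names and thresholds are illustrative assumptions,
# not WPStream's actual monitoring setup.

THRESHOLDS = {
    "threads_connected": 400,    # active DB connections
    "slow_queries_per_min": 50,  # queries exceeding the slow-query time
    "oauth_rows_millions": 5,    # size of the OAuth-related tables
}

def breached(metrics: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return the names of metrics at or above their alert threshold."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) >= limit]

# Example: connection count climbing toward overload trips one alert.
sample = {"threads_connected": 450,
          "slow_queries_per_min": 12,
          "oauth_rows_millions": 1}
print(breached(sample))  # -> ['threads_connected']
```

In practice such a check would be fed by a metrics poller and wired to a notifier, so an operator is paged while the overload is still building rather than after it has taken services down.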
Steps to be taken over the next few days:
get back in touch with the plugin's creators in the hope of a permanent solution
further investigate the combination of factors that led to this
Longer-term goals:
handle authentication via a different plugin or a home-grown solution if we are unable to fix the current system
fully separate website logic from streaming platform functionality, so that streaming remains operational independently of other components
We apologize once again for any problems this may have caused and assure you that we are taking serious measures to resolve this for good. Thank you for your patience and continued cooperation.
I am a CTO and a WPStream customer with (currently) three separate WPStream accounts serving three of my own clients. Total spend right now is about $5000/year, and if (big If! :-)) I hit my business goals, that could be $50k by the end of this year, hopefully continuing to grow fast from there.
Following the outages in March, one of which affected a live client event very badly, I would be grateful for some information from you to inform our decision about remaining with WPStream (I’m under pressure from my CEO)…
1. What is the standard uptime guarantee from WPStream?
2. How much downtime has there been in the past 2 years?
3. As well as the actions you list in your post above, do you have a plan to improve resilience by locating copies of databases and servers in different availability zones, to prevent single-site / data-centre failures from causing a service-wide outage, such as happened in March?
I would be grateful for those answers and any other information about service levels and further actions around resilience you can give me, please.
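For context on questions 1 and 2: an uptime guarantee translates directly into a downtime budget. A quick sketch of that arithmetic (the percentages below are common industry SLA tiers, not figures WPStream has stated):

```python
# Convert an uptime percentage into the downtime it permits per month.
# The SLA tiers shown are common industry examples, not WPStream's terms.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def downtime_budget_minutes(uptime_pct: float) -> float:
    """Minutes of downtime allowed per 30-day month at a given uptime %."""
    return MINUTES_PER_MONTH * (1 - uptime_pct / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% uptime -> {downtime_budget_minutes(pct):.1f} min/month")
```

At 99.9%, for example, the budget works out to roughly 43 minutes per month, so a single hour-plus outage like the one described above would already exceed it.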
One last additional question on the above (^^ the last post was me too, from a different account): could you please give an indication of the scale of the WPStream platform in terms of its bandwidth requirements, for example the total monthly streaming bandwidth across all streams served by WPStream, or another measure of your choosing? This would help us understand the scale of the service; by that standard, I would presume our bandwidth requirements are tiny by comparison.
3. The platform is built for fail-over and incorporates multiple redundancy strategies; as a result, some of the events above affected only part of our customer base, and in most cases we were able to maintain crucial functionality while working to restore full service. We are striving to do better, learning and improving with every mishap.
4. At peak we run many thousands of live events simultaneously; some have had hundreds of thousands of concurrent viewers.