Inconsistent platform functionality – Mar 9
- March 9, 2021 at 4:00 pm #115303
Between approximately 15:20 and 16:40 UTC on Tuesday, a malfunction of the main database caused repeated failures across multiple services.
Among other symptoms, WordPress backends were at times unable to register or start new events, some ongoing live events were stopped or discontinued, and streaming bandwidth consumption was not accounted for. Overall, this affected an estimated 25% of customers.
The issue was acknowledged immediately; however, efforts to restore full functionality took more than an hour. During this time some services were restarted or rerouted, leading to intermittent downtime of the ‘my-account’ area and, in some regions, of the whole wpstream.net website.
The issue has now been isolated and full functionality is restored. Investigation into the root cause is still ongoing, and we are closely monitoring the services until a definitive fix is in place.
We will update this thread with details as soon as they are available.
March 11, 2021 at 9:54 am #116088
The root of the issue has been pinpointed to a glitch in the OAuth plugin. As we’ve been running a customized older version of it, we’ll need to take the time to customize its latest version and roll it out in production. We expect this to take up to 10 days; meanwhile, we’ll closely monitor the infrastructure for recurrence of the problem.
March 24, 2021 at 9:48 am #117392
A permanent solution has been gradually deployed in production over the last few days and it is now complete.
We sincerely apologize for any inconvenience you may have experienced and appreciate your patience while we worked to resolve this.
March 30, 2021 at 8:51 am #118435
The issue has resurfaced and the platform once again functioned inconsistently for about 30 minutes, starting at 8:09 GMT.
We are diagnosing and will post updates here.
March 30, 2021 at 1:03 pm #118525
Much like the first occurrence, what appear to have been faulty or inefficient routines in the OAuth plugin led to a database overload, causing the range of symptoms described above. We realize that having been reassured by the plugin creators that this would not happen again once we upgraded (and yet it did) is no good excuse. So far, we’ve managed to apply the following countermeasures:
- drastically lighten the DB: many of the OAuth-specific entries were old (yet not properly recycled), redundant, or unneeded
- increase overall DB capacity and scalability
- improve the alerting system so that we are notified of a similar failure minutes before it actually happens
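For readers wondering what "lightening the DB" can look like in practice, here is a minimal sketch of removing expired authentication entries in small batches, so that the cleanup itself does not put a large load on the database. It is an illustration only, not the actual WpStream or OAuth plugin code: the table name `oauth_tokens`, the column `expires_at`, and the helper `prune_expired_tokens` are hypothetical, and SQLite is used purely to keep the example self-contained.

```python
import sqlite3


def prune_expired_tokens(conn, now, batch_size=1000):
    """Delete expired rows in small batches; return the total number removed.

    Assumes a hypothetical table oauth_tokens(token, expires_at).
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM oauth_tokens WHERE rowid IN ("
            "SELECT rowid FROM oauth_tokens WHERE expires_at < ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()  # keep each transaction short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Demo on an in-memory database with the hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE oauth_tokens (token TEXT PRIMARY KEY, expires_at INTEGER)"
)
now = 1_000_000
# Even-numbered tokens are already expired, odd-numbered ones are still valid.
rows = [(f"t{i}", now - 10 if i % 2 == 0 else now + 3600) for i in range(10)]
conn.executemany("INSERT INTO oauth_tokens VALUES (?, ?)", rows)
conn.commit()

removed = prune_expired_tokens(conn, now, batch_size=3)
remaining = conn.execute("SELECT COUNT(*) FROM oauth_tokens").fetchone()[0]
print(removed, remaining)
```

Deleting in batches rather than in one large statement keeps each transaction short, which is one common way to avoid the cleanup becoming a source of overload in its own right.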
Steps to be taken over the next few days:
- get back in touch with the plugin creators in hope of permanent solutions
- further investigate the sum of factors that led to this
Longer term goals:
- handle authentication via a different plugin or a home-grown solution if we are unable to sort out the current system
- fully separate website logic from streaming platform functionality, so as to ensure streaming operability independent of other components
We apologize once again for any issues this may have caused and assure you that we are taking serious measures to overcome this for good. Thank you for your patience and continued cooperation.
April 17, 2021 at 12:51 pm #121106 admin-0326 Participant
I am a CTO and WPStream customer with (currently) three separate WPStream accounts serving three of my own clients. Total spend right now is about $5000/year and if (big If! :-)) I hit my business goals that could be $50k by the end of this year, and hopefully continue growing fast from there.
Following the outages in March, one of which affected a live client event very badly, I would be grateful for some information from you to inform our decision about remaining with WPStream (I’m under pressure from my CEO)…
1. What is the standard uptime guarantee from WPStream?
2. How much downtime has there been in the past 2 years?
3. As well as the actions you list in your post above, do you have a plan to improve resilience by locating copies of databases and servers in different availability zones, to prevent single site / data centre failures causing a service wide outage, such as happened in March?
I would be grateful for those answers and any other information about service levels and further actions around resilience you can give me, please.
Chris
April 18, 2021 at 1:29 pm #121245 chris-3374 Participant
One last additional question on the above (^^ the last post was me too from a different account): please could you give an indication of the scale of the WPStream platform in terms of its bandwidth requirement, for example in terms of the total monthly streaming bandwidth taken up by all streams served by WPStream, or another example of your choosing? This goes to assist our understanding of the scale of the service, by which standard I would presume our data bandwidth requirements are tiny, by comparison.
April 19, 2021 at 3:11 pm #121441
1. per the SaaS agreement the service has to be functional at least 99.9% of the time, and we’re proud to be well within that
2. we had to deal with the following disruptive events last year
3. the platform is built for fail-over and incorporates multiple redundancy strategies; thus some of the events above have only affected some of our customers, and in most cases we’ve been able to ensure crucial functionality while working to restore full service; we’re striving to do better, learning and improving with every mishap
4. at peak we run many thousands of live events simultaneously; some have had hundreds of thousands of concurrent viewers
April 20, 2021 at 6:31 am #121533 chris-3374 Participant
Many thanks. This is much appreciated.
April 21, 2021 at 9:21 am #121710 Beatrice Moderator
Happy to be of service, Chris. Please do not hesitate to contact us with any more questions 🙂