Inconsistent platform functionality – Mar 9

    #115303
    Gabriel
    Moderator

    Between approximately 15:20 and 16:40 UTC on Tuesday, a malfunction of the main database caused repeated failures in multiple services.

    Among other symptoms, WordPress backends were at times unable to register or start new events, some ongoing live events were turned off or discontinued, and streaming bandwidth consumption was not accounted for. Overall, this affected an estimated 25% of customers.

    The issue was acknowledged immediately; however, efforts to restore full functionality took more than an hour. During this time some services were restarted or rerouted, leading to intermittent downtime of the ‘my-account’ area and, in some regions, of the whole wpstream.net website.

    The issue has now been isolated and full functionality is restored. Investigation into the root cause is still ongoing, and we are closely monitoring the services until a definitive fix is in place.

    We will update this thread with details as soon as they are available.

    #116088
    Christina
    Moderator

    The root of the issue has been pinpointed to a glitch in the OAuth plugin. Because we have been running a customized older version of it, we need to take the time to customize its latest version and roll it out in production. We expect this to take up to 10 days; meanwhile, we will closely monitor the infrastructure for any recurrence of the problem.

    #117392
    Gabriel
    Moderator

    A permanent solution has been gradually deployed in production over the last few days, and the rollout is now complete.

    We sincerely apologize for any inconvenience you may have experienced and appreciate your patience while we worked to resolve this.

    #118435
    Gabriel
    Moderator

    The issue has resurfaced: the platform was once again functioning inconsistently for about 30 minutes, starting at 8:09 GMT.
    We are diagnosing it and will post updates here.

    #118525
    Christina
    Moderator

    Much as in the first occurrence, what appear to have been faulty or inefficient routines in the OAuth plugin led to a database overload, causing all sorts of symptoms (see above). We realize that having been reassured by the plugin creators that this would not happen again once we upgraded (and yet it did) is no good excuse. So far, we have applied the following countermeasures:

    • drastically lightened the DB: many of the OAuth-specific entries were old (yet not properly recycled), redundant, or unneeded
    • increased overall DB capacity and scalability
    • improved the alerting system so that we are notified of a similar failure minutes before it actually happens
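
    As an illustration of the first countermeasure only (this is not WPStream's actual code), stale token rows can be removed in small batches so the cleanup itself does not add a long-running load on the database. Everything named here is an assumption: the `oauth_tokens` table, its `expires_at` column, the function name, and the use of SQLite as a stand-in for a production database the thread does not name.

```python
import sqlite3


def prune_expired_tokens(conn, now, batch_size=1000):
    """Delete rows whose expires_at is before `now`, a batch at a time.

    Returns the total number of rows deleted. Table and column names are
    hypothetical; they are not taken from any real OAuth plugin.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM oauth_tokens WHERE id IN ("
            "SELECT id FROM oauth_tokens WHERE expires_at < ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()  # commit per batch to keep each transaction short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Small in-memory demo with made-up expiry timestamps.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE oauth_tokens (id INTEGER PRIMARY KEY, expires_at INTEGER)"
)
conn.executemany(
    "INSERT INTO oauth_tokens (expires_at) VALUES (?)",
    [(10,), (20,), (30,), (40,), (50,)],
)
deleted = prune_expired_tokens(conn, now=35, batch_size=2)
remaining = conn.execute("SELECT COUNT(*) FROM oauth_tokens").fetchone()[0]
```

    Batching is a design choice for the sketch: a single large DELETE is simpler but can hold locks for a long time on a busy table.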

    Steps to be taken over the next few days:

    • get back in touch with the plugin creators in hope of permanent solutions
    • further investigate the sum of factors that led to this

    Longer term goals:

    • handle authentication via a different plugin or a home-grown solution if we are unable to sort out the current system
    • fully separate website logic from streaming platform functionality, so as to ensure streaming operability independent of other components

    We apologize once again for the issues that this may have caused and can reassure you that we are taking serious measures to overcome this for good. Thank you for your patience and continued cooperation.

    #121106
    admin-0326
    Participant

    Hi there,

    I am a CTO and WPStream customer with (currently) three separate WPStream accounts serving three of my own clients. Total spend right now is about $5000/year and if (big If! :-)) I hit my business goals that could be $50k by the end of this year, and hopefully continue growing fast from there.

    Following the outages in March, one of which affected a live client event very badly, I would be grateful for some information from you to inform our decision about remaining with WPStream (I’m under pressure from my CEO)…

    1. What is the standard uptime guarantee from WPStream?
    2. How much downtime has there been in the past 2 years?
    3. As well as the actions you list in your post above, do you have a plan to improve resilience by locating copies of databases and servers in different availability zones, to prevent single site / data centre failures causing a service wide outage, such as happened in March?

    I would be grateful for those answers and any other information about service levels and further actions around resilience you can give me, please.

    Many thanks,

    Chris

    #121245
    chris-3374
    Participant

    One last additional question on the above (^^ the last post was me too from a different account): please could you give an indication of the scale of the WPStream platform in terms of its bandwidth requirement, for example in terms of the total monthly streaming bandwidth taken up by all streams served by WPStream, or another example of your choosing? This goes to assist our understanding of the scale of the service, by which standard I would presume our data bandwidth requirements are tiny, by comparison.

    • This reply was modified 5 months ago by chris-3374.
    #121441
    Christina
    Moderator

    Hey Chris,

    1. Per the SaaS agreement, the service has to be functional at least 99.9% of the time, and we’re proud to be well within that.
    2. We had to deal with the following disruptive events last year:

    https://wpstream.net/forums/topic/downtime-mar-19-2020/
    https://wpstream.net/forums/topic/failure-to-start-live-events-mar-24/
    https://wpstream.net/forums/topic/live-events-turning-off-inadvertedly-nov-25th/
    https://wpstream.net/forums/topic/all-videos-are-missing/
    https://wpstream.net/forums/topic/inconsistent-platform-functionality-dec-28/

    3. The platform is built for fail-over and incorporates multiple redundancy strategies. As a result, some of the events above affected only part of our customers, and most of the time we were able to maintain crucial functionality while working to restore full service. We’re striving to do better, learning and improving with every mishap.

    4. At peak we run many thousands of live events simultaneously; some have had hundreds of thousands of concurrent viewers.
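
    For reference on point 1, a 99.9% availability figure can be turned into a downtime budget with simple arithmetic. The sketch below is illustrative only: the 30-day month and 365-day year are assumptions, and the thread does not say over which period or in what way the SaaS agreement measures availability.

```python
def downtime_budget_minutes(availability_pct, period_hours):
    """Minutes of downtime allowed in a period at a given availability."""
    return period_hours * 60 * (1 - availability_pct / 100)


# Assumed measurement periods (not taken from the SaaS agreement).
per_month = downtime_budget_minutes(99.9, 30 * 24)   # about 43.2 minutes
per_year = downtime_budget_minutes(99.9, 365 * 24)   # about 525.6 minutes

print(f"30-day month: {per_month:.1f} min")
print(f"365-day year: {per_year:.1f} min ({per_year / 60:.2f} h)")
```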

    #121533
    chris-3374
    Participant

    Many thanks. This is much appreciated.

    #121710
    Beatrice
    Moderator

    Happy to be of service, Chris. Please do not hesitate to contact us with any more questions 🙂
