Yesterday, starting at around 2022-10-10T10:25:00Z (UTC), we saw significant performance degradation when requesting jobs on production: periodic 503s and 504s, as well as response times longer than 100 seconds (the maximum timeout on our end). The issues persisted until around 2022-10-10T11:24:00Z.
Some of the 503 responses included the following message content (note the raw HTML, as opposed to the expected API error response format):
<h2>This website is under heavy load (queue full)</h2><p>We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.</p>
We’re still assessing how much this presumed Stuart downtime affected our operations. While we continue our postmortem, could you please confirm the incident on your end and share any additional information (e.g. duration, root cause, extent of impact)?
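For anyone hitting the same failure mode, here is a minimal client-side sketch of how the 503/504 responses described above could be handled: retry with exponential backoff, and treat a body that does not parse as JSON (such as the raw HTML "queue full" page) as an infrastructure-level error rather than an API error. The names `parse_error_body`, `request_with_retry`, and the `send` callable are hypothetical, and the assumption that the API's normal error format is a JSON object is ours; none of this is taken from Stuart's documentation.

```python
import json
import random
import time
from typing import Callable, Optional, Tuple

# Status codes seen during the incident; assumed here to be safe to retry.
RETRYABLE_STATUSES = {503, 504}


def parse_error_body(body: str) -> Optional[dict]:
    """Return the decoded JSON object, or None when the body is not a
    JSON object (for example a raw HTML page served by a proxy)."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return None
    return parsed if isinstance(parsed, dict) else None


def request_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or the
    attempts run out. send() is a hypothetical callable that performs
    one HTTP request and returns (status_code, body_text)."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUSES:
            break
        # Exponential backoff with jitter so clients do not retry in lockstep.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay / 2))
        status, body = send()
    return status, body
```

A caller could wrap its existing job-creation request in `send`, then check `parse_error_body(body) is None` on a final 503 to log that the response came from the load-shedding layer rather than from the API itself. Whether retrying job creation is safe depends on how the API handles duplicate requests, which this sketch does not address.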
Indeed, yesterday we experienced issues with our API.
You probably received a notification at the email address you use with your Stuart account.
If not, you can reach out to cse@stuart.com and we will add you to the list, so that you receive such notifications in the future. You will also receive the postmortem that we will be releasing in the next few days.
For more information on related tips and best practices, please see our post "Incidents & Outages".
Hi @Adrien, we still haven’t seen any postmortem come through. Has one not been sent yet (in which case, when should we expect this to occur), or should we check to see if there was an issue getting added to the notification mailing list?
The postmortem was sent last week.
You are probably not on our list yet. Could you please reach out to cse@stuart.com so that we can send you the postmortem and add you to the list?