Multiple system components degraded, some algorithm execution calls and other functions may fail.
Incident Report for Algorithmia
Resolved
Modification of storage infrastructure and some service instances in the metrics sub-system have been completed and all platform systems are operating normally at this time.
Posted Nov 19, 2020 - 08:52 PST
Monitoring
System was fully recovered from this incident and is in a health operating state. Operations team continuing to monitor metrics subsystem. Root cause and remediation efforts on going.
Posted Nov 17, 2020 - 08:25 PST
Update
Recovery process ongoing. metrics functionality is fully restored. Algorithm request queues exceeded internal limits during the outage. these queues are being manually drained. Some algorithms will still be experiencing timeouts or other error responses.
Posted Nov 16, 2020 - 10:35 PST
Identified
Internal metrics service experienced a resource limit which impacted algorithm request processing and scaling.
Posted Nov 16, 2020 - 10:11 PST
Investigating
We are continuing to see periodic episodes of loss of communication between certain machine nodes in the platform. This issue has increased in frequency over the past few weeks. Mitigation efforts are in progress now and an upgrade of some core components is being tested for role out in the next 48 hours.
Posted Nov 16, 2020 - 08:57 PST
This incident affected: Algorithm API.