https://status.aws.amazon.com/
<<
API Error Rates in US-EAST-1
[9:37 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified the root cause and are actively working towards recovery.
[10:12 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified root cause of the issue causing service API and console issues in the US-EAST-1 Region, and are starting to see some signs of recovery. We do not have an ETA for full recovery at this time.
[11:26 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. Services impacted include: EC2, Connect, DynamoDB, Glue, Athena, Timestream, and Chime and other AWS Services in US-EAST-1. The root cause of this issue is an impairment of several network devices in the US-EAST-1 Region. We are pursuing multiple mitigation paths in parallel, and have seen some signs of recovery, but we do not have an ETA for full recovery at this time. Root logins for consoles in all AWS regions are affected by this issue, however customers can login to consoles other than US-EAST-1 by using an IAM role for authentication.
[12:34 PM PST] We continue to experience increased API error rates for multiple AWS Services in the US-EAST-1 Region. The root cause of this issue is an impairment of several network devices. We continue to work toward mitigation, and are actively working on a number of different mitigation and resolution actions. While we have observed some early signs of recovery, we do not have an ETA for full recovery. For customers experiencing issues signing-in to the AWS Management Console in US-EAST-1, we recommend retrying using a separate Management Console endpoint (such as
https://us-west-2.console.aws.amazon.com/). Additionally, if you are attempting to login using root login credentials you may be unable to do so, even via console endpoints not in US-EAST-1. If you are impacted by this, we recommend using IAM Users or Roles for authentication. We will continue to provide updates here as we have more information to share.
[2:04 PM PST] We have executed a mitigation which is showing significant recovery in the US-EAST-1 Region. We are continuing to closely monitor the health of the network devices and we expect to continue to make progress towards full recovery. We still do not have an ETA for full recovery at this time.
[2:43 PM PST] We have mitigated the underlying issue that caused some network devices in the US-EAST-1 Region to be impaired. We are seeing improvement in availability across most AWS services. All services are now independently working through service-by-service recovery. We continue to work toward full recovery for all impacted AWS Services and API operations. In order to expedite overall recovery, we have temporarily disabled Event Deliveries for Amazon EventBridge in the US-EAST-1 Region. These events will still be received & accepted, and queued for later delivery.
[3:03 PM PST] Many services have already recovered, however we are working towards full recovery across services. Services like SSO, Connect, API Gateway, ECS/Fargate, and EventBridge are still experiencing impact. Engineers are actively working on resolving impact to these services.
[4:35 PM PST] With the network device issues resolved, we are now working towards recovery of any impaired services. We will provide additional updates for impaired services within the appropriate entry in the Service Health Dashboard.
>>
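A quick aside on the two workarounds AWS mentions above (a console/API endpoint outside US-EAST-1, and IAM users or roles instead of root): below is a minimal boto3 sketch of what that looks like from code. The role ARN and session name are hypothetical placeholders; the us-west-2 STS endpoint is a real regional endpoint.
<<
import boto3

# Talk to the STS regional endpoint in us-west-2, so nothing depends on the
# impaired region. IAM user credentials come from the normal boto3 credential
# chain (env vars, ~/.aws/credentials, ...), not from the root account.
sts = boto3.client(
    "sts",
    region_name="us-west-2",
    endpoint_url="https://sts.us-west-2.amazonaws.com",
)

creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/OpsBreakGlass",  # hypothetical role
    RoleSessionName="outage-workaround",
)["Credentials"]

# A session pinned to us-west-2 with the temporary role credentials.
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
    region_name="us-west-2",
)
print(session.client("ec2").describe_regions()["Regions"][:3])
>>
The same idea applies to the console: sign in with an IAM user at https://us-west-2.console.aws.amazon.com/ instead of with the root account.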
The only thing that is still not fully working is this:
<<
Amazon Elastic Container Service (N. Virginia) Elevated Fargate task launch failures
3:32 PM PST ECS has recovered from the issue earlier in the day, but we are still investigating task launch failures using the Fargate launch type. Task launches using the EC2 launch type are not impacted.
4:44 PM PST ECS has recovered from the issue earlier in the day. Task launches using the EC2 launch type are fully recovered. We have identified the root cause for the increased Fargate launch failures and are working towards recovery.
5:31 PM PST ECS has recovered from the issue earlier in the day. Task launches using the EC2 launch type are fully recovered. We have identified the root cause for the increased Fargate launch failures and are starting to see recovery. As we work towards full recovery, customers may experience insufficient capacity errors and these are being addressed as well.
7:30 PM PST ECS has recovered from the issue earlier in the day. Task launches using the EC2 launch type are fully recovered. Fargate task launches are currently experiencing increased insufficient capacity errors. We are working on addressing this. In the interim, tasks sizes smaller than 4vCPU are less likely to see insufficient capacity errors.
11:01 PM PST ECS has recovered from the issue earlier in the day. Task launches using the EC2 launch type are fully recovered. Fargate task launches are currently experiencing increased insufficient capacity errors. We are working on addressing this and have recently seen a decrease in these errors while continuing to work towards full recovery. In the interim, tasks sizes smaller than 4vCPU are less likely to see insufficient capacity errors.
>>
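For anyone running into those insufficient-capacity errors: the "stay under 4 vCPU" tip is simply the task size in the task definition. A minimal boto3 sketch, assuming hypothetical family, cluster and subnet names (2048 CPU units = 2 vCPU):
<<
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Register a Fargate task definition sized below the 4 vCPU threshold.
task_def = ecs.register_task_definition(
    family="small-web",                              # hypothetical family
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="2048",                                      # 2 vCPU
    memory="4096",                                   # 4 GB
    containerDefinitions=[{
        "name": "web",
        "image": "public.ecr.aws/nginx/nginx:latest",
        "essential": True,
    }],
)["taskDefinition"]

# Launch it on Fargate; cluster and subnet are placeholders.
ecs.run_task(
    cluster="my-cluster",
    launchType="FARGATE",
    taskDefinition=task_def["taskDefinitionArn"],
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-0123456789abcdef0"],
        "assignPublicIp": "ENABLED",
    }},
)
>>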
-------------------------------------------
Earlier:
For example,
https://aws.amazon.com/marketplace is down (500 error, see
https://archive.md/dACXu)
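(For reference, that 500 is easy to check from code; a tiny standard-library sketch, nothing AWS-specific assumed:)
<<
import urllib.error
import urllib.request

url = "https://aws.amazon.com/marketplace"
try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        print(resp.status)      # 200 once the page is healthy again
except urllib.error.HTTPError as err:
    print(err.code)             # printed 500 during the outage
>>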
Hacker News was on it early:
https://news.ycombinator.com/item?id=29473630 via
https://twitter.com/hn_frontpage/status/1468245386326388752
A lot of AWS itself also runs in US-EAST-1, so AWS is struggling to report status and to fix things (even root accounts no longer work):
https://twitter.com/YuzukiHusky/status/1468302440156086281 and
https://twitter.com/d_feldman/status/1468265185630687233 and
https://twitter.com/ipmb/status/1468245893279363088
That is why Hacker News, among others, was on it so early.
The first report from Amazon itself (found via the howisthecloud cloud status aggregator on Twitter):
https://twitter.com/howisthecloud/status/1468270456981667845 and from an employee
https://twitter.com/amitkjha_rjn/status/1468270518012891150
This was the first tweet to report the problem:
https://twitter.com/tarparara/status/1468243375631568911
A nice chronological Twitter thread:
https://twitter.com/nixcraft/status/1468247487190343687
In a way this feels like the recent Facebook outage: that one weak link nobody had on their radar.
Hopefully the management console comes back online soon. It is down right now:
https://us-east-1.console.aws.amazon.com/console/home currently returns a 504 error; earlier it was at least sort of "up", showing an "unavailable" page:
https://archive.md/nadvS
Cynical remarks are pouring in on Twitter, such as
https://twitter.com/PraneetSahgal/status/1468251344876285952,
https://twitter.com/rmogull/status/1468285279202996225 and
https://twitter.com/regis_alenda/status/1468286588169895941
But there are also people who now realise that keeping your eggs in more than one basket can be worthwhile (even if it costs money):
https://twitter.com/zerodmg/status/1468274651407159301
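As a very small illustration of that "multiple baskets" idea: at the client level you can at least fall back to a second region when API calls in the primary one fail. A hedged boto3 sketch; the region list and the DynamoDB call are just examples, and a real multi-region setup of course also needs the data replicated (e.g. DynamoDB global tables).
<<
import boto3
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]        # primary first, fallback second

def list_tables():
    last_error = None
    for region in REGIONS:
        try:
            ddb = boto3.client("dynamodb", region_name=region)
            return region, ddb.list_tables()["TableNames"]
        except (BotoCoreError, ClientError) as err:
            last_error = err                # remember the failure, try the next region
    raise last_error

region, tables = list_tables()
print(f"answered from {region}: {tables}")
>>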
And Kris (who does impressive cloud things at Booking) once again has some interesting links in his thread:
https://twitter.com/isotopp/status/1468328058289631235
I can understand that, but on the other hand: even the cloud companies are apparently still on a learning curve, and I would not want to be in the shoes of the people who are fixing this right now.
So I fully agree with this #hugops:
https://twitter.com/theseanodell/status/1468257178754723842
(Others are surely better at the Tweakers markup than I am, so I'm keeping this to plain text)