Which Azure service can identify all global service issues whether or not they are in use within your account? 2021

Thủ Thuật về Which Azure service can identify all global service issues whether or not they are in use within your account? Mới Nhất

Pro đang tìm kiếm từ khóa Which Azure service can identify all global service issues whether or not they are in use within your account? 2022-10-05 03:39:20 san sẻ Thủ Thuật về trong nội dung bài viết một cách Mới Nhất.

The AWS Global Cloud Infrastructure is the most secure, extensive, and reliable cloud platform, offering over 200 fully featured services from data centers globally. Whether you need to deploy your application workloads across the globe in a single click, or you want to build and deploy specific applications closer to your end-users with single-digit millisecond latency, AWS provides you the cloud infrastructure where and when you need it.

Nội dung chính

September 2022

Post Incident Review (PIR) – Azure Front Door – Connectivity Issues (Tracking ID YV8C-DT0)

Post Incident Review (PIR) – Azure Cosmos DB – North Europe (Tracking ID 3TPC-DT8)

August 2022

Post Incident Review (PIR) –

Canonical Ubuntu issue impacted VMs and AKS (Tracking ID 2TWN-VT0)

Post Incident Review (PIR) – Datacenter power sự kiện – West US 2

(Tracking ID MMXN-RZ0)

Post Incident Review (PIR) – Azure Key Vault – Provisioning Failures (Tracking ID YLBJ-790)

Post Incident Review (PIR) – Azure Communication Services – Multiple Regions (Tracking ID YTYN-5T8)

Post Incident Review (PIR) – Network Connectivity Issues (Tracking ID 7SHM-P88)

Post Incident Review (PIR) – SQL Database – West Europe (Tracking ID 3TBL-PD8)

Which Azure service can identify all global service issues?

Which feature within Azure alerts you to service issues that happen in Azure itself?

Which Azure Management service informs you about problems with the Azure platform itself and upcoming maintenance events?

Which tool within Azure is comprised of Azure status service health and Resource health?

With millions of active customers and tens

of thousands of partners globally, AWS has the largest and most dynamic ecosystem. Customers across virtually every industry and of every size, including start-ups, enterprises, and public sector organizations, are running every imaginable use case on AWS.

This page contains root cause analyses (RCAs) of previous service issues, each retained for 5 years. From November 20, 2019, this included RCAs for all issues about which we communicated publicly. From June 1, 2022, this includes RCAs for broad issues as described in our

documentation.

September 2022

9/7

Post Incident Review (PIR) – Azure Front Door – Connectivity Issues (Tracking ID YV8C-DT0)

What happened?

Between 16:10 and 19:55 UTC on 07 Sep 2022, a subset of customers using Azure Front Door (AFD) experienced

intermittent availability drops, connection timeouts and increased latency. At its peak, this impacted approximately 25% of the traffic, and on average, 10% of the traffic that traverses through the AFD service during the impact window. Some customers may have seen higher failures if their traffic was concentrated in the edges or regions with higher impact. This could also have impacted customers’ ability to access other Azure services that leverage AFD, in particular the Azure management portal

and Azure Content Delivery Network (CDN).

What went wrong and why?

The AFD platform automatically balances traffic across our global network of edge sites. When there is a failure in any of our edge sites or an edge site becomes overloaded, traffic is automatically moved to other healthy edge sites in other regions where we have fallback capacity. It is because of this design that customers and end users don’t experience any issues in case of localized or

regionalized impact. In addition to that, we also have protections built in every single node to protect our platform from unusual traffic spikes corresponding to each domain hosted on AFD.

Between 15:15 and 16:44 UTC we observed 3 unusual traffic spikes for one of the domains hosted on AFD.

The first two spikes for this domain occurred at 15:15 UTC and 16:00 UTC on 07 of September, 2022 and were fully mitigated by AFD platform. A third spike that

occurred between 16:10 to 16:44 caused a subset of environments managing this traffic to go offline.

The first two spikes were successfully absorbed due the platform protection mechanisms, however the ones that were initiated during the third spike did not fully mitigate the unexpected increase due to the nature of the traffic pattern (different from the first two spikes). At this stage in our investigation, we believe that all three traffic spikes were malicious HTTPS flood attacks.

(Layer 7 volumetric DDOS attacks)

The malicious traffic spikes did not originate from a single region. We found that they were coming from all around the world. A combination of malicious traffic (3rd spike), large traffic ramp-up for legitimate traffic for other customer and degraded customer origins, resulted in overwhelming the resources of a few environments taking them offline and resulting in a 25% drop in overall availability during the third traffic spike

By design,

these environments will automatically recover and resume taking traffic once healthy. During this instance, users and our systems retried the requests, resulting in a larger build-up of requests. This build-up did not allow time for a subset of the environments to recover fully resulting in a subsequent 8% availability drop, for more than 3.5 hours following the traffic spike.

How did we respond?

We have automatic protection mechanisms in such events

which mitigate circa 2,000 DDoS attacks per day, and the record that we have mitigated in a day has been 4,296. (More information can be found here: https://azure.microsoft.com/en-us/blog/azure-ddos-protection-2021-q3-and-q4-ddos-attack-trends/). In addition to this, AFD platform also has in-built DDoS protection mechanisms on each node at both a system and an application

layer. These help for further mitigations in such cases. In this instance, these mechanisms significantly helped to absorb the first two spikes without any customer impact.

During the third spike, the platform protection mechanisms were partially effective, mitigating around 40% of the traffic. This significantly helped to limit global impact. For a larger duration, 8.5% of the overall AFD service, concentrated in some regions, was impacted by this issue. Some customers may have seen

higher failures if their traffic was concentrated in predominantly North America, Europe, or the APAC regions.

As our telemetry alerted us regarding impact on availability, we manually intervened. The first step was that we took manual action to further block the attack traffic. In addition, we expedited the AFD load balancing process which then enabled auto-recovery systems to work as designed. The systems worked by ensuring the most efficient load distributions in regions where

there was a large build-up of traffic. Once the environment recovered, we began to gradually bring AFD instances back trực tuyến to resume traffic management in a normal way. We were 100% globally recovered by 19:55 UTC.

How are we making incidents like this less likely or less impactful?

Although the AFD platform has built-in resiliency and capacity, we must continuously strive to improve through these lessons learned. We have a few previously planned repair items that

were inflight being either partially deployed and/or staged to be deployed. We believe that these repair items would have mitigated the third malicious traffic spike had they been in place before Sept 7th. We are now expediting these repair items that were scheduled for later this year and they should be completed in the next few weeks. These include:

Effectively tuning the protection mechanisms in the AFD nodes to mitigate the impact of this class of traffic patterns

in future. (Estimated completion September 2022)

Addressing issues identified in the current platform environment recovery process. This will reduce time to recover for each environment and will prevent environments from becoming overloaded. (Estimated completion September 2022)

Tooling to trigger ‘per customer’ failover until we have fully automated the traffic shifting mechanisms. This work is completed.

Improvements in dynamic rate limiting algorithm to ensure fairness

to legitimate traffic. (Estimated completion October 2022)

Improve existing proactive automatic communication process to notify customers more expeditiously. (Estimated completion October 2022)

How can we make our incident communications more useful?

Microsoft is piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template.

You can rate this PIR and provide any feedback using our quick 3-question

survey, https://www.aka.ms/AzPIR/YV8C-DT0

9/7

Post Incident Review (PIR) – Azure Cosmos DB – North Europe (Tracking ID 3TPC-DT8)

What happened?

Between 09:50 UTC and 17:21 UTC on 07 Sep 2022, a subset of customers using Azure Cosmos DB in North Europe may have experienced issues accessing services. Connections to Cosmos DB accounts in this region may

have resulted in an error or timeout.

Downstream Azure services that rely on Cosmos DB also experienced impact during this window – including Azure Communication Services, Azure Data Factory, Azure Digital Twins, Azure Event Grid, Azure IoT Hub, Azure Red Hat OpenShift, Azure Remote Rendering, Azure Resource Mover, Azure Rights Management, Azure Spatial Anchors, Azure Synapse, and Microsoft Purview.

What went wrong and why?

Cosmos DB load balances workloads

across its infrastructure, within frontend and backend clusters. Our frontend load balancing procedure had a regression that did not factor in the effect of a reduction in available cluster capacity, due to ongoing maintenance. This surfaced during an ongoing platform maintenance sự kiện in one of the frontend clusters in the North Europe region, causing the availability issues described above.

How did we respond?

Our monitors alerted us of the impact on

this cluster. We ran two workstreams in parallel – one focused on identifying the reason for the issues themselves, while one focused on mitigating the customer impact. To mitigate, we load balanced off the impacted cluster by moving customer accounts to healthy clusters within the region.

Given the volume of accounts we had to migrate, it took us time to safely load balance accounts – we had to analyze the state of each account individually, then systematically move each to an

alternative healthy cluster in North Europe. This load balancing operation allowed the cluster to recover to a healthy operating state.

Although we have the ability to mark a Cosmos DB region as offline (which would trigger automatic failover activities, for customers using multiple regions) we decided not to do that during this incident – as the majority of the clusters (and therefore customers) in the region were unimpacted.

How are we making incidents like this less likely

or less impactful?

Already completed:

Fixed the regression in our load balancer procedure, to safely factor in capacity fluctuations during maintenance.

In progress:

Improving our monitoring and alerting to detect these issues earlier and apply pre-emptive actions. (Estimated completion: October 2022)

Improving our processes to reduce the impact time with a more structured manual load balancing sequence during incidents. (Estimated

completion: October 2022)

How can customers make incidents like this less impactful?

Consider configuring your accounts to be globally distributed – enabling multi-region for your critical accounts would allow for a customer-initiated failover during regional service incidents like this one. For more details, refer to:

https://docs.microsoft.com/azure/cosmos-db/distribute-data-globally

More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review:

https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency

Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:

https://aka.ms/ash-alerts

How can we make our incident communications more useful?

We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey: https://www.aka.ms/AzPIR/3TPC-DT8

August 2022

8/30

Post Incident Review (PIR) –

Canonical Ubuntu issue impacted VMs and AKS (Tracking ID 2TWN-VT0)

What happened?

Between 06:00 UTC on 30 Aug 2022 and 16:00 UTC on 31 Aug 2022, customers running Ubuntu 18.04 (bionic) Virtual Machines (VMs) who had Ubuntu Unattended-Upgrades enabled received a systemd version that resulted in Domain Name System (DNS) resolution errors. This issue was confined to Ubuntu version 18.04, but impacted all Azure regions including public and sovereign clouds.

Downstream

Azure services that rely on impacted Ubuntu VMs also experienced impact during this window – including Azure Kubernetes Service (AKS), Azure Monitor, Application Insights, Log Analytics and Microsoft Sentinel. AKS customers could have experienced pod creation errors such as ImagePullBackoff as kubelet was unable to resolve DNS names of container registry. Customers may have experienced an inability to access Azure Monitor, Application Insights, Log Analytics, and/or Microsoft Sentinel log data,

and may have noticed missed or delayed Log Search alerts and/or Microsoft Sentinel alerts.

What went wrong, and why?

At 06:00 UTC on 30 August 2022, a Canonical Ubuntu security update was published – so Azure VMs running Ubuntu 18.04 (bionic) with unattended-upgrade enabled started to tải về and install the new packages, including systemd version 237-3ubuntu10.54. This led to a loss of their DNS configurations due to a race-condition bug:

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119.

The manifestation of this bug was triggered due to the combination of this and a previous update. This bug only affects systems using a driver name to identify the proper Network Interface Card (NIC) in their network configuration, which is why this issue impacted Azure uniquely and not other major cloud providers. This resulted in DNS

resolution failures and network connectivity issues for Azure VMs running Ubuntu 18.04 (bionic). As a result, other services dependent on these VMs were impacted by the same DNS resolution issues.

When unattended-upgrades are enabled, security updates are automatically downloaded and applied once per day by default. Considering their criticality, security updates like these do not go through our Safe Deployment Practices (SDP) process. However, we are reviewing this process to ensure that

we minimize customer impact during incidents like these.

How did we respond?

Multiple Azure teams detected the issue shortly after the packages were published via production alerts, including our AKS and Azure Container Apps service teams. Upon investigation, we identified the root cause as the bug in Ubuntu mentioned above, and began engaging other teams to explore appropriate mitigations. During this time, incoming customer tư vấn cases describing the issue

validated that the issues were limited to the Ubuntu versions described above.

There were multiple mitigation and remediation steps, several of which were completed in partnership with Canonical / Ubuntu:

This bug and a potential fix have been highlighted on the Canonical / Ubuntu website, which we encouraged impacted customers to read (linked above).

For impacted Azure VM instances, we recommended that customers reboot the VM(s) or, if reboot was not an option, run a

script to fix it. Azure provided a template script but encouraged customers to test and modify the script as needed before applying: https://github.com/mcgov/az_scripts/blob/main/az_fix_dns_resolve.sh.

For the impact to AKS nodes, our AKS team developed an automatic detection and remediation solution then rolled this out across all regions. This resolved the vast majority of customers,

others required manual mitigations through tư vấn.

Additional downstream impact to other services was addressed through similar remediation steps to address the bug in their specific Ubuntu version, so some of these services recovered prior to the final mitigation time above.

How are we making incidents like this less likely or less impactful?

Already completed:

Improved monitoring of AKS data plane for alerts on the uptick of errors.

Reviewed AKS monitoring algorithms to help ensure detection and alerting upon similar VM errors that were experienced in this scenario.

Established an improved escalation path to Canonical during outages and include that in our internal Technical Service Guides (TSGs).

Short term:

AKS will take full control over the security patch mechanism, versus shared control with Canonical today. This includes additional testing and controlled release of these patches done

directly by AKS. (Estimated completion: December 2022).

Medium term:

For IaaS VMs, we are working to engage with Canonical to run dedicated tests on proposed packages before they are published for Azure users.

Longer-term:

AKS will establish a process with Canonical to close the testing gap for the upgrade scenario. (Estimated completion: March 2023).

AKS will provide customers with maintenance window control and environment staging for these

patches when they are deemed safe to release. (Estimated completion: December 2023).

How can our customers and partners make incidents like this less impactful?

AKS customers can ensure their nodes are up to date and exclusively leveraging Microsoft’s supply chain by using automatic node image upgrade:

https://docs.microsoft.com/en-us/azure/aks/auto-upgrade-cluster.

For IaaS VMs, customers can choose to configure VMs to use the following service to get Ubuntu patches here: https://docs.microsoft.com/en-us/azure/virtual-machines/automatic-vm-guest-patching. This service will limit

the blast radius of VMs seeing the patches from Ubuntu and do an orchestrated update of the VMs. The service also provides health monitoring and to detect issues, users can provide app health signals through app health extension. Please refer to: https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-health-extension.

How can we make our incident communications more useful?

Microsoft is piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template.

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/2TWN-VT0

8/27

Post Incident Review (PIR) – Datacenter power sự kiện – West US 2

(Tracking ID MMXN-RZ0)

What happened?

Between 02:47 UTC on 27 Aug 2022 and 02:00 UTC on 28 Aug 2022, a subset of customers experienced failures when trying to access resources hosted in the West US 2 region. Although initially triggered by a utility power outage that affected all of our datacenters in the region, the vast majority of our backup power systems performed as designed to prevent impact. Failures of a small number of backup power systems led to

customer impact in two datacenters. Most customers were recovered by 07:00 UTC on 27 Aug 2022, but small subsets of resources required manual recovery – with the final set being brought back trực tuyến by 02:00 UTC on 28 Aug 2022.

During this impact window, several downstream Azure services that were dependent on impacted infrastructure also experienced issues – including Storage, Virtual Machines, App Services, Application Insights, Azure Database for PostgreSQL, Azure Red Hat

OpenShift, Azure Search, Azure SQL DB, Backup, Data Explorer, ExpressRoute, and NetApp Files.

What went wrong, and why?

On August 27 at 02:47 UTC, we identified a power sự kiện that caused impact to a number of storage and compute scale units in the West US 2 region. The West US 2 region is made up of 10+ datacenters, spread across three Availability Zones on multiple campuses. During this sự kiện, the whole region experienced a utility power outage,

impacting all datacenters in the region. A failure on major distribution lines caused at least two substations to lose power. That resulted in loss of utility power across a broad area that included all three Availability Zones in the West US 2 region.

In all datacenters except two, our backup power systems performed as designed, transitioning all infrastructure to run briefly on batteries and then on generator power. But in two separate datacenters, two unique but unrelated issues

occurred that prevented some of the servers in each datacenter from transitioning to generator power. Since these two datacenters were in two different Availability Zones, customers may have been impacted by both.

In the first datacenter, impact was caused when a small number of server rack level Uninterruptible Power Supply (RUPS) systems failed to stay trực tuyến during the transition to generator, creating a momentary loss of power to the servers. These servers were immediately re-energized

once backup generators started and supported the load.

In the second datacenter, several Primary UPS systems (approximately 12% of the total UPS systems in the datacenter) failed to tư vấn the load during the transition to generator, due to UPS battery failures. As a result, the downstream servers lost power until the UPS faults could be cleared and put back trực tuyến with utility supply.

The initial trigger to this sự kiện was when a high voltage static wire (used to help protect

transmission lines against lightning strikes) failed. When the static wire failed, it created a voltage surge on the 230kV lines, causing breakers at two substations (approximately 30 miles apart) within the utility power grid to open. The root cause of the static wire failures is still under investigation by the utility provider.

How did we respond?

This sự kiện was first detected by our EPMS (Electrical Power Monitoring System) in West US 2, which in turn

notified our datacenter team of the utility loss issue, and then of equipment failure issues. While the vast majority of datacenters transitioned to backup power without issue, two specific datacenters experienced different UPS issues described above that prevented a full transition to backup power sources.

Due to the nature of this sự kiện, the team followed our Emergency Operations Procedure (EOP) to manually restore Mechanical, Electrical, Plumbing (MEP) equipment to its operational

state. Once the MEP was returned to an operational state, the racks began to recover. The Public Utility Department (PUD) was able to close their breakers and restore utility power to our datacenters by 03:48 UTC. This enabled the datacenter teams to begin the recovery of the affected equipment and restoration of power to the impacted racks. By 04:46 UTC, power was fully restored to all affected racks, and services continued their recovery.

Four Azure Storage scale units were impacted by

the power loss (one Standard, two Premium, one Ultra Disk scale unit) resulting in the data hosted on these becoming inaccessible until power was restored and the scale units recovered to healthy states. The Standard Storage scale unit was fully available by 07:45 UTC, although the vast majority of clients would have seen availability restored by 06:05 UTC. The two Premium Storage scale units were restored by 0510 UTC. Due to a software bug (the fix for which is already in our deployment

pipeline) a small subset of disk requests (<0.5%) may have encountered further errors through 07:30 UTC. Due to a combination of hardware failures and software bugs, the Ultra Disk scale unit was not fully available until 21:40 UTC on 8/28. The majority of the data (> 99.9%) was available by 05:15 UTC on 08/27.

Impacted Azure compute scale sets were brought back trực tuyến – mostly automatically after storage recovered, but a subset of infrastructure and customer VMs required manual

mitigations to ensure they came back trực tuyến successfully. VMs that were using the Trusted Launch feature, in particular, did not automatically recover and required engineering team intervention to restore – all of these VMs were restored to a functional state by 00:20 UTC on 8/28.

How are we making incidents like this less likely or less impactful?

Already completed:

We have completed detailed inspections on our generator systems and all generators

are in good operating condition.

UPS systems have been inspected and all components are operating and functioning per design specifications. This inspection highlighted that battery replacement is required.

After reviewing the entire line up, batteries have been replaced in the UPS units that experienced failures.

Short term:

Complete deployment of the Storage software fix for the bug that caused a small tail of errors following Premium Storage scale

unit recovery.

Complete deployment of the software fix for the trusted VM feature, which caused some VMs not to come back trực tuyến automatically after storage recovery.

Longer term:

We are working on several platform, telemetry, and process improvements that will reduce Ultra Disk replica recovery time.

We are improving our VM migration times to healthy hosts for faster recovery. This includes telemetry improvements to identify long running/stuck migration

operations to identify issues more quickly.

Microsoft uses multiple equipment vendors and designs – findings will be reviewed against our global fleet and, where necessary, applied beyond the impacted datacenters.

Process failure mode effects analysis (PFMEA) review of the processes utilized during the sự kiện, applying lessons learned and improving our methodology. This includes assessing human touch points, working to engineer out or automate systems for smoother transition or

recovery. Findings will also be applied to our Tabletop/GameDay exercises, ensuring team members are familiar and prepared to respond.

How can customers and partners make incidents like this less impactful?

While Availability Zones are designed in ways to reduce correlated failures, they can still occur. We encourage customer and partner Business Continuity & Disaster Recovery (BCDR) plans to include the ability to failover between regions, in case of

a region-wide incident. While the likelihood of failure decreases with the magnitude of the failure, it never goes to zero: https://docs.microsoft.com/azure/availability-zones/cross-region-replication-azure

Consider which are the right Storage redundancy options for your critical applications. Geo-redundant storage (GRS) enables account level failover in case the primary

region endpoint becomes unavailable: https://docs.microsoft.com/azure/storage/common/storage-redundancy

More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review:

https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency

Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:

https://aka.ms/ash-alerts

How can we make our incident communications more useful?

We are piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template.

You can rate this PIR and provide any feedback using our quick 3-question survey: https://www.aka.ms/AzPIR/MMXN-RZ0

8/18

Post Incident Review (PIR) – Azure Key Vault – Provisioning Failures (Tracking ID YLBJ-790)

What happened?

Between 16:30 UTC on 18 Aug 2022 and 02:22 UTC on 19 Aug 2022, a platform issue caused Azure offerings such as Bastion, ExpressRoute, Azure Container Apps, Azure ML, Azure Managed HSM, Azure Confidential VMs, Azure Database Services (MySQL – Flexible Server, Postgres- Flexible Server, PostgreSQL – Hyperscale) to experience provisioning failures

globally. This issue impacted customers that relied on provisioning of certificates as part of provisioning of an Azure resource. This write-up is a Post Incident Review (PIR) we are providing to summarize what went wrong, how we responded, and the steps Microsoft is taking to learn from this and improve.

What went wrong, and why?

The requesting authority for Azure Key Vault (the underlying platform, on which all the described services rely for the creation of

certificate resources) was experiencing high latency and volume of requests. This resulted in provisioning failures for the impacted services, as those services were not able to acquire certificates within the expected time. During the incident, a backend service that Azure Key Vault relies on, became unhealthy due to an unexpected spike in traffic during scheduled hardware maintenance, which caused a build-up of requests in the queue, resulting in high latencies to fulfil new certificate

creation requests.

How did Microsoft respond?

We developed and deployed a hotfix to increase the throughput, created new queues for request processing, and drained the queue of accumulated requests to alleviate the overall latency and process requests as expected.

How is Microsoft making incidents like this less likely, or at least less impactful?

• In the short-term, we are implementing request caps and partitioning the request

queues, to help prevent lasting failures in the service in similar scenarios.

• We are also reviewing the backend capacity and gaps in the maintenance process that led to the loss of availability during this maintenance operation.

• Based on our learning from this incident, we are implementing improvements to our health monitoring and operational guidance that would help reduce the time to detect similar issues and allow us to address similar issues before customers experience

impact.

• In the longer term, we are working to add fine-grained distributed throttling and portioning, to add additional isolation layers to the backend of this service, which will minimize impact in similar scenarios.

• Finally, we will work to add more Availability Zones and fault domains in all layers of the stack, along with automatic failover to the service, to help prevent disruption to customer workloads.

How can we make our incident communication more

useful?

We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/YLBJ-790

8/12

Post Incident Review (PIR) – Azure Communication Services – Multiple Regions (Tracking ID YTYN-5T8)

What happened?

Between 18:13 UTC on 12 Aug 2022 and 03:30 UTC on 13 Aug 2022, customers using Azure Communication Services (ACS) may have experienced authentication failures, or failures using our APIs. As a result, multiple scenarios may have been impacted including SMS, Chat, E-Mail, Voice & Video scenarios, Phone Number Management, and Teams-ACS Interop.

What went wrong, and why?

An Azure resource provider provides the ability for customers to create and maintain

resources, in this case, for ACS. The ACS resource provider utilizes backend Cosmos DB instances for resource metadata persistence. Prior to the incident, an increased volume of data-plane related requests was made by the resource provider to the database, which met database throughput limits. At 18:00 UTC on 12 Aug 2022, to meet the increased demand of requests, the database processing capacity was increased. This change in database capacity inadvertently exposed a latent code bug for the

resource provider, which resulted in a functional difference in the number of database results being returned against what could be processed by the resource provider. ACS is a globally distributed service and the metadata being retrieved was required for routing calls across different regions for the authentication process. This resulted in ACS authentication failures, and subsequently caused SMS, Chat, Voice & Video, Phone Number Management, and Teams-ACS Interop scenarios to fail.

How did we respond?

Automated alerting indicated several failures for different ACS API requests made by customers. We immediately investigated with multiple engineering teams, however understanding the nature of the issue took time because specific fields used for debugging Cosmos DB issues were not being logged for successful queries. Due to the service configuration, a rollback of the change to the database instance would not have been supported. Once the

underlying issue was identified, we developed code fixes to resolve the issue. We validated and deployed the fix using our Safe Deployment Practices, in phases. The hotfix was fully rolled out at 03:30 UTC on 13 Aug 2022, with customers reporting successful operation shortly thereafter.

How are we making incidents like this less likely or less impactful?

Completed:

We’ve completed code updates to address the latent bug and help ensure the resource provider

can process all results in similar scenarios.

We’ve added additional logging of backend database requests for the ACS resource provider, to ensure improved traceability in future.

We have added additional gates for database configuration updates and hardening for our processes when applying such updates. We have mirrored all production configuration templates in our pre-production environment to allow validation of configuration updates before they get deployed to

production.

We have completed additional Failure Mode Analyses (FMA) across different ACS features. We have created repair items for a resilient service architecture to improve failure recovery time.

How can we make our incident communications more useful?

We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question

survey: https://aka.ms/AzPIR/YTYN-5T8

July 2022

7/29

Post Incident Review (PIR) – Network Connectivity Issues (Tracking ID 7SHM-P88)

What happened?

Between 08:00 UTC and until 13:20 UTC on 29 July 2022, customers may have experienced connectivity issues such as network drops, latency, and/or degradation when attempting to access or

manage Azure resources in multiple regions.

The most significant impact would have been experienced in the following regions – Brazil South, Canada Central, East Asia, East US, East US 2, France Central, nhật bản East, Korea Central, North Central US, South Africa North, South Central US, Southeast Asia, West Europe, and West US. Customers in other regions may have seen an intermittent impact when accessing resources across the Microsoft wide area network (WAN).

What went wrong,

and why?

Starting at 08:00 UTC on 29 July, the Azure WAN began to experience a sudden and significant increase of traffic, upwards of 60 Tbps in additional traffic compared to the normal levels of traffic carried on the network.

While the sự kiện was detected immediately and automated remediation was triggered, the substantial increased bursts of traffic occurring throughout the sự kiện affected the ability of automated mitigations to continue providing the necessary relief to

the network. WAN routers then became overloaded and dropped network packets, which resulted in network connectivity issues experienced by some customers.

This sự kiện included impact to both intra-region and cross-region traffic over various network paths, which included ExpressRoute.

Our investigation of this sự kiện continues, ensuring our diagnosis of contributing factors is complete and mitigations for this class of incident are finalized. The remaining workstreams are expected to be

finished within two months. We will update the status page and Azure Service Health when completed.

How did we respond?

We have several detection and mitigation algorithms that were triggered automatically around 08:00 UTC when an increased burst of traffic occurred. The volume of traffic surges continued to substantially increase, reaching 10-15 times greater than any traffic volume experienced on the network prior.

While the mitigation mechanisms were

successfully triggered to load balance and throttle the traffic bursts to help prevent impact, the significance of traffic on the WAN routers resulted in these mechanisms to take longer to alleviate the traffic surges and restore traffic back to normal levels.

By 13:20 UTC, traffic levels returned to normal as network telemetry confirmed packet drops had reduced to standard levels, which is when customers would have seen resource and service network health restored.

How are

we making incidents like this less likely or less impactful?

We are implementing service repairs because of this incident, including but not limited to:

Already Completed:

Additional alerting for specific packet drops signature caused by significant traffic bursts.

Work in Progress:

Improvements to network device capabilities to help reduce packet drops when handling significant traffic bursts.

Changes to the network design for

traffic spike detection to help reduce the time to mitigate for similar events.

Improve network incident response playbook to better streamline preventative actions performed for similar events.

Apply additional layers of network throttling to help protect the network reliability when increased traffic surges occur.

How did we communicate with impacted customers?

Starting around 11:00 UTC, we began to receive some reports of a potential

emerging issue. As signals continued to gradually increase, we posted an initial statement to the Azure status page at 11:52 UTC.

Delays in communications via Service Health in the portal were primarily due to challenges gauging the extent of impact and affected regions as limited telemetry of the networking sự kiện developing did not clearly indicate a viable scope of impact. Though other signals via internal and external reports indicated a likely platform sự kiện ongoing, the disparity of

signals deterred targeted notifications until a broad networking issue was determined.

Communications were sent via Azure Service Health for Azure services that started to report impact, where were later determined to be affected by the networking sự kiện. With further analysis and evidence of regional impact confirmed, broad targeted communication was sent to customers region-wide for the identified affected regions by 13:28 UTC.

Between 13:28 and 15:32 UTC, communications were

sent to the customers of additional impacted regions identified.

By 15:32 UTC, we began reporting recovery via the status page and Service Health, but monitoring and preventative workstreams persisted, which we continued to report until the necessary preventative workstreams were completed by 19:52 UTC.

How can we make our incident communications more useful?

We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis)

format.

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/7SHM-P88

7/21

Post Incident Review (PIR) – SQL Database – West Europe (Tracking ID 3TBL-PD8)

What happened?

Between 03:47 UTC and 13:30 UTC on 21 Jul 2022, customers using SQL Database and SQL Data Warehouse in West Europe may have experienced

issues accessing services. During this time, new connections to databases in this region may have resulted in errors or timeouts. Existing connections would have remained available to accept new requests, however if those connections were terminated and then re-established, they may have failed.

New connections to the region and related management operations began failing from 03:47 UTC, partial recovery began at 06:12 UTC, with full mitigation at 13:30 UTC. Although we initially did not

declare mitigation until 18:45 UTC, a thorough impact analysis confirms that failure rates had returned to pre-incident levels earlier. No failures that occurred after 13:30 UTC were directly as a result of this incident.

During this impact window, several downstream Azure services that were dependent on the SQL Database service in the region were also impacted – including App Services, Automation, Backup, Data Factory V2, and Digital Twins.

Customers that had configured active

geo-replication and failover groups would have been able to recover by performing a forced-failover to the configured geo-replica – more information can be found here https://docs.microsoft.com/en-us/azure/azure-sql/database/active-geo-replication-overview?view

What went wrong, and why?

For context, connections to the Azure SQL Database service

are received and routed by regional gateway clusters. Each region has multiple gateway clusters for redundancy – traffic is distributed evenly between the clusters under normal operations, and automatically rerouted in case one of the clusters becomes unhealthy. Each gateway cluster has a persisted cache of metadata about each database in the system, that is used for connection routing. These caches are used for scaling-out gateway nodes, to avoid contention on a single source of metadata. There

are multiple caches per gateway cluster and each node will fetch data from any of the caches that is available. The West Europe region has two gateway clusters, and each of these clusters has two persisted metadata caches.

An operator error led to an incorrect action being performed in close sequence on all four persisted metadata caches. The action resulted in a configuration change that made the caches unavailable to the regional gateway processes. This resulted in all regional gateway

processes in West Europe becoming unable to access connection routing metadata, leading to the regional incident from 03:47 UTC. New connections would have failed as the gateways were not able to read routing metadata, but connections that were already established would have continued to work. Management operations on server and database resources would also have been impacted, as some workflows also rely on connection routing.

A secondary impact of the issue was that our internal

telemetry service in the West Europe region became overloaded with queries. This caused the telemetry ingestion to fall behind by a few hours and telemetry queries were also timing out. The telemetry issues contributed to delays in automatically notifying impacted customer subscriptions via Azure Service Health.

As some customers were receiving automatic notifications of impact within 15 minutes, we assumed that the notification pipeline was working as designed. It was later in the sự kiện

when we understood that communications were not reaching all impacted subscriptions. As a result, we broadened our communications to every customer in the region and published an update on the Azure status page.

Additionally, Automatic failover for anyone who had setup failover groups with auto-failover configuration was also impacted due to telemetry issues (manual failover was not impacted).

How did we respond?

This regional incident was detected by our

availability monitors, and we were on the investigation bridge within 13 minutes of customer impact. We understood the issue to the action that was performed erroneously and determined a way to reverse it. Another option would have been to rebuild entirely new caches – but it was determined that this rebuild would take much longer than fixing the caches in-place, so we proceeded to formulate the method to revive the caches in-place.

On applying this initial mitigation, the caches came back

up, which resulted in a partial recovery of the incident at 06:18 UTC. While success rates improved significantly at this point (~60%) the recovery was considered ‘partial’ due to two reasons. Firstly, a timing issue in applying mitigation caused gateways in one of the two clusters to cache incorrect cache connection strings. Secondly, the metadata caches were not receiving updates for changes that happened while the caches were unavailable.

The first issue was mitigated by restarting all

the gateway nodes in the cluster, which needed to be done at a measured pace to avoid overloading the recovering metadata caches. As the restarts progressed, we saw success rates continue to improve, steadily reaching 97% around 07:58 UTC, once all restarts had completed. At this point connections to any database that had not undergone changes (i.e., service tier updates) during the incident would have been successful.

The last step was to determine which persistent cache entries were

stale (missed updates) and refresh them to a consistent state. We developed and executed a script to refresh cache entries, with the initial refreshes being done manually while the script was being developed. The success rate recovered to 99.9% for the region at 11:10 UTC. We then proceeded to identify and mitigate any residual issues, and also started the process to confirm recovery with customers and downstream impacted Azure services.

Based on login success rate telemetry, the incident

mitigation time was determined to be 13:30 UTC. Mitigation communications were sent out to all impacted customers at 19:16 UTC, after a thorough validation that no residual impact remained.

How are we making incidents like this less likely or less impactful?

We are implementing a number of service repairs as a result of this incident, including but not limited to:

Completed:

Programmatically blocking any further executions of the action that led to

the metadata caches becoming unavailable.

In progress:

Implementing stronger guardrails on impactful operations to prevent human errors like the one that triggered this incident.

Implementing in-memory caching of connection routing metadata in each gateway process, to further increase resiliency and scalability.

Implementing throttling on telemetry readers to prevent ingestion from falling behind.

Removing dependency of automatic-failover on

telemetry system.

Investigating other service resiliency repairs as determined by our internal retrospective of this incident, which is ongoing.

How can our customers and partners make incidents like this less impactful?

Customers who had configured active geo-replication and failover groups would have been able to recover by performing a forced-failover to the configured geo-replica.

More guidance for recovery in regional failure scenarios is

available at: https://docs.microsoft.com/en-us/azure/azure-sql/database/disaster-recovery-guidance

How can we make our incident communications more useful?

We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey:

https://aka.ms/AzPIR/3TBL-PD8

Which Azure service can identify all global service issues?

You’ll want to dive into Azure Monitor to see if you can identify any issues on your end. Azure Monitor gives you a way to collect, analyze, and act on all the telemetry from your cloud and on-premises environments. These insights can help you maximize the availability and performance of your applications.

Which feature within Azure alerts you to service issues that happen in Azure itself?

Azure Service Health notifies you about Azure service incidents and planned maintenance so you can take action to mitigate downtime.

Which Azure Management service informs you about problems with the Azure platform itself and upcoming maintenance events?

Azure Service Health helps you stay informed and take action when Azure service issues like outages and planned maintenance affect you. It provides you with a personalized dashboard that can help you understand issues that may be impacting resources in your Azure subscriptions.

Which tool within Azure is comprised of Azure status service health and Resource health?

Azure Service Health is a suite of experiences that provide personalized guidance and tư vấn when issues in Azure services are or may affect you in the future. Azure Service Health is composed of Azure status, the service health service, and Resource Health.

Tải thêm tài liệu tương quan đến nội dung bài viết Which Azure service can identify all global service issues whether or not they are in use within your account?

Cryto

Eth

Azure Advisor

Azure alerts

Data observability Azure

Review Which Azure service can identify all global service issues whether or not they are in use within your account? ?

Một số hướng dẫn một cách rõ ràng hơn về Review Which Azure service can identify all global service issues whether or not they are in use within your account? tiên tiến và phát triển nhất .

ShareLink Tải Which Azure service can identify all global service issues whether or not they are in use within your account? miễn phí

Bạn đang tìm một số trong những Chia SẻLink Download Which Azure service can identify all global service issues whether or not they are in use within your account? Free.

#Azure #service #identify #global #service #issues #account

Which Azure service can identify all global service issues whether or not they are in use within your account? 2021

Thủ Thuật về Which Azure service can identify all global service issues whether or not they are in use within your account? Mới Nhất

September 2022

Post Incident Review (PIR) – Azure Front Door – Connectivity Issues (Tracking ID YV8C-DT0)

Post Incident Review (PIR) – Azure Cosmos DB – North Europe (Tracking ID 3TPC-DT8)

August 2022

Post Incident Review (PIR) –

Canonical Ubuntu issue impacted VMs and AKS (Tracking ID 2TWN-VT0)

Post Incident Review (PIR) – Datacenter power sự kiện – West US 2

(Tracking ID MMXN-RZ0)

Post Incident Review (PIR) – Azure Key Vault – Provisioning Failures (Tracking ID YLBJ-790)

Post Incident Review (PIR) – Azure Communication Services – Multiple Regions (Tracking ID YTYN-5T8)

July 2022

Post Incident Review (PIR) – Network Connectivity Issues (Tracking ID 7SHM-P88)

Post Incident Review (PIR) – SQL Database – West Europe (Tracking ID 3TBL-PD8)

Which Azure service can identify all global service issues?

Which feature within Azure alerts you to service issues that happen in Azure itself?

Which Azure Management service informs you about problems with the Azure platform itself and upcoming maintenance events?

Which tool within Azure is comprised of Azure status service health and Resource health?

Review Which Azure service can identify all global service issues whether or not they are in use within your account? ?

ShareLink Tải Which Azure service can identify all global service issues whether or not they are in use within your account? miễn phí

Đăng nhận xét

Biểu mẫu liên hệ

Which Azure service can identify all global service issues whether or not they are in use within your account? 2021

Thủ Thuật về Which Azure service can identify all global service issues whether or not they are in use within your account? Mới Nhất

September 2022

Post Incident Review (PIR) – Azure Front Door – Connectivity Issues (Tracking ID YV8C-DT0)

Post Incident Review (PIR) – Azure Cosmos DB – North Europe (Tracking ID 3TPC-DT8)

August 2022

Post Incident Review (PIR) –Canonical Ubuntu issue impacted VMs and AKS (Tracking ID 2TWN-VT0)

Post Incident Review (PIR) – Datacenter power sự kiện – West US 2(Tracking ID MMXN-RZ0)

Post Incident Review (PIR) – Azure Key Vault – Provisioning Failures (Tracking ID YLBJ-790)

Post Incident Review (PIR) – Azure Communication Services – Multiple Regions (Tracking ID YTYN-5T8)

July 2022

Post Incident Review (PIR) – Network Connectivity Issues (Tracking ID 7SHM-P88)

Post Incident Review (PIR) – SQL Database – West Europe (Tracking ID 3TBL-PD8)

Which Azure service can identify all global service issues?

Which feature within Azure alerts you to service issues that happen in Azure itself?

Which Azure Management service informs you about problems with the Azure platform itself and upcoming maintenance events?

Which tool within Azure is comprised of Azure status service health and Resource health?

Review Which Azure service can identify all global service issues whether or not they are in use within your account? ?

ShareLink Tải Which Azure service can identify all global service issues whether or not they are in use within your account? miễn phí

Đăng nhận xét

Biểu mẫu liên hệ

Post Incident Review (PIR) –

Canonical Ubuntu issue impacted VMs and AKS (Tracking ID 2TWN-VT0)

Post Incident Review (PIR) – Datacenter power sự kiện – West US 2

(Tracking ID MMXN-RZ0)