Tips on: Which Azure service can identify all global service issues, whether or not they are in use within your account? (Latest)
You are searching for the keyword “Which Azure service can identify all global service issues whether or not they are in use within your account?”. This article, published 2022-10-05 03:39:20, shares the latest guidance on the topic.
This page contains root cause analyses (RCAs) of previous service issues, each retained for 5 years. From November 20, 2019, this included RCAs for all issues about which we communicated publicly. From June 1, 2022, this includes RCAs for broad issues as described in our documentation.

September 2022

9/7 Post Incident Review (PIR) – Azure Front Door – Connectivity Issues (Tracking ID YV8C-DT0)

What happened? Between 16:10 and 19:55 UTC on 07 Sep 2022, a subset of customers using Azure Front Door (AFD) experienced connectivity issues.

What went wrong and why? The AFD platform automatically balances traffic across our global network of edge sites. When there is a failure in any of our edge sites, or an edge site becomes overloaded, traffic is automatically moved to other healthy edge sites in other regions where we have fallback capacity. It is because of this design that customers and end users don’t experience any issues in the case of localized or regional failures. Between 15:15 and 16:44 UTC we observed three unusual traffic spikes for one of the domains hosted on AFD.

How did we respond? We have automatic protection mechanisms for such events. During the third spike, the platform protection mechanisms were partially effective, mitigating around 40% of the traffic, which significantly helped to limit global impact. For a larger duration, 8.5% of the overall AFD service, concentrated in some regions, was impacted by this issue. Some customers may have seen intermittent connectivity failures. As our telemetry alerted us to the impact on availability, we manually intervened. The first step was to take manual action to further block the attack traffic. In addition, we expedited the AFD load balancing process, which then enabled auto-recovery systems to work as designed. The systems worked by ensuring the most efficient load distribution in the regions where impact was concentrated.

How are we making incidents like this less likely or less impactful? Although the AFD platform has built-in resiliency and capacity, we must continuously strive to improve through these lessons learned. We have a few previously planned repair items that we are expediting as a result of this incident.

How can we make our incident communications more useful? Microsoft is piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template. You can rate this PIR and provide any feedback using our quick 3-question survey.

9/7 Post Incident Review (PIR) – Azure Cosmos DB – North Europe (Tracking ID 3TPC-DT8)

What happened? Between 09:50 UTC and 17:21 UTC on 07 Sep 2022, a subset of customers using Azure Cosmos DB in North Europe may have experienced issues accessing services. Connections to Cosmos DB accounts in this region may have failed or timed out. Downstream Azure services that rely on Cosmos DB also experienced impact during this window – including Azure Communication Services, Azure Data Factory, Azure Digital Twins, Azure Event Grid, Azure IoT Hub, Azure Red Hat OpenShift, Azure Remote Rendering, Azure Resource Mover, Azure Rights Management, Azure Spatial Anchors, Azure Synapse, and Microsoft Purview.

What went wrong and why? Cosmos DB load balances workloads across the clusters in the region.

How did we respond? Our monitors alerted us of the impact on availability. Given the volume of accounts we had to migrate, it took us time to safely load balance accounts – we had to analyze the state of each account individually, then systematically move each to an unimpacted cluster. Although we have the ability to mark a Cosmos DB region as offline (which would trigger automatic failover activities, for customers using multiple regions), we decided not to do that during this incident, as the majority of the clusters (and therefore customers) in the region were unimpacted.

How are we making incidents like this less likely or less impactful? Already completed:
In progress:
How can customers make incidents like this less impactful? Consider configuring your accounts to be globally distributed – enabling multi-region for your critical accounts would allow for a customer-initiated failover during regional service incidents like this one. More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review. Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more.

How can we make our incident communications more useful? We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format. You can rate this PIR and provide any feedback using our quick 3-question survey: https://www.aka.ms/AzPIR/3TPC-DT8

August 2022

8/30 Post Incident Review (PIR) – Ubuntu 18.04 DNS resolution errors (Tracking ID 2TWN-VT0)

What happened? Between 06:00 UTC on 30 Aug 2022 and 16:00 UTC on 31 Aug 2022, customers running Ubuntu 18.04 (bionic) Virtual Machines (VMs) who had Ubuntu Unattended-Upgrades enabled received a systemd version that resulted in Domain Name System (DNS) resolution errors. This issue was confined to Ubuntu version 18.04, but impacted all Azure regions, including public and sovereign clouds. Downstream services that depend on these VMs were also affected.

What went wrong, and why? At 06:00 UTC on 30 August 2022, a Canonical Ubuntu security update was published – so Azure VMs running Ubuntu 18.04 (bionic) with unattended-upgrade enabled started to download and install the new packages, including systemd version 237-3ubuntu10.54. This led to a loss of their DNS configurations due to a race-condition bug. The manifestation of this bug was triggered by the combination of this and a previous update. This bug only affects systems using a driver name to identify the proper Network Interface Card (NIC) in their network configuration, which is why this issue impacted Azure uniquely and not other major cloud providers. This resulted in DNS resolution failures on affected VMs. When unattended-upgrades are enabled, security updates are automatically downloaded and applied once per day by default (a customer-side check for this is sketched under the customer guidance below). Considering their criticality, security updates like these do not go through our Safe Deployment Practices (SDP) process. However, we are reviewing this process to ensure that similar issues can be caught or mitigated more quickly.

How did we respond? Multiple Azure teams detected the issue shortly after the packages were published via production alerts, including our AKS and Azure Container Apps service teams. Upon investigation, we identified the root cause as the bug in Ubuntu mentioned above, and began engaging other teams to explore appropriate mitigations. During this time, incoming customer support cases describing the issue continued to increase. There were multiple mitigation and remediation steps, several of which were completed in partnership with Canonical / Ubuntu:
How are we making incidents like this less likely or less impactful? Already completed:
Short term:
Medium term:
Longer-term:
How can our customers and partners make incidents like this less impactful?
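The details above note that, when unattended-upgrades is enabled, security updates are downloaded and applied once per day by default, and that the problematic package was systemd 237-3ubuntu10.54. One practical customer-side check is to report whether unattended-upgrades is enabled and which systemd version is installed, so that change-sensitive fleets can decide whether to hold a package or disable automatic upgrades. The following is a minimal Python sketch, assuming a stock Ubuntu 18.04 image where the standard /etc/apt/apt.conf.d/20auto-upgrades file and the dpkg-query tool are present.

#!/usr/bin/env python3
"""Minimal sketch: report unattended-upgrades status and the installed
systemd version on an Ubuntu 18.04 VM (assumes a stock Ubuntu image)."""
import re
import subprocess
from pathlib import Path

AUTO_UPGRADES = Path("/etc/apt/apt.conf.d/20auto-upgrades")

def unattended_upgrades_enabled() -> bool:
    """True if APT::Periodic::Unattended-Upgrade is set to a non-zero value."""
    if not AUTO_UPGRADES.exists():
        return False
    match = re.search(r'APT::Periodic::Unattended-Upgrade\s+"(\d+)"',
                      AUTO_UPGRADES.read_text())
    return bool(match) and match.group(1) != "0"

def installed_systemd_version() -> str:
    """Ask dpkg for the installed systemd package version (e.g. 237-3ubuntu10.54)."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Version}", "systemd"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(f"unattended-upgrades enabled: {unattended_upgrades_enabled()}")
    print(f"installed systemd version:   {installed_systemd_version()}")

A check like this can be paired with apt-mark hold (or with disabling unattended upgrades on fleets that require change control) until an updated package has been validated.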
How can we make our incident communications more useful? Microsoft is piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template. You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/2TWN-VT0

8/27 Post Incident Review (PIR) – Datacenter power event – West US 2 (Tracking ID MMXN-RZ0)

What happened? Between 02:47 UTC on 27 Aug 2022 and 02:00 UTC on 28 Aug 2022, a subset of customers experienced failures when trying to access resources hosted in the West US 2 region. Although the incident was initially triggered by a utility power outage that affected all of our datacenters in the region, the vast majority of our backup power systems performed as designed to prevent impact. Failures of a small number of backup power systems led to the customer impact. During this impact window, several downstream Azure services that were dependent on impacted infrastructure also experienced issues – including Storage, Virtual Machines, App Services, Application Insights, Azure Database for PostgreSQL, and Azure Red Hat OpenShift.

What went wrong, and why? On August 27 at 02:47 UTC, we identified a power event that caused impact to a number of storage and compute scale units in the West US 2 region. The West US 2 region is made up of 10+ datacenters, spread across three Availability Zones on multiple campuses. During this event, the whole region experienced a utility power outage. In all datacenters except two, our backup power systems performed as designed, transitioning all infrastructure to run briefly on batteries and then on generator power. But in two separate datacenters, two unique but unrelated issues occurred. In the first datacenter, impact was caused when a small number of server rack level Uninterruptible Power Supply (RUPS) systems failed to stay online during the transition to generator, creating a momentary loss of power to the servers. These servers were immediately re-energized. In the second datacenter, several Primary UPS systems (approximately 12% of the total UPS systems in the datacenter) failed to support the load during the transition to generator, due to UPS battery failures. As a result, the downstream servers lost power until the UPS faults could be cleared and put back online with utility supply. The initial trigger of this event involved a high voltage static wire (used to help protect the utility distribution lines).

How did we respond? This event was first detected by our EPMS (Electrical Power Monitoring System) in West US 2, which in turn alerted our on-site teams. Due to the nature of this event, the team followed our Emergency Operations Procedure (EOP) to manually restore Mechanical, Electrical, Plumbing (MEP) equipment to its operational state. Four Azure Storage scale units were impacted by the power loss. Impacted Azure compute scale sets were brought back online – mostly automatically after storage recovered, but a subset of infrastructure and customer VMs required manual recovery.

How are we making incidents like this less likely or less impactful? Already completed:
Short term:
Longer term:
How can customers and partners make incidents like this less impactful?
How can we make our incident communications more useful? We are piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template. You can rate this PIR and provide any feedback using our quick 3-question survey: https://www.aka.ms/AzPIR/MMXN-RZ0

8/18 Post Incident Review (PIR) – Azure Key Vault – Provisioning Failures (Tracking ID YLBJ-790)

What happened? Between 16:30 UTC on 18 Aug 2022 and 02:22 UTC on 19 Aug 2022, a platform issue caused Azure offerings such as Bastion, ExpressRoute, Azure Container Apps, Azure ML, Azure Managed HSM, Azure Confidential VMs, and Azure Database Services (MySQL – Flexible Server, Postgres – Flexible Server, PostgreSQL – Hyperscale) to experience provisioning failures.

What went wrong, and why? The requesting authority for Azure Key Vault (the underlying platform, on which all the described services rely for the creation of key vaults) experienced a backlog of requests, leading to elevated latency and failures.

How did Microsoft respond? We developed and deployed a hotfix to increase the throughput, created new queues for request processing, and drained the queue of accumulated requests to alleviate the overall latency and process requests as expected.

How is Microsoft making incidents like this less likely, or at least less impactful?
• In the short term, we are implementing request caps and partitioning the request load.
• We are also reviewing the backend capacity and gaps in the maintenance process that led to the loss of availability during this maintenance operation.
• Based on our learning from this incident, we are implementing improvements to our health monitoring and operational guidance that will help reduce the time to detect similar issues and allow us to address them before customers experience impact.
• In the longer term, we are working to add fine-grained distributed throttling and partitioning, to add additional isolation layers to the backend of this service, which will minimize impact in similar scenarios.
• Finally, we will work to add more Availability Zones and fault domains in all layers of the stack, along with automatic failover for the service, to help prevent disruption to customer workloads.

How can we make our incident communication more useful? We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format. You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/YLBJ-790

8/12 Post Incident Review (PIR) – Azure Communication Services – Multiple Regions (Tracking ID YTYN-5T8)

What happened? Between 18:13 UTC on 12 Aug 2022 and 03:30 UTC on 13 Aug 2022, customers using Azure Communication Services (ACS) may have experienced authentication failures, or failures using our APIs. As a result, multiple scenarios may have been impacted, including SMS, Chat, E-Mail, Voice & Video scenarios, Phone Number Management, and Teams-ACS Interop.

What went wrong, and why? An Azure resource provider provides the ability for customers to create and maintain ACS resources.

How did we respond? Automated alerting indicated several failures for different ACS API requests made by customers. We immediately investigated with multiple engineering teams; however, understanding the nature of the issue took time because specific fields used for debugging Cosmos DB issues were not being logged for successful queries. Due to the service configuration, a rollback of the change to the database instance would not have been supported. Once the root cause was identified, we mitigated the issue.

How are we making incidents like this less likely or less impactful? Completed:

How can we make our incident communications more useful? We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format. You can rate this PIR and provide any feedback using our quick 3-question survey.

July 2022

7/29 Post Incident Review (PIR) – Network Connectivity Issues (Tracking ID 7SHM-P88)

What happened? Between 08:00 UTC and 13:20 UTC on 29 July 2022, customers may have experienced connectivity issues such as network drops, latency, and/or degradation when attempting to access or manage Azure resources. The most significant impact would have been experienced in the following regions – Brazil South, Canada Central, East Asia, East US, East US 2, France Central, Japan East, Korea Central, North Central US, South Africa North, South Central US, Southeast Asia, West Europe, and West US. Customers in other regions may have seen intermittent impact when accessing resources across the Microsoft wide area network (WAN).

What went wrong, and why? Starting at 08:00 UTC on 29 July, the Azure WAN began to experience a sudden and significant increase of traffic, upwards of 60 Tbps in additional traffic compared to the normal levels of traffic carried on the network. While the event was detected immediately and automated remediation was triggered, the substantial increased bursts of traffic occurring throughout the event affected the ability of automated mitigations to continue providing the necessary relief to the network. This event included impact to both intra-region and cross-region traffic over various network paths, which included ExpressRoute. Our investigation of this event continues, ensuring our diagnosis of contributing factors is complete and mitigations for this class of incident are finalized. The remaining workstreams are expected to be completed shortly.

How did we respond? We have several detection and mitigation algorithms that were triggered automatically around 08:00 UTC when an increased burst of traffic occurred. The volume of traffic surges continued to substantially increase, reaching 10-15 times greater than any traffic volume experienced on the network prior. While the mitigation mechanisms were engaged, they could not fully absorb the continued surges. By 13:20 UTC, traffic levels returned to normal as network telemetry confirmed packet drops had reduced to standard levels, which is when customers would have seen resource and service network health restored.

How are we making incidents like this less likely or less impactful? We are implementing service repairs because of this incident, including but not limited to: Already Completed:
Work in Progress:
How did we communicate with impacted customers? Starting around 11:00 UTC, we began to receive some reports of a potential platform issue. Delays in communications via Service Health in the portal were primarily due to challenges gauging the extent of impact and affected regions, as limited telemetry of the developing networking event did not clearly indicate a viable scope of impact. Though other signals via internal and external reports indicated a likely ongoing platform event, the disparity of signals made scoping difficult. Communications were sent via Azure Service Health for Azure services that started to report impact, which were later determined to be affected by the networking event. With further analysis and evidence of regional impact confirmed, broad targeted communication was sent to customers region-wide for the identified affected regions by 13:28 UTC. Between 13:28 and 15:32 UTC, communications were updated as the situation evolved. By 15:32 UTC, we began reporting recovery via the status page and Service Health, but monitoring and preventative workstreams persisted, which we continued to report on until the necessary preventative workstreams were completed by 19:52 UTC.

How can we make our incident communications more useful? We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format. You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/7SHM-P88

7/21 Post Incident Review (PIR) – SQL Database – West Europe (Tracking ID 3TBL-PD8)

What happened? Between 03:47 UTC and 13:30 UTC on 21 Jul 2022, customers using SQL Database and SQL Data Warehouse in West Europe may have experienced connection failures. New connections to the region and related management operations began failing from 03:47 UTC, partial recovery began at 06:12 UTC, with full mitigation at 13:30 UTC. During this impact window, several downstream Azure services that were dependent on the SQL Database service in the region were also impacted – including App Services, Automation, Backup, Data Factory V2, and Digital Twins. Customers that had configured active geo-replication and failover groups had the option to fail over to their geo-secondary.

What went wrong, and why? For context, connections to the Azure SQL Database service are routed through regional gateway processes that rely on persisted metadata caches. An operator error led to an incorrect action being performed in close sequence on all four persisted metadata caches. The action resulted in a configuration change that made the caches unavailable to the regional gateway processes. This resulted in all regional gateway processes failing to route new connections. A secondary impact of the issue was that our internal telemetry was also affected. As some customers were receiving automatic notifications of impact within 15 minutes, we assumed that the notification pipeline was working as designed; it was later in the event that we learned otherwise. Additionally, automatic failover for anyone who had set up failover groups with auto-failover configuration was also impacted due to telemetry issues (manual failover was not impacted).

How did we respond? This regional incident was detected by our monitoring. On applying this initial mitigation, the caches came back online. The first issue was mitigated by restarting all regional gateway processes. The last step was to determine which persistent cache entries were still affected. Based on login success rate telemetry, the incident was declared mitigated.

How are we making incidents like this less likely or less impactful? We are implementing a number of service repairs as a result of this incident, including but not limited to: Completed:
In progress:
How can our customers and partners make incidents like this less impactful? Customers who had configured active geo-replication and failover groups would have been able to recover by performing a forced-failover to the configured geo-replica. More guidance for recovery in regional failure scenarios is available in the Azure documentation.

How can we make our incident communications more useful? We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format. You can rate this PIR and provide any feedback using our quick 3-question survey.

Which Azure service can identify all global service issues? You’ll want to dive into Azure Monitor to see if you can identify any issues on your end. Azure Monitor gives you a way to collect, analyze, and act on all the telemetry from your cloud and on-premises environments. These insights can help you maximize the availability and performance of your applications.

Which feature within Azure alerts you to service issues that happen in Azure itself? Azure Service Health notifies you about Azure service incidents and planned maintenance so you can take action to mitigate downtime.

Which Azure management service informs you about problems with the Azure platform itself and upcoming maintenance events? Azure Service Health helps you stay informed and take action when Azure service issues like outages and planned maintenance affect you. It provides you with a personalized dashboard that can help you understand issues that may be impacting resources in your Azure subscriptions.

Which tool within Azure is comprised of Azure status, Service Health, and Resource Health? Azure Service Health is a suite of experiences that provide personalized guidance and support when issues in Azure services affect you, or may affect you in the future. Azure Service Health is composed of Azure status, the Service Health service, and Resource Health.
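As a concrete illustration of the Service Health guidance above (and of the earlier recommendation to configure Azure Service Health alerts), the following is a minimal Python sketch that creates a subscription-scoped activity log alert for the ServiceHealth category and routes it to an existing action group, using the Azure Resource Manager REST API with the requests and azure-identity packages. The subscription ID, resource group, alert name, and action group are placeholders, and the api-version shown is an assumption to verify against current documentation.

"""Minimal sketch: create an Azure Service Health activity log alert via the
Azure Resource Manager REST API. All names and IDs below are placeholders."""
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
ALERT_NAME = "service-health-alert"
ACTION_GROUP_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/microsoft.insights/actionGroups/<action-group-name>"
)
API_VERSION = "2020-10-01"  # assumed api-version; confirm before use

# Acquire an ARM token (works with az login, a managed identity, etc.).
credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.Insights/activityLogAlerts/{ALERT_NAME}"
    f"?api-version={API_VERSION}"
)

body = {
    "location": "Global",  # activity log alerts are global resources
    "properties": {
        # Watch the whole subscription for ServiceHealth events (service
        # issues, planned maintenance, and health advisories).
        "scopes": [f"/subscriptions/{SUBSCRIPTION_ID}"],
        "condition": {
            "allOf": [{"field": "category", "equals": "ServiceHealth"}]
        },
        # Deliver notifications (email, SMS, push, webhook) via an existing
        # action group.
        "actions": {"actionGroups": [{"actionGroupId": ACTION_GROUP_ID}]},
        "enabled": True,
        "description": "Notify on Azure service issues affecting this subscription",
    },
}

response = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
print(response.json()["id"])

The action group referenced here is what actually sends the email, SMS, push notification, or webhook mentioned above, so it needs to exist before the alert is created; the same alert can also be set up through the Azure portal or an ARM template.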