Services are unavailable

Updates

Postmortem
November 26, 2025 at 08:28
Summary
On November 5, 2025, some customers experienced degraded performance or temporary unavailability of our platform due to a power-related incident in one availability zone at our third-party infrastructure provider.
The issue was entirely on the provider side, triggered by an unplanned power interruption during scheduled facility maintenance, compounded by downstream power-distribution and cabling failures. No software bug, configuration change, or action on our side contributed to the incident.
What Happened
Our automated monitoring systems detected the sudden unavailability of our applications. We immediately engaged our on-call team, confirmed no changes on our side, and opened a critical support case with the provider.
The root cause was an unexpected power feed interruption during scheduled facility maintenance, combined with power-balancing and cabling issues on a subset of the provider’s equipment. This caused some hypervisor nodes to shut down.
Once the provider restored stable power, we performed controlled restarts of all our affected instances. No configuration changes were made on our side.
Impact
Affected customers experienced up to 5 hours of outage or partial degradation.
Root Cause: During scheduled maintenance, the provider experienced an unexpected loss of one primary power feed. Redundant feeds should have covered the gap, but a combination of power-balancing logic errors and previously unknown cabling issues caused a subset of racks to lose power completely, resulting in hard shutdowns of the affected hypervisors.
What We Learned
This incident highlighted that, even with redundancy measures at the infrastructure-provider level, a severe physical failure in one availability zone can still cause extended impact on workloads tied to that zone.
Our detection, alerting, and incident coordination processes worked as intended:
- The issue was detected quickly.
- On-call staff engaged immediately.
- Customer communication was initiated and maintained during the incident.
We will continue to strengthen these areas and further reduce the time to mitigation and time to full recovery.
Actions Taken
During and immediately after the incident, we:
- Performed controlled restarts of all affected instances with priority on customer-facing workloads.
- Ran post-restart health checks across the impacted services to confirm full recovery.
- Verified that there was no data loss or data corruption.
- Conducted an internal review of monitoring, alerting, and on-call procedures for this type of provider-side failure.
Moving Forward
Although this incident originated at our infrastructure provider, we treat its impact on you as our responsibility. To further reduce risk and improve resilience, we are:
- Strengthening monitoring and alerting for provider-level issues
  Adding and fine-tuning health probes and alerts to detect infrastructure-level anomalies even earlier and provide more granular visibility during such events.
- Refining incident response playbooks
  Updating and expanding our runbooks for power- and infrastructure-related incidents so that our teams can apply the fastest, safest mitigation steps consistently.
- Improving communication during incidents
  Reviewing and optimizing how and when we send updates during outages, with the goal of providing clearer, more frequent, and more actionable information while an incident is ongoing.
- Deepening collaboration with our infrastructure provider
  Working with the provider to ensure that the remediation measures on their side are effective and to improve early warning and escalation paths for future maintenance operations.
We will continue to evaluate and progressively enhance our overall resilience strategy, including additional redundancy and failover mechanisms where they bring the most benefit to our customers.
Closing Note
We sincerely apologize for the disruption this incident caused. While the origin of the issue was outside our direct control, its impact on you is not. Our focus is on reducing both the likelihood and the impact of similar events in the future and on continuously strengthening the reliability of our platform.
Thank you for your patience and continued trust. If you have any questions, would like more details about this incident, or want to discuss your own continuity requirements, our support team is here to help.
Sincerely,
PXL Team
Resolved
November 5, 2025 at 15:55
We are pleased to confirm that the technical issue impacting our third-party service provider has been fully resolved, and all services have been restored and are operating normally.
You should now be able to resume your operations without any further disruption.
Thank you once again for your incredible patience and understanding while our partner worked to fix this unforeseen issue. We sincerely apologize for any inconvenience this disruption caused.
Please monitor your systems and do not hesitate to contact us immediately if you encounter any lingering issues.
Sincerely,
PXL Team
Update
November 5, 2025 at 15:09
The services are partially available, and recovery is in progress. They are not yet fully stable, so we recommend keeping a maintenance window in place for your users until we can confirm full stability.
We are continuously monitoring the provider's progress. We do not yet have a firm Estimated Time of Resolution (ETR) to share, but we will notify you immediately once service is restored or an ETR is confirmed.
Thank you for your continued patience.
Sincerely,
PXL Team
Monitoring
November 5, 2025 at 12:03
Our third-party service provider is still actively working on a solution to resolve the technical issue.
We are continuously monitoring their progress. We do not yet have a firm Estimated Time of Resolution (ETR) to share, but we will notify you immediately once service is restored or an ETR is confirmed.
Thank you for your continued patience.
Sincerely,
PXL Team
Detected
November 5, 2025 at 10:42
Our services are currently experiencing a temporary disruption and are unavailable.
This is due to an unforeseen technical issue impacting our third-party service provider. We understand that this may cause inconvenience, and we sincerely apologize for the disruption to your operations.
Our team is in constant communication with the service provider, and they are working urgently to resolve the issue and restore full functionality as quickly as possible. We are closely monitoring the situation and will provide you with an update as soon as we have a firm estimated time of resolution or once service is restored.
We appreciate your patience and understanding during this time.

PXL Vision - Services are unavailable – Incident details