PXL Vision - Services are unavailable - Incident Details

All systems operational

Services are unavailable

Resolved
Major outage
Reported 7 months ago. Lasted about 5 hours.

Affected

PXL Ident

Major outage from 10:42 AM to 3:55 PM

PXL Ident API

Major outage from 10:42 AM to 3:55 PM

PXL Check

Major outage from 10:42 AM to 3:55 PM

PXL Cloud

Major outage from 10:42 AM to 3:55 PM

Notification Service

Major outage from 10:42 AM to 3:55 PM

Updates
  • Postmortem

    Summary

    On November 5, 2025, some customers experienced degraded performance or temporary unavailability of our platform due to a power-related incident in one availability zone at our third-party infrastructure provider.

    The issue was entirely on the provider side, triggered by an unplanned power interruption during scheduled facility maintenance, compounded by downstream power-distribution and cabling failures. No software bug, configuration change, or action on our side contributed to the incident.

    What Happened

    Our automated monitoring systems detected the sudden unavailability of our applications. We immediately engaged our on-call team, confirmed no changes on our side, and opened a critical support case with the provider.
    The root cause was an unexpected power feed interruption during scheduled facility maintenance, combined with power-balancing and cabling issues on a subset of the provider’s equipment. This caused some hypervisor nodes to shut down.
    Once the provider restored stable power, we performed controlled restarts of all our affected instances. No configuration changes were made on our side.

    Impact

    Customers experienced up to 5 hours of outage or partial degradation.

    Root Cause

    During scheduled maintenance, the provider experienced an unexpected loss of one primary power feed. Redundant feeds should have covered the gap, but a combination of power-balancing logic errors and previously unknown cabling issues caused a subset of racks to lose power completely, resulting in hard shutdowns of the affected hypervisors.

    What We Learned

    This incident highlighted that even with redundancy measures at the infrastructure-provider level, a severe physical failure in one availability zone can still cause an extended impact on workloads tied to that zone.

    Our detection, alerting, and incident coordination processes worked as intended:

    • The issue was detected quickly.

    • On-call staff engaged immediately.

    • Customer communication was initiated and maintained during the incident.

    We will continue to strengthen these areas and further reduce the time to mitigation and time to full recovery.

    Actions Taken

    During and immediately after the incident, we:

    • Performed controlled restarts of all affected instances with priority on customer-facing workloads.

    • Ran post-restart health checks across the impacted services to confirm full recovery.

    • Verified that there was no data loss or data corruption.

    • Conducted an internal review of monitoring, alerting, and on-call procedures for this type of provider-side failure.

    Moving Forward

    Although this incident originated at our infrastructure provider, we treat its impact on you as our responsibility. To further reduce risk and improve resilience, we are:

    • Strengthening monitoring and alerting for provider-level issues
      Adding and fine-tuning health probes and alerts to detect infrastructure-level anomalies even earlier and provide more granular visibility during such events.

    • Refining incident response playbooks
      Updating and expanding our runbooks for power- and infrastructure-related incidents so that our teams can apply the fastest, safest mitigation steps consistently.

    • Improving communication during incidents
      Reviewing and optimizing how and when we send updates during outages, with the goal of providing clearer, more frequent, and more actionable information while an incident is ongoing.

    • Deepening collaboration with our infrastructure provider
      Working with the provider to ensure that the remediation measures on their side are effective and to improve early warning and escalation paths for future maintenance operations.

    We will continue to evaluate and progressively enhance our overall resilience strategy, including additional redundancy and failover mechanisms where they bring the most benefit to our customers.
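    As an illustration of the kind of zone-aware health probing described above, the following minimal sketch groups probe results by availability zone and flags a zone as a whole when most of its instances fail together. This is not our production monitoring code: the `ProbeResult` type, the `classify_zones` function, the zone labels, and the 50% threshold are all hypothetical names and values chosen for the example.

    ```python
    from collections import defaultdict
    from dataclasses import dataclass


    @dataclass(frozen=True)
    class ProbeResult:
        instance: str   # hypothetical instance name
        zone: str       # hypothetical availability-zone label
        healthy: bool   # outcome of the most recent health probe


    def classify_zones(results, zone_threshold=0.5):
        """Map each zone to 'ok', 'degraded', or 'zone_outage'.

        A zone is reported as 'zone_outage' when the fraction of failed
        probes reaches zone_threshold (an assumed value, not a tuned one).
        """
        by_zone = defaultdict(list)
        for result in results:
            by_zone[result.zone].append(result.healthy)

        status = {}
        for zone, outcomes in by_zone.items():
            failed_fraction = outcomes.count(False) / len(outcomes)
            if failed_fraction == 0:
                status[zone] = "ok"
            elif failed_fraction >= zone_threshold:
                status[zone] = "zone_outage"
            else:
                status[zone] = "degraded"
        return status
    ```

    Grouping by zone before alerting is one way to tell an isolated instance failure apart from an infrastructure-level event such as the power loss described in this report, which would show up as many instances in the same zone failing at once.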

    Closing Note

    We sincerely apologize for the disruption this incident caused. While the origin of the issue was outside our direct control, its impact on you is not. Our focus is on reducing both the likelihood and the impact of similar events in the future and on continuously strengthening the reliability of our platform.

    Thank you for your patience and continued trust. If you have any questions, would like more details about this incident, or want to discuss your own continuity requirements, our support team is here to help.

    Sincerely,

    PXL Team

  • Resolved

    We are pleased to confirm that the technical issue impacting our third-party service provider has been fully resolved, and all services have been restored and are operating normally.

    You should now be able to resume your operations without any further disruption.

    Thank you once again for your incredible patience and understanding while our partner worked to fix this unforeseen issue. We sincerely apologize for any inconvenience this disruption caused.

    Please monitor your systems and do not hesitate to contact us immediately if you encounter any lingering issues.

    Sincerely,

    PXL Team

  • Update

    The services are partially available, and recovery is in progress. We cannot yet confirm that the services are fully stable, so we recommend enabling a maintenance window for your users until we can.

    We are continuously monitoring the provider's progress. We do not yet have a firm Estimated Time of Resolution (ETR) to share, but we will notify you immediately once service is restored or an ETR is confirmed.

    Thank you for your continued patience.

    Sincerely,

    PXL Team

  • Monitoring

    Our third-party service provider is still actively working on a solution to resolve the technical issue.

    We are continuously monitoring their progress. We do not yet have a firm Estimated Time of Resolution (ETR) to share, but we will notify you immediately once service is restored or an ETR is confirmed.

    Thank you for your continued patience.

    Sincerely,

    PXL Team

  • Detected

    Our services are currently experiencing a temporary disruption and are unavailable.

    This is due to an unforeseen technical issue impacting our third-party service provider. We understand that this may cause inconvenience, and we sincerely apologize for the disruption to your operations.

    Our team is in constant communication with the service provider, and they are working urgently to resolve the issue and restore full functionality as quickly as possible. We are closely monitoring the situation and will provide you with an update as soon as we have a firm estimated time of resolution or once service is restored.

    We appreciate your patience and continued understanding during this time.