Fracttal One Service Access Degradation

Incident Report for Fracttal

Resolved

Between 20:10 UTC and 20:26 UTC, Fracttal One experienced intermittent access issues caused by a sudden, unexpected spike in concurrent requests across several core services. During this period, some users could not access the application, saw slow response times, or encountered timeouts while performing operations.

The issue was fully mitigated by 20:26 UTC, and service performance has remained stable since.

⸻

Impact
• Intermittent access failures for a subset of users
• Slow response times across API endpoints
• Occasional HTTP 503/504 errors
• Degraded performance in login and dashboard initialization

No data integrity issues occurred. All stored information in AWS Europe (Paris) remained safe and unaffected.

⸻

Root Cause

A sudden increase in concurrent requests exceeded the expected capacity of one of the API layers, the one responsible for authentication and initial workload distribution.

Key contributing factors:
1. A temporary saturation of application workers due to concurrency bursts.
2. Queue accumulation that caused slower processing and cascading delays.
3. Resource contention on specific tasks triggered by simultaneous requests.

The system’s autoscaling rules responded, but not quickly enough to absorb the initial load spike.
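The interaction between a load spike, worker saturation, and delayed autoscaling can be illustrated with a small model. The sketch below is illustrative only: the function name `simulate_backlog`, the tick-based timing, and every number in it are assumptions chosen for the example, not measurements from this incident or a description of Fracttal's actual autoscaling configuration.

```python
def simulate_backlog(arrivals, workers_start, per_worker_rate,
                     scale_delay, scale_step, high_water):
    """Return the queue depth after each tick of a toy worker-pool model.

    All parameters are hypothetical and exist only for illustration.
    """
    workers = workers_start
    backlog = 0
    pending = []   # ticks at which requested capacity comes online
    depths = []
    for tick, arriving in enumerate(arrivals):
        # Capacity requested earlier becomes usable once its delay has passed.
        workers += scale_step * sum(1 for t in pending if t <= tick)
        pending = [t for t in pending if t > tick]

        # Requests beyond what the workers can process this tick stay queued.
        capacity = workers * per_worker_rate
        backlog = max(0, backlog + arriving - capacity)
        depths.append(backlog)

        # The autoscaler reacts only after the backlog passes the high-water
        # mark, and the new workers arrive scale_delay ticks later.
        if backlog > high_water and not pending:
            pending.append(tick + scale_delay)
    return depths


# A burst of 400 requests per tick against a pool sized for 120 per tick.
burst = [100] * 3 + [400] * 6 + [100] * 6
slow = simulate_backlog(burst, workers_start=4, per_worker_rate=30,
                        scale_delay=3, scale_step=4, high_water=200)
fast = simulate_backlog(burst, workers_start=4, per_worker_rate=30,
                        scale_delay=1, scale_step=4, high_water=200)
print("peak backlog, slow scaling:", max(slow))
print("peak backlog, fast scaling:", max(fast))
```

In this model the queue keeps growing while the requested capacity is still starting up, so a shorter scaling delay yields a much smaller peak backlog for the same burst.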

⸻

Mitigation Actions Taken
• Manually triggered scaling actions to increase available application workers.
• Redistributed traffic between service nodes to balance utilization.
• Cleared congested queues and reduced lock contention.
• Raised concurrency limits to prevent similar saturation points.

Full service functionality was restored by 20:26 UTC.
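As a rough illustration of what a concurrency threshold does, the sketch below caps the number of in-flight requests and rejects the excess immediately with a 503 instead of letting it queue. The class `ConcurrencyGate`, the helper `handle`, and the limits used are hypothetical; the report does not describe how Fracttal's services implement their limits.

```python
import threading


class ConcurrencyGate:
    """Admit at most `limit` in-flight requests (hypothetical helper)."""

    def __init__(self, limit):
        self._lock = threading.Lock()
        self._limit = limit
        self._in_flight = 0

    def raise_limit(self, new_limit):
        # Mirrors the "raise the threshold" mitigation; never lowers the limit.
        with self._lock:
            self._limit = max(self._limit, new_limit)

    def try_enter(self):
        # Non-blocking: a request is either admitted now or rejected now.
        with self._lock:
            if self._in_flight >= self._limit:
                return False
            self._in_flight += 1
            return True

    def leave(self):
        with self._lock:
            self._in_flight -= 1


def handle(gate, work):
    """Run `work` if a slot is free, otherwise shed the request with a 503."""
    if not gate.try_enter():
        return 503, "service busy, retry later"
    try:
        return 200, work()
    finally:
        gate.leave()


gate = ConcurrencyGate(limit=2)
print(handle(gate, lambda: "ok"))
```

Rejecting early keeps queues short during a burst, at the cost of some visible 503 responses; raising the limit admits more work but only helps if the workers behind it have capacity to spare.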

⸻

Preventive Actions

To avoid recurrence, the following improvements will be implemented:
1. Increase worker pool capacity and adjust autoscaling triggers.
2. Optimize concurrency management inside the authentication and routing layer.
3. Revise monitoring thresholds and add alerts for early detection of saturation.
4. Introduce adaptive scaling policies for burst handling.
5. Evaluate further horizontal scaling of impacted services.
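One possible shape for the early saturation alert in item 3 is a check that fires only when worker utilization stays high for several consecutive samples, so that a single busy moment does not page anyone. The sketch below is an assumption-laden example: the class `SaturationAlert`, the 80% threshold, and the three-sample window are hypothetical values, not Fracttal's monitoring settings.

```python
from collections import deque


class SaturationAlert:
    """Fire when utilization stays at or above `threshold` for `window` samples."""

    def __init__(self, threshold=0.8, window=3):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the newest samples

    def observe(self, busy_workers, total_workers):
        """Record one sample and return True if the alert should fire."""
        self.samples.append(busy_workers / total_workers)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and min(self.samples) >= self.threshold


alert = SaturationAlert(threshold=0.8, window=3)
for busy in (9, 9, 8, 5):
    print(busy, alert.observe(busy, total_workers=10))
```

Requiring a full window of high readings trades a little detection latency for fewer false alarms; a shorter window or lower threshold would detect a burst sooner.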

Posted Nov 18, 2025 - 15:00 GMT-05:00