2023-03-13 Incident Response - Scheduler Outage
Incident & Impact
Beginning on the morning of March 13th, ChannelMix experienced an error that prevented scheduled ETL tasks from being added to process nodes. The outage affected all data processing nodes across the ChannelMix ETL queues, which caused morning SLAs to be missed.
Chronology
Once the issue was detected, the ChannelMix data processing team spent 15 minutes troubleshooting it. At 7:46 am, the ChannelMix platform team was alerted that the scheduler was down. At 8:00 am, clients were notified by alerts in both ChannelMix Control Center and Help Center that there was a delay in data processing for the day.
An upgrade of the Python croniter package was successfully deployed, and scheduled job processing restarted at around 10:00 am. Additional workers were started on the queues to speed up processing. Jobs had finished processing by about 4:00 pm, at which time the system had completely recovered.
Fault
Downtime was directly caused by an outdated version of croniter, a Python package that deals with time (crons), which began raising errors after the annual DST time change.
Root Cause
The Python croniter package was out of date. We believe the Daylight Saving Time change caused the old croniter code to break and start throwing the observed errors, which brought down the scheduler.
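To illustrate the general class of problem, the sketch below shows one way a DST change can trip up schedule calculations: on the spring-forward date, some wall-clock times never occur. It is not a reconstruction of the actual croniter failure, whose exact code path is not identified in this report. The zone America/Chicago and the helper name is_skipped_wall_time are assumptions chosen for illustration only.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: the scheduler's time zone is not stated in this report;
# America/Chicago is used purely for illustration.
SCHEDULER_TZ = ZoneInfo("America/Chicago")


def is_skipped_wall_time(naive: datetime, tz: ZoneInfo = SCHEDULER_TZ) -> bool:
    """Return True if this wall-clock time never occurs in tz (DST gap)."""
    aware = naive.replace(tzinfo=tz)
    # A real local time survives a round trip through UTC unchanged;
    # a time inside the spring-forward gap comes back shifted.
    round_trip = aware.astimezone(timezone.utc).astimezone(tz)
    return round_trip.replace(tzinfo=None) != naive


if __name__ == "__main__":
    # 2023-03-12 is the US spring-forward date that preceded the incident.
    for hour in (1, 2, 3):
        candidate = datetime(2023, 3, 12, hour, 30)
        print(candidate, "skipped:", is_skipped_wall_time(candidate))
```

A job scheduled for 02:30 on that date has no valid local run time, so scheduling code must decide whether to skip it or shift it. Code that does not handle this case can fail or compute the wrong next run.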
Recurrence
We believe, with high confidence, that this specific issue has a very low chance of recurrence. It was caused by an out-of-date Python package that deals with time (crons), it was triggered by the annual DST time change that occurred just before the incident, and that package has since been upgraded.
An outage related to an out-of-date Python package has a high probability of recurrence if packages are not updated in a timely manner. We have a process in place to track, prioritize, and deal with code dependency chain updates. That said, we always have to triage, since packages never stop becoming outdated (or insecure) and updates can require a lot of effort (due to changes cascading into other code and/or regression testing).
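As a rough illustration of what an automated check in such a process could look like, the sketch below compares installed package versions against a list of minimum versions. It does not describe our actual tooling. The function names, the simplified version parsing, and every version number shown are hypothetical placeholders.

```python
import re
from importlib import metadata


def parse_version(text):
    """Parse the leading numeric parts of a version string, e.g. '1.3.8' -> (1, 3, 8)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric segment
        parts.append(int(match.group()))
    return tuple(parts)


def find_outdated(minimums, installed_version=metadata.version):
    """Return {package: reason} for packages that are missing or below their minimum."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if parse_version(installed) < parse_version(minimum):
            report[name] = f"installed {installed}, want >= {minimum}"
    return report


if __name__ == "__main__":
    # Hypothetical minimum; not the version involved in this incident.
    print(find_outdated({"croniter": "1.0.0"}))
```

The version parsing here is deliberately simplified and ignores pre-release and local version suffixes. A production check would more likely rely on a dedicated version-parsing library or the packaging tool's own outdated-package report.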
An outage related to time or time changes has a medium probability of recurrence. Time-related issues are not uncommon, and packages and applications that depend on accurate time are prone to failure if code is not updated to deal with time and time zones properly.
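One way to reduce this risk is to exercise scheduling code against known time-change dates before they arrive. The sketch below, which is an assumption about how such dates could be found and not a description of an existing ChannelMix test, lists the dates in a year on which a zone's UTC offset changes. The function name and the example zone are hypothetical.

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo


def dst_transition_dates(year, tz):
    """Return the dates in `year` on which the UTC offset of `tz` changes."""
    transitions = []
    day = date(year, 1, 1)
    while day.year == year:
        next_day = day + timedelta(days=1)
        # Compare the offset at local midnight on consecutive days.
        start = datetime(day.year, day.month, day.day, tzinfo=tz)
        end = datetime(next_day.year, next_day.month, next_day.day, tzinfo=tz)
        if start.utcoffset() != end.utcoffset():
            transitions.append(day)
        day = next_day
    return transitions


if __name__ == "__main__":
    # Assumption: America/Chicago is an illustrative zone only.
    for changed in dst_transition_dates(2023, ZoneInfo("America/Chicago")):
        print(changed.isoformat())
```

For 2023 in the illustrative zone, this yields March 12 and November 5. Scheduler tests could then be run with the clock set to times around each of those dates.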
In closing
At ChannelMix, we understand how important uninterrupted service is to our customers, and we are committed to verifiably improving the resilience of our services based on everything we learned during and after this outage.