My first postmortem

The following is the incident report for the website malfunction on February 5, 2029.

Issue Summary:

From 7:10 AM to 7:31 AM, requests to the main website intermittently returned a 500 Internal Server Error because of a bug in the PHP index file on one of the servers in the load balancer cluster. An estimated 40% of the customers using the website during that period were affected.
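
Why one bad server shows up as intermittent errors can be sketched with a small simulation. It assumes a round-robin load balancer and a three-server cluster; the balancing policy, the server names, and the cluster size are all assumptions made for illustration and are not taken from the incident, so the resulting error rate is not meant to match the 40% estimate above.

```python
from itertools import cycle

# Hypothetical cluster: names and size are illustrative, not the real setup.
BACKENDS = ["web-01", "web-02", "web-new"]
FAULTY = {"web-new"}  # stands in for the server with the broken index file


def simulate(requests: int) -> list[int]:
    """Return the HTTP status each request would get under round-robin."""
    rotation = cycle(BACKENDS)
    return [500 if next(rotation) in FAULTY else 200 for _ in range(requests)]


statuses = simulate(9)
error_rate = statuses.count(500) / len(statuses)
print(statuses)                        # [200, 200, 500, 200, 200, 500, 200, 200, 500]
print(f"error rate: {error_rate:.0%}")  # error rate: 33%
```

From a single user's point of view, the same page works on one request and fails on the next, which is why the errors looked random.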

Timeline (all times Pacific Time):

  • 6:30 AM - New server set up.

  • 7:00 AM - Servers rebooted.

  • 7:00 AM - New server added to the load balancer cluster.

  • 7:10 AM - Malfunction began.

  • 7:10 AM - Pagers alerted the on-call teams.

  • 7:15 AM - Recovery began.

  • 7:30 AM - Problem found.

  • 7:31 AM - Problem fixed.

  • 7:35 AM - The whole system was fully back online.

Root Cause:

At 7:00 AM, a new server was added to the load balancer cluster without being tested. The PHP index file on the newly added server had a bug: a misspelled reference to one of its dependency files. As a result, requests to the site that were routed to this server returned a 500 error, which made the failures appear random.
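
A misspelled dependency reference of this kind can be caught before deployment with a simple static check. The following is a minimal sketch, not the tooling used during the incident: the function name `missing_includes` is hypothetical, it only understands literal `require`/`include` paths, and it assumes those paths are relative to the directory of the PHP file, which may not match the server's actual include path configuration.

```python
import re
from pathlib import Path

# Matches simple literal forms such as: require_once 'lib/db.php';
# Dynamic paths (variables, __DIR__ . '/x.php') are deliberately not handled.
INCLUDE_RE = re.compile(
    r"""\b(?:require|include)(?:_once)?\s*\(?\s*['"]([^'"]+)['"]"""
)


def missing_includes(php_file: Path) -> list[str]:
    """Return literal include/require targets that do not exist on disk.

    Assumption: targets are resolved relative to the PHP file's directory.
    """
    source = php_file.read_text(encoding="utf-8")
    base = php_file.parent
    return [
        target
        for target in INCLUDE_RE.findall(source)
        if not (base / target).is_file()
    ]
```

Running `missing_includes(Path("index.php"))` against the index file (the path here is a placeholder) returns the referenced files that cannot be found; an empty list means every literal reference resolved.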

Resolution and recovery:

At 7:10 AM, as soon as the malfunction began, the monitoring service alerted the on-call teams through PagerDuty.

At 7:15 AM, we realized that the problem originated from the newly added server. We tried several tools to find the root cause and then decided to use strace to inspect how the newly added server was handling requests. At 7:30 AM, we found the bug in the PHP index file in the web server's root directory. By 7:31 AM, the problem was fixed.

The server was then tested between 7:31 AM and 7:34 AM to avoid any further issues.

The new server was then added back to the load balancer cluster, and the whole system was fully functional by 7:35 AM.

Corrective and Preventative Measures:

After the issue was resolved, a new testing environment was created to test new servers before they are added to the load balancer cluster, so that the issue does not happen again.
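
One way such a pre-deployment test could work is a readiness gate that requests a few pages from the new server and refuses to approve it unless every response is a 200. This is a sketch under assumptions, not a description of the testing environment that was actually built: the function names, the list of paths, and the number of attempts are all hypothetical.

```python
from typing import Callable, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as exceptions
    except (URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, and similar


def ready_for_pool(
    base_url: str,
    paths: Iterable[str],
    attempts: int = 5,
    fetch: Callable[[str], int] = fetch_status,
) -> bool:
    """True only if every path returns 200 on every attempt."""
    return all(
        fetch(base_url + path) == 200
        for path in paths
        for _ in range(attempts)
    )
```

A deployment script could call `ready_for_pool("http://new-server.internal", ["/"])` (a placeholder address) and add the server to the load balancer only when it returns `True`. The `fetch` parameter exists so the check can be exercised without a network, for example by passing a stub that always returns 500.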