Platform-wide Outage
Incident Report for Syngency
Postmortem

All times are in PDT.

Introduction

At roughly 11:30pm on September 25th, Syngency experienced an outage which prevented the platform from accepting or responding to requests, with full service not restored until 10:45am on September 26th.
The following is a timeline of the events that occurred throughout the incident, along with the actions we will take to prevent such incidents from occurring again in future.

Before walking through the events and steps taken during the outage, it helps to have some context as to where and how Syngency runs. All services, requests, data, and messaging run on Amazon's AWS cloud computing platform, chosen for its reliability, stability, and proven track record. At the time of writing, AWS handles roughly a third of the cloud computing market.

However, AWS is itself a behemoth and, depending on the size of your service, requires dedicated, trained engineers to keep everything running at optimal levels. Because of this, numerous intermediary services exist to sit between AWS and the service you provide. Here at Syngency, we use one of these intermediary services, called Convox.

Convox allows us to focus on application and feature development for the Syngency platform, leaving the handling of infrastructure, networking, and compute instances to this intermediary service. This is how Syngency has operated without issue for the past seven years. However, as laid out in the events that follow, this service directly let us down on this occasion, and as part of our ongoing improvements we will evaluate its continued use.

Sequence of Events

11:30pm A configuration change was made to our intermediary service, Convox, to stop it making “health check” requests. The check URL it was using was invalid and, as we understood it, health checks were already enabled in AWS.

11:35pm The configuration change took effect and we immediately found that Syngency requests were not being fulfilled.

11:36pm After confirming that requests were going unhandled, we began the rollback process. Convox gives us the ability to roll back to any previous, stable release of the codebase. After a code deploy we may find bugs or errors, which we fix and then move forward; but for unexpected, unforeseen production issues, the rollback option allows us to revert with minimal impact.

11:37pm On attempting the rollback, we noticed that our server “Rack” was in an “Unknown/Unresponsive” state. A “Rack” is the name given to all of the compute instances and resources needed for Syngency to run; in this case, our “Production Rack”.

11:45pm By checking our AWS account directly, bypassing the intermediary Convox service, we realised that Convox itself had completely and automatically deleted our “Production Rack” as a result of the configuration change. To be clear, we find this inexcusable. We were shocked, and we are still working with Convox to ensure this cannot happen again. In short, no Rack should ever be automatically deleted without the owner's consent, action, or request. This is an issue we will be raising with Convox in the coming days.

12:00am Regroup. We knew we needed to restore service as soon as possible, but we had an unforeseen problem. As we had been running Syngency with Convox for some years without issue, all of the deployment code used to interact with Convox was in their “Generation 1” format. Convox still supports the existing Generation 1 format; however, any newly created Rack would automatically use the newer Generation 2 format. We had to decide whether to recreate our build code using the Generation 2 format, or to bypass Convox altogether and build directly on AWS. We decided to move to the Generation 2 format, as we felt this was the better option for moving forward and getting reliable, reproducible builds.

1:00am By this time we had adapted our build settings to the Generation 2 format, with services running inside AWS via Convox. However, these services were not yet able to serve public requests, which created a whole new issue for us. When Convox deploys a new compute instance inside a Rack, it first performs a health/readiness check to ensure the new instance can serve requests. Because our instances were not yet able to serve public requests, Convox deemed each newly deployed instance unhealthy and terminated it, then proceeded to create a replacement for the instance it had just terminated.

At this point we were effectively in an infinite loop: Convox would create our production services, see they were not yet ready, destroy them, and re-create them again. Any code changes we tried to make were useless, because new deployments were rejected while Convox considered itself to still be in the middle of a deployment.

The only way to stop this was to once again delete the entire Rack and re-create it with the new code changes. This is not an instant process; it takes time to re-create all of the AWS instances needed to run the Syngency platform.
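For illustration, the readiness check described above simply expects each new instance to respond successfully on a health-check path once it can serve traffic. Below is a minimal sketch of such an endpoint in Python/Flask; the framework, path, and readiness flag are assumptions for illustration only, not Syngency's actual implementation.

    from flask import Flask

    app = Flask(__name__)
    READY = {"ok": False}  # flipped to True once startup work (DB connections, etc.) completes

    def finish_startup():
        # e.g. confirm database and cache connections, then mark the instance ready
        READY["ok"] = True

    @app.route("/health")
    def health():
        # A 200 response tells the orchestrator this instance can serve traffic;
        # anything else and it will keep waiting, or terminate and replace the instance.
        return ("OK", 200) if READY["ok"] else ("starting", 503)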

2:00am We had exhausted all possibilities for getting a healthy Generation 2 app running on Convox. Having already reached out to the Convox team via Twitter to open some form of communication, we contacted their technical support team for assistance. Unbelievably, the only support engineer available was based in London and had approximately three weeks' experience on the job. As much as we tried to walk through the process with the support agent, we were unable to obtain any additional information or clues towards a working Generation 2 setup. However, the agent did confirm that we could create another Generation 1 Rack via a different route, which meant we could try to re-introduce the existing Generation 1 build code and setup we had all along.

3:00am We now had instances built and running in AWS using the Generation 1 build versions; however, Convox still deemed the instances not to be in a ready/healthy state, so the infinite loop remained and we were still unable to make changes without re-creating the entire Rack. This made no sense to us, because the Generation 1 build code was identical to what we had been running previously: it should either have always worked, or never worked at all.

4:00am We discovered a health/ready state fix that Convox would finally accept, so public requests could at least reach the Syngency platform to be served. Although we could respond to requests, we still had internal errors and exceptions and were unable to resume normal operations. Because our original production Rack had been deleted by Convox, the security group rules that allowed connections to our existing internal services (e.g. our database and in-memory cache) no longer applied, and those connections were being refused.

5:00am Our engineering team placed a support call with an AWS technician to consult on the best approach to rectifying these internal security and permission issues.
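In broad terms, restoring access meant re-authorising the new Rack's security group to reach the database and cache. The sketch below, using the AWS SDK for Python (boto3), shows the general shape of such a change; the group IDs and port are placeholders rather than our actual values, and this illustrates the approach, not the exact steps taken.

    import boto3

    ec2 = boto3.client("ec2")

    # Allow the new application security group to reach the database security
    # group on the database port. All identifiers and the port number below
    # are placeholders for illustration only.
    ec2.authorize_security_group_ingress(
        GroupId="sg-database-placeholder",
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            "UserIdGroupPairs": [{"GroupId": "sg-application-placeholder"}],
        }],
    )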

6:30am We now had an active connection to our main database; however, we still could not connect to our in-memory cache stores. These stores are responsible for caching database results, sending talent packages, storing talent/agency Instagram image URLs, and confirming user session sign-in details. While we continued working on the internal connections to the in-memory stores, we also refactored code to bypass them so users could at least sign in to Syngency. This was always intended as a temporary measure, with caching to be re-introduced later.

7:00am New in-memory cache stores were created via Convox, which in turn created the necessary new security groups.

7:30am Internally, the Syngency platform was operational; the remaining step was to reroute all internet traffic to the new server infrastructure.

9:45am Service restored to Syngency websites and admin interfaces.

10:45am Microservices (such as Casting Calls, the Gallery PDF generator, webhooks, and gallery downloads) were still affected at this stage. With the core platform stable, our engineering team dedicated its focus to bringing these services back online.

Next Steps

Having now worked through the outage, we have identified a number of key areas to improve moving forward. Listed below are action points that we will undertake immediately and in the coming weeks; we will update all customers on the progress of each. We want you to know that we hold ourselves accountable for providing a stable, available service, and for ensuring such events are mitigated in future.

Confirm the validity of Convox

To date, the Convox service has served Syngency well and allowed our engineering resources to focus solely on the application and feature requirements of our customers. To its credit, Convox has performed behind the scenes exactly as required for some time. However, having now dealt with a totally unacceptable and avoidable incident that put our relationships with valued customers in jeopardy, Convox may no longer be a good fit for our needs. We will re-evaluate our infrastructure tooling to ensure we have in place a solid foundation that can accommodate the expected growth of the Syngency platform as we move forward.

Status Page

In order to keep customers and partners updated on the status of any incident or outage, we have implemented a dedicated status page which will be updated while we work through any issues that arise. We will pin this status page on our social networks and additionally place a banner on the Syngency dashboard to alert you to the current status. This way, the team responding to the incident won't have to reply to customers individually, and you will have up-to-date information that you can access directly at any time.

Scheduled Outage Drills

As part of regular day-to-day business, we will plan scheduled outage drills. Much like a fire drill, we will place the Syngency platform in a non-usable state and work through resuming normal operations. Each drill will introduce a different initial error state to work through. The process for diagnosing and fixing each one will be documented so that learnings can be shared. Any staff new to the engineering team will be required to read this documentation and complete outage drills as part of their onboarding process. These drills will take place on a separate, isolated Rack, so they will have no effect on production systems.

Scheduled Tooling Upgrades

The source code that runs the Syngency platform is constantly kept up to date with the latest releases to ensure there are no security vulnerabilities or bugs that could disrupt our service. The same scheduled update process was not, however, being applied to our internal tooling, which is responsible for deploying the Syngency platform onto our cloud service provider (AWS). In this case, we were using the Generation 1 format with Convox when we should have already migrated to Generation 2. We will now incorporate tooling upgrades into our day-to-day business requirements. These upgrades will also be a prerequisite for the Scheduled Outage Drills above, helping ensure that our drills use the same tooling versions a real outage would involve.

Ignore Caching in Critical Circumstances

Caching is used across Syngency to improve user response times and reduce platform load at peak usage times. However, in some circumstances we should bypass caching in favour of serving a valid response. For example, the user sign-in flow uses caching to store GeoIP information as part of our internal security measures. As our main services started to come back online, users were still unable to sign in because our auxiliary caching service was not yet available. In cases like this, we should bypass the cache and allow the user to sign in as normal; users can then start to use the system while we continue bringing the auxiliary services back online.
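A minimal sketch of the fallback pattern described above, assuming a Redis-style cache (the host, key names, and lookup function here are placeholders, not Syngency's actual code):

    import redis

    cache = redis.Redis(host="cache.example.internal", port=6379, socket_timeout=1)

    def lookup_geoip(ip_address, fetch_from_source):
        # Prefer the cached GeoIP record, but never let a cache outage block
        # the sign-in flow: on any cache error, treat it as a miss and fall
        # back to the direct lookup.
        try:
            cached = cache.get(f"geoip:{ip_address}")
            if cached is not None:
                return cached
        except redis.exceptions.RedisError:
            pass  # cache unavailable; continue without it

        value = fetch_from_source(ip_address)
        try:
            cache.set(f"geoip:{ip_address}", value, ex=3600)
        except redis.exceptions.RedisError:
            pass  # the response is still served either way
        return value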

Separation of website service from admin service (medium/long-term)

For a number of our customers, the top priority was the unavailability of their public-facing website, rather than losing access to the Syngency admin interface.
Currently, our customer websites run on the same platform as the admin interface, which means an issue affecting one also affects the other. As a longer-term goal, we are looking to remove this dependency so that customer websites are not affected by any future platform outage.

Positives

Although we never want to repeat such an incident, there are some positive takeaways that we will continue to build on, as follows.

No data loss

Although we were unable to accept or process requests, no previously stored data was lost or corrupted. Even in the worst-case scenario of losing the database, we can restore from backup via daily snapshots.
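For reference, restoring from a daily snapshot on AWS is a single managed operation. A hedged sketch using boto3, assuming the database runs on Amazon RDS (the identifiers are placeholders, not our actual instance or snapshot names):

    import boto3

    rds = boto3.client("rds")

    # Restore a new database instance from a daily snapshot. Both identifiers
    # are placeholders for illustration only.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="restored-db-placeholder",
        DBSnapshotIdentifier="daily-snapshot-placeholder",
    )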

99.85% Uptime

An uptime of 99.85% corresponds to just over 13 hours of downtime per year. Our aim is, of course, to have no downtime whatsoever, but it is important to remember, in the bigger scheme of things, how available our service is and will continue to be.
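For clarity, the downtime figure follows directly from the uptime percentage:

    # Downtime implied by 99.85% uptime over one year
    downtime_hours = (1 - 0.9985) * 365 * 24
    print(round(downtime_hours, 1))  # 13.1 hours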

Tighter Relations with our Service Providers

Being able to highlight issues and potential fixes to our providers strengthens our relationships with them. In turn, this improves response times when we raise issues affecting service levels.

Amazing Customers

For the most part, customers who contacted us directly for updates about the outage were very understanding. Much like when the power goes out at home, there is not a lot you can do but be patient and trust that the issue is being addressed to the best of our abilities. For that, we thank you greatly for your understanding.

Please contact our team via Intercom or email support@syngency.com should you have any further questions or concerns relating to this incident.

Posted Oct 01, 2019 - 15:41 PDT

Resolved
Syngency is currently experiencing a major system-wide outage. This is affecting all admin interfaces, websites, and related services. Our engineering team is investigating the cause and will post a further update ASAP.
Posted Sep 25, 2019 - 23:30 PDT