Correction-Of-Error (COE) at Amazon — penalty or opportunity?

Aswin Rajeev
4 min read · Mar 24, 2024

--

Correction-Of-Error (COE) documents are quite infamous among new joiners at Amazon. They are the documents you get to write when you screw things up, which can then put you in the spotlight for a while. That is not the intent, but it is often so in practice. Having a COE with your name on it is not something you would boast about.

This story is about how I got my first COE within 3 months of moving to a new team, and how I navigated it. And not just one: I had two back-to-back COEs in my name. It all started with the first change I made on a system that was completely new to me.

A brief context around what led to the COE

Soon after moving to the new team, I was looking to pick up some backend work after a successful stint ramping up and delivering some UI (React) work. The team's backlog had some migrations to address non-compliance in a service that was rarely touched. In short, the work involved migrating a few client connections to a different form of authentication, and it felt to me like an easy first pick. There were detailed instructions written in internal wikis (yes, Amazon has tons of those) and it seemed straightforward. So when this backlog item caught my eye, I decided to pick it up.

The changes were deployed to production a day before I went on a vacation outside India, after thorough testing in the development environment and a subsequent QA sign-off in the Gamma environment. Unknown to me at that time, the QA test plan did not cover several key features, so the impacted areas were not tested at all. The change broke in production, affecting two distinct features. The worst part was that we learned about the issues only after a significant time, when employees outside the team reported them.

The change broke the features because certain configuration in Gamma/Prod differed from that in Beta (the development environment), which neither I nor the reviewers of the change knew. Had the affected features been tested in Gamma, the issue would have been caught, but there too luck was not in my favour. And luck was not the only thing to blame: I could have been more diligent in diving deep while making the change, or verified the fix in Gamma/Prod myself.
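
A check of the following kind could surface such environment differences before a deployment. This is only a minimal sketch: the function name, config keys and values are hypothetical and do not reflect the actual service or any Amazon tooling.

```python
MISSING = "<missing>"  # marker for a key that exists in only one environment


def diff_config(base: dict, other: dict) -> dict:
    """Return {key: (base_value, other_value)} for every key that differs."""
    drift = {}
    for key in sorted(base.keys() | other.keys()):
        left = base.get(key, MISSING)
        right = other.get(key, MISSING)
        if left != right:
            drift[key] = (left, right)
    return drift


# Hypothetical configuration, for illustration only.
beta = {"auth_mode": "legacy", "client_timeout_ms": 500, "allowed_clients": "a,b"}
gamma = {"auth_mode": "new", "client_timeout_ms": 500}

for key, (beta_value, gamma_value) in diff_config(beta, gamma).items():
    print(f"{key}: beta={beta_value!r} gamma={gamma_value!r}")
```

Running a comparison like this as part of the review would at least make reviewers aware that Beta is not representative of Gamma/Prod for the keys being changed.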

Welcomed back with a COE, or two!

When I returned from the vacation, only one of the issues had been reported. I was welcomed with the news that I had to write a COE. COEs are intended to investigate what caused the impact and propose how such issues could be averted in future; yes, they actually serve a noble cause. My analysis revealed gaps in the development processes we followed, in the quality assurance mechanisms and, more importantly, in our service metrics and alarms (I’ve written about these in detail in a previous article). Interestingly, all the errors were masked as 4XX-series issues in the metrics and never came to the notice of the development team through alarms.
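
To illustrate the alarm gap, here is a small sketch of an alarm condition that watches the 4XX rate as well as the 5XX rate. It is not the team's actual alarm setup: the function names and thresholds are assumptions chosen for the example.

```python
from collections import Counter


def error_rates(status_codes):
    """Return the fraction of responses in the 4xx and 5xx classes."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    classes = Counter(code // 100 for code in status_codes)
    return {"4xx": classes[4] / total, "5xx": classes[5] / total}


def should_alarm(status_codes, max_4xx_rate=0.05, max_5xx_rate=0.01):
    """Alarm when either error class exceeds its (hypothetical) threshold."""
    rates = error_rates(status_codes)
    return rates["4xx"] > max_4xx_rate or rates["5xx"] > max_5xx_rate


# A window where a broken auth change surfaces only as 403 responses.
window = [200] * 80 + [403] * 20
print(error_rates(window))
print(should_alarm(window))
```

An alarm that looks only at the 5XX rate would stay silent for this window, which is how client-visible breakage can hide behind 4XX responses.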

By the time the COE was almost concluded and I had started addressing the action items, the second issue was reported, and it warranted another COE. Considering the possible reputation damage (perhaps!) and the common trigger, I was given the option to merge the two issues into a single COE. However, the root cause of the failure and the steps to mitigate it were largely different in the two cases. I was worried about people being judgemental about my performance for having two back-to-back COEs immediately after joining the new team, but I felt it was better to treat the two separately for better remediation. My manager shared this opinion too, and we decided to write a separate COE, letting people judge while we focused on improving the system.

The aftermath

COEs are tracked at an organisation level and carry strict deadlines for closing the action items. It took almost two months of significant effort to close all the gaps I found during the investigations. Since I was already working on a high-priority project, we had to manage the investigations and mitigations alongside regular project work. Those times were indeed hectic! In the end, I was content that all the identified gaps had been bridged, and I was lauded for the overall impact on quality, but my concerns about the repercussions this event would have on my career remained. All this happened around the annual appraisal cycle, so I knew no one was going to miss these lapses.

Things turned when my manager shared the appraisal results. Apparently, the COE turned into a slight advantage, as I had some tangible impact to show rather than an uneventful term in the team. Mistakes were there, but the appraisers were apparently generous about them, and it didn’t really ruin my ratings that badly. Moreover, the learning opportunity I had from this incident was enormous: in those two months, I almost became an expert in that service and in mitigating its shortcomings.

Conclusion

Yes, COEs are still dreaded by many people at Amazon. But for me, it indeed worked to my advantage in the greater scheme of things. My manager and a colleague I deeply respect said almost the same thing:

It’s an opportunity to showcase one’s ownership and relentless commitment to raising the bar (an indicator of quality at Amazon). The damage that is done is done; what remains is how one can fix it and make sure it never happens again. That learning is invaluable and needs to be shared.

--

Aswin Rajeev

Software Development Engineer (SDE II) at Amazon with over 10 years of experience in software engineering. https://www.aswin.me