By Michael de Hoog

On June 1st, Coinbase experienced an outage that impacted coinbase.com, pro.coinbase.com, and our mobile applications. Trading through the API, which accounts for the majority of trading volume, remained functional throughout this time. We quickly discovered the root cause and remediated the issue. This post provides some more detail about what occurred.

Traffic levels from 16:00 to 16:30 PDT.

Around 16:05 PDT, the price of BTC reached USD $10,000. In connection with the rising price, we experienced a 5x traffic spike over 4 minutes. Our autoscaling was unable to keep pace with this dramatic increase in traffic.

This traffic spike affected a number of our internal services, increasing latency between services. This led to process saturation of the web servers responsible for our API: the number of incoming requests was greater than the number of listening processes, so requests were either queued until they timed out or failed immediately. Our request error rate spiked to 50%, causing customers to experience errors when interacting with coinbase.com and our mobile apps.
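To make the saturation mechanism concrete, here is a minimal back-of-the-envelope sketch. It assumes a fixed pool of blocking worker processes where each worker handles one request at a time, so steady-state capacity is roughly workers divided by per-request latency. The function names and every number below are hypothetical and chosen for easy arithmetic; none of them come from the incident data.

```python
def max_throughput_rps(workers: int, latency_ms: float) -> float:
    """Upper bound on requests/second for a pool of blocking workers.

    Each worker serves one request at a time, so it completes
    1000 / latency_ms requests per second.
    """
    return workers * 1000.0 / latency_ms


def shed_fraction(arrival_rps: float, workers: int, latency_ms: float) -> float:
    """Fraction of incoming requests that cannot be served at steady state."""
    capacity = max_throughput_rps(workers, latency_ms)
    if arrival_rps <= capacity:
        return 0.0
    return 1.0 - capacity / arrival_rps


# Hypothetical baseline: 100 workers, 100 ms per request, 800 requests/s.
baseline = shed_fraction(arrival_rps=800, workers=100, latency_ms=100)

# Hypothetical spike: 5x the traffic while upstream latency doubles.
spike = shed_fraction(arrival_rps=4000, workers=100, latency_ms=200)

print(f"baseline shed: {baseline:.1%}, spike shed: {spike:.1%}")
```

The point of the sketch is that the two effects compound: more traffic raises the arrival rate while slower internal services lower the capacity of the same worker pool.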

The health check is also served by these saturated processes, which caused some instances to be marked as unhealthy and taken out of the load balancer, further exacerbating this issue.
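The feedback loop described above can be illustrated with a small simulation. This is a simplified model under stated assumptions, not a reconstruction of the actual load balancer: traffic is split evenly across healthy instances, an instance fails its health check whenever its share exceeds its capacity, and one overloaded instance is removed per health-check round. The function name and all parameters are hypothetical.

```python
from typing import List


def simulate_cascade(instances: int, capacity_rps: float, total_rps: float) -> List[int]:
    """Return the healthy instance count after each health-check round.

    Assumed model: load is spread evenly, and one overloaded instance
    is marked unhealthy and removed per round.
    """
    healthy = instances
    history = [healthy]
    while healthy > 0:
        per_instance = total_rps / healthy
        if per_instance <= capacity_rps:
            break  # every remaining instance can keep up; the fleet is stable
        healthy -= 1  # a saturated instance fails its check and is removed
        history.append(healthy)
    return history


# Hypothetical fleet of 10 instances, each able to serve 100 requests/s.
print(simulate_cascade(instances=10, capacity_rps=100.0, total_rps=900.0))
print(simulate_cascade(instances=10, capacity_rps=100.0, total_rps=1100.0))
```

In this model a fleet that is only slightly over capacity never stabilizes, because each removal pushes more load onto the instances that remain.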

Healthy instance count (peaks show deploys, dips show instances marked as unhealthy).

In an effort to mitigate the saturation, we redeployed the API at 16:20 PDT to increase the number of machines serving the traffic. Once this deploy completed, the previous deploy’s instances were taken out of rotation, leading to another 2-minute outage due to instances saturating and being marked unhealthy. This was handled automatically by our autoscaling.

Looking ahead

In response to these events, we’re working on a number of improvements. We have since fixed the health endpoint to ensure that saturated instances don’t get taken out of rotation. We’re working on reducing the impact of price-related traffic spikes through pre-scaling and caching. Longer term, we’re planning to improve our deployment process to mitigate some of the autoscaling issues we experienced.
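The post does not say how the health endpoint was changed, so the following is only a sketch of one common approach: answering the health check outside the request worker pool, so that "busy" and "broken" are no longer indistinguishable. The `InstanceState` type and both function names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical snapshot of one API instance."""
    process_alive: bool   # the server process is up
    busy_workers: int     # workers currently handling requests
    total_workers: int    # size of the worker pool


def shared_pool_health(state: InstanceState) -> bool:
    """Health check served by the same workers as API traffic.

    It can only answer if a worker is free, so a saturated instance
    looks the same as a dead one.
    """
    return state.process_alive and state.busy_workers < state.total_workers


def isolated_health(state: InstanceState) -> bool:
    """Health check answered outside the request worker pool.

    Saturation no longer affects the result; only a dead process fails.
    """
    return state.process_alive


saturated = InstanceState(process_alive=True, busy_workers=8, total_workers=8)
crashed = InstanceState(process_alive=False, busy_workers=0, total_workers=8)

for name, state in (("saturated", saturated), ("crashed", crashed)):
    print(name, shared_pool_health(state), isolated_health(state))
```

The trade-off in this design is that a saturated instance stays in rotation and keeps receiving traffic, which is only desirable when removing it would make the remaining instances worse off.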

We are committed to making Coinbase the easiest, most trusted place to buy, sell, and manage your cryptocurrency. If you’re interested in working on challenging availability problems and building the future of the cryptoeconomy, come join us!


Incident Post Mortem: June 1, 2020 was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
