Google will attempt to reduce the risk of unexpected problems in its production environment by rolling out smaller changes in future, following a post-mortem into an intermittent Compute Engine issue over the weekend.
The company's Compute Engine cloud service experienced packet loss and performance problems for almost three-quarters of an hour as a result of a configuration change in Google's network stack, the web services giant said.
Google engineers attempted to provide greater isolation between virtual machines (VMs) and projects by capping the amount of traffic allowed to individual VMs.
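The mechanism Google describes amounts to per-VM rate limiting. As an illustration only, and not Google's actual implementation, a traffic cap of this kind is commonly built on a token bucket; the sketch below uses hypothetical instance names and rate figures.

```python
import time

class TokenBucket:
    """Illustrative per-VM traffic cap: a bucket that refills at a fixed
    rate and rejects packets once the burst allowance is spent."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec      # sustained cap for this VM
        self.capacity = burst_bytes         # short-term burst allowance
        self.tokens = burst_bytes
        self.last_refill = time.monotonic()

    def allow(self, packet_bytes):
        now = time.monotonic()
        # Refill in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True      # within the VM's cap; forward the packet
        return False         # cap exceeded; drop or delay the packet

# Hypothetical usage: one bucket per VM, keyed by instance name.
caps = {"vm-instance-1": TokenBucket(rate_bytes_per_sec=10_000_000,
                                     burst_bytes=1_000_000)}
```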
While the configuration change had been tested prior to production deployment, once introduced into the live system it caused some VMs to behave unexpectedly.
Google rolled back the configuration change and said it was investigating why the testing it conducted did not adequately predict the performance of the VM isolation mechanism while in a production environment.
The company said it would from now on also roll out changes to small parts of the production environment first, to reduce the risk posed by any unexpected behaviour.
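Rolling a change out to a small slice of production before touching the rest is the standard canary pattern. A minimal sketch of the idea, with the zone list, health checks and rollback hook all standing in as hypothetical placeholders for whatever signals Google actually uses:

```python
def canary_rollout(zones, apply_change, roll_back, is_healthy,
                   initial_fraction=0.01):
    """Apply a change to a small batch of the fleet first, verify health,
    then progressively widen the rollout; undo everything touched if a
    health check fails."""
    remaining = list(zones)
    applied = []
    fraction = initial_fraction
    while remaining:
        batch_size = max(1, int(len(zones) * fraction))
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for zone in batch:
            apply_change(zone)
            applied.append(zone)
        if not all(is_healthy(zone) for zone in applied):
            for zone in reversed(applied):
                roll_back(zone)
            raise RuntimeError("canary batch unhealthy; change rolled back")
        fraction *= 2  # widen only after the smaller slice has proven healthy
```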
The Compute Engine problem followed an 84-minute issue over the same weekend in which some App Engine applications experienced high error rates when accessing Google Application Programming Interfaces (APIs) over hypertext transfer protocol (HTTP).
Google traced the problem to its HTTP load balancing fabric, which received a surge of legitimate traffic.
That traffic increase exceeded the load balancing fabric's provisioned capacity, and automatically triggered Google's denial of service (DoS) defenses.
This in turn redirected some of the excess incoming traffic to a CAPTCHA test that Google uses to foil bots.
As the clients did not expect to be redirected to a CAPTCHA, many issued automated retries, which further increased the traffic and exacerbated the problem, Google said.
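The amplification Google describes is a classic retry storm: clients treat the unexpected response as a transient failure and retry immediately, adding to the very overload that caused it. One common client-side mitigation is exponential backoff with jitter, sketched below; the endpoint and retry limits are hypothetical.

```python
import random
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_attempts=5, base_delay=0.5):
    """Retry a request with exponential backoff plus random jitter, so that
    many clients hitting the same overloaded endpoint do not retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == max_attempts - 1:
                raise
            # Wait 0.5s, 1s, 2s, ... plus jitter before trying again.
            time.sleep(base_delay * (2 ** attempt) +
                       random.uniform(0, base_delay))

# Hypothetical usage against an API endpoint:
# data = fetch_with_backoff("https://www.googleapis.com/some/api")
```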