CloudBees is the only cloud company focused on servicing the complete develop-to-deploy lifecycle of Java web applications in the cloud – where customers do not have to worry about servers, virtual machines or IT staff. The CloudBees platform today includes DEV@cloud, a service that lets developers take their build and test environments to the cloud, and RUN@cloud, which lets teams seamlessly deploy these applications to production on the cloud.
In the last two weeks, CloudBees faced two outages: one related to its main infrastructure provider, Amazon, and the other to a bug in the Linux kernel. This report shares in greater detail how users of the CloudBees services were impacted and how CloudBees reacted to fix those problems. Note: You can check the status of CloudBees' services at any time on the CloudBees status page.
(1) AWS Outage
A few days ago, Amazon Web Services (AWS) suffered an outage in one of the data centers (a zone) of its US-EAST region. You can read the details of what happened on AWS's RSS feed or on ZDNet, among others.
The most notable element of this outage is that even though only one zone was physically impacted (a region is comprised of multiple, supposedly independent zones), an API endpoint responsible for EBS storage running in that zone didn't fail cleanly; this in turn prevented the corresponding API endpoints in the other zones from handling any mutating request. Consequently, while servers that had been started before the outage in the three other zones kept functioning, most AWS API calls requesting a new resource (such as starting a new server backed by an EBS store) failed in all zones. Since all four zones are theoretically supposed to operate independently (i.e., a problem in one zone should not break the API in all four zones), this outage was pretty bad, and a number of cloud vendors that had relied on a multi-zone architecture to improve the resiliency of their service were caught off guard. This multi-zone single point of failure is clearly a weakness that AWS urgently needs to fix.
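To make that failure mode concrete, here is a minimal sketch of a multi-zone failover attempt, written with the boto3 library; the zone names, instance type, and error handling are illustrative assumptions, not CloudBees' actual tooling. During the outage, the launch call failed in every zone, not just the broken one, so this kind of fallback loop never succeeded.

```python
# Illustrative sketch: try to launch an EBS-backed replacement instance in
# each zone of us-east-1, falling through to the next zone on failure.
# During the outage the call failed in *all* zones, because the EBS
# control-plane dependency was shared across them.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]  # assumed names

def launch_replacement(ami_id):
    for zone in ZONES:
        try:
            resp = ec2.run_instances(
                ImageId=ami_id,            # an EBS-backed AMI
                InstanceType="m1.large",   # illustrative type
                MinCount=1,
                MaxCount=1,
                Placement={"AvailabilityZone": zone},
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            # Expected: an error only in the broken zone. Observed: errors
            # in every zone, so the failover never succeeded.
            print(f"launch failed in {zone}: {err}")
    return None
```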
What impact did it have on CloudBees users and
customers? It depends on the services and SLA they had subscribed to. Let’s go
through the list.
Subversion and Maven repositories weren't impacted by the AWS outage: these services are relatively "static" (i.e., they don't require much elasticity from the underlying IaaS) and hence make few calls to the AWS API (which was down, for the most part), so they came through unscathed. No repository data was lost.
Our Jenkins as a Service offering is split across multiple zones. For customers whose master was hosted in the impacted zone, we would normally have restarted those instances in another zone; but, as discussed above, the AWS API was unable to serve the required requests in any of the other zones (not just the impacted one), which prevented us from doing that migration. For the Jenkins masters that were not hosted in the impacted zone, some builds were slower to start, since we couldn't start any new build machines in any of the four zones (again, while only one zone was impacted, the API went down in all four). Over the last few months, we have worked to greatly reduce our dependency on the AWS API for our Jenkins as a Service offering; this effort kept us from being overly sensitive to the outage even while the AWS API was not functional. No Jenkins data was lost.
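One common way to reduce that kind of API dependency is to keep a warm pool of pre-provisioned build machines, so that starting a build doesn't require a cloud API call. The sketch below illustrates the general idea only; it is an assumption for illustration, not a description of CloudBees' actual implementation.

```python
# Illustrative sketch of a warm pool: builds draw machines from a
# pre-provisioned set, so a dead provisioning API degrades capacity
# gradually instead of blocking every new build immediately.
import queue

class WarmPool:
    def __init__(self, machines):
        self._idle = queue.Queue()
        for m in machines:
            self._idle.put(m)

    def acquire(self, timeout=30):
        """Hand out an already-running machine; no cloud API call needed."""
        try:
            return self._idle.get(timeout=timeout)
        except queue.Empty:
            # Only here would we fall back to the cloud API to grow the
            # pool; while the API is down, builds keep getting machines
            # until the pool drains.
            raise RuntimeError("pool exhausted and provisioning API is down")

    def release(self, machine):
        self._idle.put(machine)
```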
On our PaaS deployment platform, customers who happened to run in the impacted zone and who hadn't set their application as "highly available" within CloudBees may have had their instances impacted. Free applications were most vulnerable to downtime, since those are not clustered. Customers who were operating under an HA setup were able to keep running throughout the outage. However, since the AWS API was dead in all four zones, we weren't able to start new nodes in a healthy zone to bring clusters back to their cruising size. Customers running in Europe or in another data center (HP, etc.) weren't impacted at all by the AWS outage: CloudBees' core PaaS servers, which are fully HA, remained up and running at all times [*]. It is worth noting that CloudBees can replicate applications not only amongst multiple zones but also amongst multiple regions, which offers a very high level of availability. No application data was lost.
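To illustrate what "highly available" buys you here, the sketch below shows the general anti-affinity idea (not CloudBees' actual scheduler): an HA application's replicas are spread across distinct zones, so a single-zone failure always leaves at least one replica serving traffic.

```python
# Illustrative sketch: spread an app's replicas across distinct zones so
# that a single-zone outage cannot take down every instance at once.
def place_replicas(app, replica_count, zones):
    if replica_count > 1 and len(zones) < 2:
        raise ValueError("HA placement needs at least two zones")
    # Round-robin across zones guarantees no single zone holds all replicas.
    return [(f"{app}-{i}", zones[i % len(zones)]) for i in range(replica_count)]

print(place_replicas("myapp", 3, ["us-east-1a", "us-east-1b", "us-east-1c"]))
# [('myapp-0', 'us-east-1a'), ('myapp-1', 'us-east-1b'), ('myapp-2', 'us-east-1c')]
```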
Concerning the CloudBees MySQL service, customers who had standalone MySQL instances running in the impacted zone could no longer access their data (though the application accessing it was probably impacted as well). Customers who had opted for the CloudBees clustered database offering and who had their master node in the impacted zone couldn't write or update data (we needed access to a working EBS API to perform the switch). Clustered customers with their master in another zone weren't impacted. Last but not least, customers with a database running in a healthy zone weren't impacted by this outage. No databases were lost.
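Why does a master switch need the EBS API at all? When the master's data lives on an EBS volume, promoting a standby typically means detaching that volume and reattaching it to the new master, which is exactly the kind of mutating EBS call that was failing region-wide. A minimal sketch of that pattern follows; the boto3 usage, device name, and IDs are illustrative assumptions, not CloudBees' production tooling.

```python
# Illustrative sketch: failing over a database master whose data lives on
# an EBS volume takes two mutating EBS API calls; both were failing
# region-wide during the outage, so the switch could not be performed.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def fail_over(volume_id, dead_master_id, standby_id):
    # Detach the data volume from the dead master (mutating EBS call #1).
    ec2.detach_volume(VolumeId=volume_id, InstanceId=dead_master_id, Force=True)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    # Attach it to the standby so it can be promoted (mutating call #2).
    ec2.attach_volume(VolumeId=volume_id, InstanceId=standby_id, Device="/dev/sdf")
```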
We are constantly working to improve our resilience to IaaS outages. As a result, the customers who opted for CloudBees' HA features didn't suffer from AWS's recent outage. If anything, this outage should remind our customers that they can decide, on an application-by-application basis, what type of SLA they want by selecting the appropriate service level that CloudBees already offers to deliver high availability.
(2) Leap Second Linux Bug
A little after midnight GMT on July 1, CloudBees monitoring alerts indicated unusually high CPU levels across a number of servers. We narrowed the problem down to an apparent Linux kernel issue that resulted in CPU exhaustion after the leap second took effect. We responded by restarting the affected EC2 instances, which restored normal operations for most of our users' applications. There were cases, however, where some applications needed to be moved to new AWS EC2 instances. The migration to the new set of EC2 instances was rolled out gradually over a three-hour window to minimize the impact on our users. At this time, our environment is operating normally, but we continue to monitor it.
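For operators who ran into the same kernel bug, the workaround that circulated widely at the time (short of a full reboot) was to reset the system clock, which clears the kernel's stuck timer state. A minimal sketch of that remediation, shown in Python purely for illustration; it must be run as root:

```python
# Illustrative sketch of the widely reported leap-second workaround:
# resetting the system clock (equivalent to the shell one-liner
#   date -s "$(date)")
# clears the kernel's stuck timer state. Requires root.
import subprocess

def reset_clock():
    now = subprocess.check_output(["date"], text=True).strip()
    subprocess.run(["date", "-s", now], check=True)

if __name__ == "__main__":
    reset_clock()
```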
**[Update: We have determined that some applications with clustered (HA) configurations experienced downtime when they were restarted. Our investigation found that this was due to a router configuration problem, which resulted in requests failing to make it to the running application instances. Subsequent app restarts resolved the issue, but this did result in downtime for some "HA"-configured apps. Apps configured with New Relic uptime monitoring generated the proper alerts when they became unavailable.]**