Neutron is considered one of the most challenging OpenStack components to operate, and while we at Rackspace have gained deep experience operating it, there are still situations where we end up creating a new solution for an emerging problem.
Rackspace operates an OpenStack Private Cloud for a large industrial company to accelerate their IT service delivery into the cloud, and this customer recently reported a performance issue with the Neutron metadata service in a Linux bridge ML2 managed environment. The Neutron metadata service implements a proxy between OpenStack instances and the Nova and Neutron services to provide Amazon AWS EC2-style metadata.
Performance issues with this service, resulting in client timeouts or outright unavailability, directly impacted cloud user workloads and led to application unavailability. The issue was compounded by operating more than 1,000 instances inside a single layer 2 network.
This Neutron service is important to user instances for several reasons, including:
- Cloud Placement Decisions (What is my public IP, etc.)
- User Scripts and SSH Key injection into the boot process (typically via cloud-init)
Neutron provides this service by running an HTTP proxy server, the neutron-ns-metadata-proxy, inside a Linux network namespace. Using network namespaces is common practice for separating routing domains in Linux, allowing firewall (iptables) and routing processing independent of the host OS.
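To make the client side of this concrete, here is a hedged sketch of how a tool such as cloud-init queries the EC2-style endpoint that the proxy serves. The URL layout follows the EC2 metadata convention; the helper names are illustrative, not part of any OpenStack library.

```python
# Illustrative sketch of an instance-side metadata query. The base URL is
# the standard EC2-style metadata endpoint; the helper functions are
# made up for this example.
from urllib.request import urlopen

METADATA_BASE = "http://169.254.169.254/latest/meta-data/"

def metadata_url(key):
    """Build the URL for a metadata key, e.g. 'public-ipv4' or 'hostname'."""
    return METADATA_BASE + key

def fetch_metadata(key, timeout=5):
    """Fetch one metadata key; only works from inside a cloud instance."""
    with urlopen(metadata_url(key), timeout=timeout) as resp:
        return resp.read().decode()
```

Every such request from an instance lands on the namespace proxy described above, which is why its throughput matters so much.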
What happened to this service?
Our customer was reporting response times longer than 30 seconds for any request to the Neutron metadata service. Initial debugging on the user instances revealed that metadata requests were being intercepted by a security appliance, but excluding the standard metadata IP 169.254.169.254 from the proxy configuration via
export no_proxy="localhost,127.0.0.1,localaddress,.localdomain.com,169.254.169.254"
did not solve the issue.
At this point, we knew the issue was related to the Neutron service itself or the backend services it relies on, mainly the Nova API (compute) and RabbitMQ (the OpenStack message bus).
Looking at the requests the Neutron service handles, we identified an unusual pattern in the frequency of metadata requests per instance and realized that the Chef configuration management tool was requesting metadata well beyond the standard behavior expected of OpenStack instances during boot or reboot.
At this point, we knew the Chef ohai plugin “EC2 metadata” was being used to query the Neutron metadata service, and inefficiencies in this plugin's HTTP connection handling were already known, mainly its lack of support for persistent HTTP connections.
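The cost of missing persistence is easy to illustrate. The sketch below (standard library only, not the actual ohai plugin code) contrasts opening a fresh TCP connection per metadata key with reusing one HTTP/1.1 keep-alive connection for all keys; the key names and host parameters are illustrative.

```python
# Illustrative comparison: per-request connections vs. one persistent
# HTTP/1.1 connection. This is a stand-in for the ohai behavior, not the
# plugin's real code.
import http.client

KEYS = ["instance-id", "public-ipv4", "hostname"]

def fetch_without_persistence(host="169.254.169.254", port=80):
    # One TCP handshake per key -- the pre-fix style of access.
    results = {}
    for key in KEYS:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", "/latest/meta-data/" + key)
        results[key] = conn.getresponse().read()
        conn.close()
    return results

def fetch_with_persistence(host="169.254.169.254", port=80):
    # One connection reused for all keys (HTTP/1.1 keep-alive).
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        results = {}
        for key in KEYS:
            conn.request("GET", "/latest/meta-data/" + key)
            # Fully read each body before reusing the connection.
            results[key] = conn.getresponse().read()
        return results
    finally:
        conn.close()
```

With hundreds of instances each running Chef, every avoided handshake is one less connection the proxy has to service.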
Continuing the research on the Neutron service and looking for ways to improve response times, we identified that the neutron-ns-metadata-proxy service was only capable of opening 100 Unix sockets to the neutron-metadata-agent. These sockets are used to talk to the metadata agent across the network namespace boundary without opening additional TCP connections, mainly as a performance optimization.
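The forwarding idea can be sketched in a few lines: relay the raw HTTP request over a Unix domain socket instead of a TCP connection. This is a hedged illustration of the mechanism, not Neutron's implementation, and the socket path below is an assumption for the example.

```python
# Sketch of relaying a request over a Unix domain socket, the mechanism
# the proxy uses to reach the neutron-metadata-agent. The socket path is
# illustrative, not necessarily what a deployment uses.
import socket

AGENT_SOCKET = "/var/lib/neutron/metadata_proxy"  # assumed example path

def forward_request(raw_request, sock_path=AGENT_SOCKET):
    """Send one raw HTTP request over a Unix socket, return the raw reply."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(sock_path)
        client.sendall(raw_request)
        client.shutdown(socket.SHUT_WR)  # signal end of request
        chunks = []
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        client.close()
```

Each in-flight request holds one such socket open, which is exactly why a hard cap of 100 sockets becomes a concurrency ceiling.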
Unable at first to explain the 100-connection limit, especially in the absence of Neutron backend (neutron-server) or Nova API problems, we began looking at the Neutron source code and found a related change in the upstream code.
The Neutron commit added an option to parameterize the number of WSGI threads (WSGI, the Web Server Gateway Interface, is the standard interface between Python applications and web servers), but the same commit also lowered the default limit from 1000 to 100. This crucial information was absent from the Neutron release notes.
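Once parameterized, the pool size can be raised back through configuration. Assuming the option name introduced by that upstream change, `wsgi_default_pool_size` (verify the name and section against your Neutron release notes and sample configs), the override would look roughly like:

```ini
# neutron.conf -- option name assumed from the upstream change discussed
# above; confirm it exists in your Neutron release before relying on it.
[DEFAULT]
# The commit dropped the default from 1000 to 100; restore the old value.
wsgi_default_pool_size = 1000
```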
More importantly, we had just found the source of our 100 Unix socket limit.
This also explained our second observation: connections to the Neutron metadata service were being queued, causing the large delays in response times. The queueing results from the combination of the eventlet and greenlet network event libraries, a typical way of implementing non-blocking I/O in Python.
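The queueing effect itself is generic to any fixed-size worker pool. The sketch below uses a standard-library thread pool as a stand-in for eventlet's green-thread pool (an analogy, not Neutron's actual code): once every worker is busy, additional requests wait, and clients perceive the wait as slow responses.

```python
# Stand-in demonstration of pool-induced queueing, using a stdlib thread
# pool instead of eventlet green threads. Pool and request counts are
# scaled down for illustration.
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2   # stand-in for the proxy's lowered pool size of 100
REQUESTS = 6    # more concurrent clients than workers

def handle_request(i):
    time.sleep(0.1)  # pretend each proxied backend call takes 100 ms
    return i

start = time.monotonic()
with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    results = list(pool.map(handle_request, range(REQUESTS)))
elapsed = time.monotonic() - start
# Six requests over two workers run in roughly three waves, so the total
# is about 0.3 s instead of the ~0.1 s a large-enough pool would allow.
```

Scale the same arithmetic up to 1,000+ chatty instances against a 100-slot pool and the customer's 30-second response times stop being surprising.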
So what comes next?
We are currently solving the problem in multiple ways.
The immediate problem should be resolved by a fix to the Chef ohai plugin, proposed in Chef pull request #995, which finally introduces persistent HTTP connections and drastically reduces the need for parallel connections. Initial testing and results are encouraging.
More importantly, the Neutron community has re-implemented the neutron-ns-metadata-proxy on top of HAProxy to address performance issues. It remains to be verified whether the issue still occurs with this implementation.
Alternatively, there are Neutron network design decisions that can mitigate these problems. For example, one approach is to reduce the size of each Neutron L2 network to a /23 prefix or smaller (at most 512 addresses), which allows Neutron to scale out the metadata service.
This approach allows creating multiple Neutron routers, scaling the metadata service out onto additional Neutron agents, where each router is responsible only for serving the metadata requests of its own network. This is especially the case when the configuration option enable_isolated_metadata is set to True and project/tenant networks are attached to Neutron routers.
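The arithmetic behind that sizing advice is easy to check with Python's ipaddress module (the network addresses here are just examples):

```python
# How many instances can end up behind a single metadata proxy, by
# network size. Example networks only.
import ipaddress

one_big = ipaddress.ip_network("10.0.0.0/16")    # one large L2 segment
one_small = ipaddress.ip_network("10.0.0.0/23")  # one right-sized segment

print(one_big.num_addresses)    # 65536 addresses share one proxy
print(one_small.num_addresses)  # 512 addresses per network and router
```

Splitting one /16 into /23s spreads the same instance population across more than a hundred routers, each with its own metadata proxy.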
At Rackspace, we see the OpenStack market maturing and adoption accelerating. But operating a private cloud is extremely difficult: it is complex, and the deep expertise needed to resolve challenges like this one is hard to find. Many organizations of all sizes prefer to consume private clouds as a service so they can focus on their core business.
As OpenStack continues to mature, you need a partner who can eliminate the operational challenges and complexity. The team at Rackspace is the answer. We have successfully scaled OpenStack to thousands of nodes and operate private clouds for some of the largest companies in the world.
To learn more and ask questions about whether private cloud as a service might be a good fit for your organization, take advantage of a free strategy session with a private cloud expert — no strings attached. SIGN UP NOW.