DevOps Transformation
Oleksiy Volkov
04 January 2020
Microservices are taking our industry by storm – the benefits of loose coupling, process isolation and independent deployments are becoming quite clear to both developers and broader enterprises. What is often less clear (and is frequently forgotten or outright ignored) is the impact that managing and keeping the lights on for hundreds of discrete microservices has on DevOps and site reliability teams. The traditional approach of using standardised code libraries for enabling network resilience or run-time monitoring typically doesn’t scale in polyglot environments, and often runs into maintainability and enforcement issues. Thus, the alternative – isolating and separating these concerns into an external, out-of-process runtime component centrally managed by DevOps teams – became the core concept behind modern service meshes.
The rough perimeter of the problem space tackled by modern service meshes is usually defined as follows:
(1) reliable service (both external and cross-service) communications
(2) dynamic traffic routing and shaping
(3) continuous and easy observability and monitoring
(4) configuration management and security.
Let’s break these down further so we can better understand some of the practical use-case scenarios and benefits from using a service mesh.
The key building block of all service meshes today is a service proxy, typically implemented using a “sidecar” pattern. A sidecar proxy is effectively a separate component that gets deployed in its own process and/or container space and takes over communications between a microservice and the external world. This is usually achieved seamlessly via port redirection and typically requires zero code changes on the application side (one of the main reasons that service meshes today are more popular than the network resilience libraries of yore). A sample traffic flow for a data plane and sidecar proxy pattern is shown below.
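To make the pattern concrete, here is a deliberately simplified sketch of a sidecar-style forwarder: a separate process that accepts the application’s redirected outbound traffic on a local port and relays it to the real destination. This interception point is exactly where a production proxy layers in resilience, telemetry and TLS. This is not how Envoy or any real proxy is implemented; the port numbers and upstream address are invented for the example.

```python
# Toy sidecar forwarder: a separate process that relays the application's
# redirected traffic to the real destination.  LISTEN_PORT and UPSTREAM are
# placeholder values for illustration only.
import socket
import threading

LISTEN_PORT = 15001            # local port the app's traffic is redirected to
UPSTREAM = ("10.0.0.7", 8080)  # the actual destination service (made up)

def pipe(src, dst):
    """Copy bytes one way until the connection closes, then shut down."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        dst.close()

def handle(client):
    upstream = socket.create_connection(UPSTREAM)
    # Full-duplex relay between the application and the upstream service.
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    pipe(upstream, client)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", LISTEN_PORT))
server.listen()
while True:
    conn, _ = server.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```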
A service proxy effectively takes over both north/south and east/west communications and provides a data plane, which can enforce the following patterns to keep applications resilient in the face of adverse network, compute or traffic conditions:
Circuit breaking - preventing global/cascading outages caused by local slowdowns in the stack by quickly disconnecting the affected services or components (a minimal sketch of this pattern follows the list).
Automatic retries - ensuring that intermittent downstream issues do not cause massive functional failures in the consuming applications.
Health checks - ensuring that services are up and operational and being able to detect and remove bad services/nodes from load balancing rotations.
Backpressure/rate limits - being able to dynamically respond to and control inbound traffic in cases of overload situations with downstream services or performance or capacity issues with the service proxy itself.
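As a rough illustration of the first of these patterns, the sketch below shows the essence of a circuit breaker as a proxy might apply it: after a run of consecutive failures the breaker opens and subsequent calls fail fast until a reset timeout expires, giving the struggling downstream service time to recover. The thresholds and the wrapped call are placeholders rather than any particular mesh’s defaults.

```python
# Minimal circuit breaker sketch.  Thresholds are illustrative, not any
# mesh's default configuration.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # Fail fast while the circuit is open, letting the downstream recover.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None  # half-open: allow a trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result

# usage: breaker = CircuitBreaker(); breaker.call(call_downstream_service, request)
```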
The resilience features built into a modern-day proxy are essentially best-practice architecture patterns implemented externally to existing services: they can be deployed with no code changes and centrally configured and monitored. This provides low-friction peace of mind for distributed microservice deployments of most shapes and sizes.
Traffic routing and shaping
Dynamic traffic routing between services/clusters via API-driven settings, together with a high degree of customisation of traffic routing rules, allows service meshes to take on a lot of functionality previously only available via high-cost dedicated load balancers. Additionally, the ability to do percentage-based and/or conditional traffic redirection enables more complex scenarios like Blue/Green/Canary deployments, incremental deployments/production testing and seamless migrations from legacy/monolithic applications.
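The mechanics behind canary and Blue/Green releases largely boil down to weighted selection between backend versions. The sketch below illustrates that idea in a few lines; the cluster names and the 90/10 split are invented for the example, and a real mesh applies such weights inside the proxy based on centrally pushed configuration rather than in application code.

```python
# Percentage-based traffic splitting sketch.  Cluster names and weights
# are placeholders.
import random
from collections import Counter

ROUTES = [
    ("reviews-v1", 90),  # stable version keeps most of the traffic
    ("reviews-v2", 10),  # canary receives roughly 10% of requests
]

def pick_cluster(routes):
    clusters, weights = zip(*routes)
    return random.choices(clusters, weights=weights, k=1)[0]

# Route 1,000 simulated requests and observe the approximate 90/10 split.
print(Counter(pick_cluster(ROUTES) for _ in range(1000)))
```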
Service Discovery
With potentially hundreds of microservices deployed and recycled in a medium-sized enterprise, the static IP approach used by legacy load balancers becomes unwieldy. Modern service meshes provide a significantly more dynamic approach that relies on an internal (or external) service registry and/or DNS to keep track of available services, their endpoints and load balancing configurations, and enables easy support for capacity scaling, health checking and graceful failure handling. These registries are typically API-enabled and can emit events to address use cases not directly supported by out-of-the-box functionality.
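Conceptually, such a registry tracks which endpoints exist for a service, which of them are currently healthy, and hands back only healthy ones at resolution time. The toy sketch below illustrates that contract; the service name and endpoints are made up, and real registries such as Consul or AWS Cloud Map expose this via APIs and DNS rather than in-process calls.

```python
# Toy in-memory service registry with health-aware lookup.  Service names
# and endpoints are invented for the example.
import random

class ServiceRegistry:
    def __init__(self):
        self._instances = {}  # service name -> {endpoint: healthy?}

    def register(self, name, endpoint):
        self._instances.setdefault(name, {})[endpoint] = True

    def mark_unhealthy(self, name, endpoint):
        self._instances[name][endpoint] = False

    def resolve(self, name):
        healthy = [ep for ep, ok in self._instances.get(name, {}).items() if ok]
        if not healthy:
            raise LookupError(f"no healthy instances for {name}")
        return random.choice(healthy)  # trivial load balancing over healthy nodes

registry = ServiceRegistry()
registry.register("payments", "10.0.1.12:8080")
registry.register("payments", "10.0.1.13:8080")
registry.mark_unhealthy("payments", "10.0.1.13:8080")
print(registry.resolve("payments"))  # always the remaining healthy instance
```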
Observability
Contextualised and properly aggregated metrics gathering, tracing and logging across a large distributed footprint is a non-trivial exercise. Most service meshes provide an internal framework to address this or integrate with an external logging aggregator like AWS CloudWatch in a way that preserves context, which significantly speeds up code debugging and issue triage in production or lower environments.
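A large part of what makes mesh-level tracing useful is simple context propagation: the proxy attaches a trace identifier to each outbound request so a collector can stitch the individual hops back together. The sketch below illustrates the idea using the widely adopted B3 header names; exact headers and sampling behaviour vary between tracing systems, and the hop structure here is invented for illustration.

```python
# Trace context propagation sketch using B3-style headers.
import uuid

def outbound_headers(incoming_headers):
    # Reuse the trace id if one arrived with the request, otherwise mint one.
    trace_id = incoming_headers.get("x-b3-traceid", uuid.uuid4().hex)
    return {
        "x-b3-traceid": trace_id,                 # shared by every hop of the request
        "x-b3-spanid": uuid.uuid4().hex[:16],     # unique to this hop
        "x-b3-parentspanid": incoming_headers.get("x-b3-spanid", ""),
    }

# First hop: no incoming trace context, so a new trace id is minted.
hop1 = outbound_headers({})
# Second hop: the trace id is carried forward; the parent span points at hop 1.
hop2 = outbound_headers(hop1)
print(hop1["x-b3-traceid"] == hop2["x-b3-traceid"])  # True
```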
Configuration management and security
All proxy and service configuration settings and policies, including Service Discovery, need to be tracked, stored and managed. This is typically handled by the control plane, which is distinct from the data plane (the sidecar proxies responsible for the actual traffic and communications to, from and between services). Additionally, some service meshes have recently started to include various security features, including access control, encryption and auditing, to provide a “single pane of glass” and simplify administration.
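A highly simplified way to picture the split: the control plane owns the desired routing and policy state and pushes it out, while each sidecar merely enforces the last configuration it received. Real meshes stream this over APIs such as Envoy’s xDS; the classes and settings below are invented purely to illustrate the direction of the flow.

```python
# Simplified control plane / data plane split.  Structures are illustrative only.
class Sidecar:
    def __init__(self, name):
        self.name = name
        self.config = {}

    def apply(self, config):
        self.config = dict(config)  # the data plane just enforces what it is told

class ControlPlane:
    def __init__(self):
        self.desired = {"routes": {}, "mtls": False}  # desired state of the mesh
        self.sidecars = []

    def register(self, sidecar):
        self.sidecars.append(sidecar)
        sidecar.apply(self.desired)

    def update(self, **changes):
        self.desired.update(changes)
        for sc in self.sidecars:  # push the new state to every proxy
            sc.apply(self.desired)

cp = ControlPlane()
cp.register(Sidecar("orders-sidecar"))
cp.update(mtls=True, routes={"reviews": {"v1": 90, "v2": 10}})
```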
Service meshes vs API gateways and load balancers
Looking at the problem space defined by the core aspects above, one cannot miss a healthy overlap with two other types of solutions which address similar concerns – load balancers (health checks, traffic routing and, when combined with scaling sets/groups, capacity management) and API gateways (ingress management, mapping edge APIs to internal services, security). While clearly not the same, the biggest distinction often comes not in terms of functions and features, but rather in terms of solution context and primary focus. Load balancers are typically deployed to manage traffic between more traditional instance-based or low-density containerised services. API gateways are more applicable to edge situations where there is a need to front-end a set of internal services and enable additional features like traffic throttling, API remapping and/or access controls. That being said, there is a steady expansion of the service mesh circle in the diagram below, as vendors continue to add new features. With the increased sophistication of ingress controllers and more complex dynamic routing features, one cannot help but imagine future service mesh/task orchestration frameworks that completely avoid the need for traditional edge or traffic management layers.
Major Service Mesh products on the market
There is a significant amount of activity in the service mesh space today, with a competitive line-up that is quite diverse in terms of focus and capabilities. Let's take a look at the major players on the market and review their sweet spots and emerging trends.
HashiCorp's Consul/Consul Connect is one of the more well-known mesh solutions; it supports the Envoy proxy and can be deployed both on-premises and in the cloud. The original Consul product was developed to provide robust service discovery/service configuration functionality (control plane) and became quite popular as it was lightweight and easy to deploy and manage. Consul Connect (released in 2018) added network connectivity, security and observability, with the ability to connect meshes across data centres. One of the unique capabilities of Consul is the ability to support services across diverse types of compute (containers, VMs, bare metal hardware, etc.). Consul also advertises a pluggable data plane, which means it is possible to connect other sidecar proxies (such as Linkerd) should one require a more specialised configuration. Another highlight of Consul is its ability to integrate with other HashiCorp products, including Nomad (a workload orchestration platform for any type of compute configuration) and Vault (a secrets management tool), to provide complete data/control plane and configuration management functionality for large distributed microservices deployments. These can be based on a hybrid instance- or container-based topology, which is one of the more attractive features of HashiCorp's ecosystem.
While the inclusion of the Envoy proxy is a relatively recent development for the Consul family of products, HashiCorp has been adding mesh features at a steady pace, bringing Consul Connect on par with the more advanced offerings on the market.
Istio is an open source service mesh developed by a consortium of IBM, Lyft and Google in 2017 and is currently part of Google Cloud's Anthos service offering. It is also based on the Envoy proxy and provides one of the more complete mesh feature sets, covering most of the core pillars described above. This comes at the cost of increased complexity, as Istio's control plane includes a total of four internal components and requires an external service catalog (Kubernetes, Consul, etc.) and a state store (etcd) for complete functionality. In addition to an Envoy-based data plane, Istio provides a pluggable policy enforcement and telemetry collection module (Mixer), a mixed-compute compatible Service Discovery component (Pilot), authentication and identity management (Citadel) and a service configuration abstraction layer (Galley).
While Istio is generally focused on Kubernetes, it is possible to leverage some of its features, like Service Discovery and traffic routing, with more traditional VM-based services via the mesh expansion feature (this does require a deployed Kubernetes cluster, however).
Istio is one of the three key components of GCP's Anthos offering, which is fast becoming one of the key differentiators for Google Cloud in terms of providing quick application modernisation capabilities for legacy clients via a combination of Migrate for Anthos (a migration service that transforms on-premises legacy applications into containerised workloads), Google Kubernetes Engine (GKE), Knative (a Kubernetes simplification/management layer) and Istio. Service mesh and service discovery capabilities play a key role here in providing seamless hybrid and multi-cloud capabilities that work across datacentre and public cloud boundaries.
Linkerd was one of the original competitors to Envoy, developed in Rust/Scala by Buoyant and now an open source project under the Cloud Native Computing Foundation. It is marketed as an ultralight/ultrafast service mesh alternative and supports both Kubernetes-based (2.x) and mixed (1.x) workloads, with both lines in active development. Linkerd eschews some of the more advanced services (such as an internal ingress controller or advanced routing), or relies on external components for them, in order to focus on the core concerns of observability, reliability and security while providing high performance and easy configuration. In some independent testing it has been shown to be significantly faster than Istio, making it a good candidate for enterprises looking for a simple and flexible mesh implementation.
AWS App Mesh was introduced during the AWS re:Invent conference in late 2018 and is specific to the AWS public cloud. Similar to Istio and Consul Connect, it is also based on the Envoy proxy and is compatible with service workloads running on all types of AWS-supported containerised and VM-based systems using a sidecar Envoy container. While somewhat late to the game, AWS App Mesh has been steadily gaining features and is likely to be a significant recipient of AWS development investment in the near future, given the budding popularity of meshes in the industry. Its immediate capabilities include service discovery, basic routing and observability, with support for more advanced routing scenarios and network resilience coming in the near future. AWS App Mesh provides out-of-the-box integration with AWS Cloud Map (Service Discovery), deployment automation (CloudFormation) and standard AWS telemetry and tracing components (CloudWatch and X-Ray).
While both AWS and GCP provide well-defined mesh service offerings (along with the ability to self-deploy any of the open source options described above), Microsoft’s direction with Azure is somewhat more complicated.
While Microsoft did roll out Azure Service Fabric Mesh in 2018, it is really a broader offering that combines proprietary container orchestration and service mesh functionality. The service is currently in public preview, with limited detailed documentation available for the mesh component; it appears to be another Envoy-based implementation providing core network connectivity, resilience and observability features. In parallel, Microsoft is developing a strong mesh presence in the Kubernetes ecosystem (which it directly supports via Azure's AKS service) and has partnered with all three major open source mesh projects (Istio, Linkerd and Consul) to provide mesh functionality for AKS. Additionally, Microsoft has played a pivotal role in the SMI (Service Mesh Interface) initiative, aimed at creating a standard interface and a basic feature set for meshes on Kubernetes. Lastly, it is developing the open source Dapr project, which uses the concept of a sidecar proxy to provide a rich run-time injectable interface for composing various aspects of microservice functionality on the fly.
| Mesh Framework | Open Source/Proprietary | Soundbite | Mixed compute support | Network resilience | Traffic routing | Security/Access Control | Service Discovery, Observability | Other features/notes |
|---|---|---|---|---|---|---|---|---|
| Istio | Open Source | Feature rich, complex | Requires Kubernetes, can be extended to support VM workloads | Advanced | Advanced (L7) | Advanced (L7), TLS/cert management | Advanced, requires external registry | Requires etcd for state management |
| Linkerd | Open Source | Lightweight, easy to deploy, high performance | 1.x supports mixed workloads | Advanced | Basic (relies on external ingress controller) | Basic (auto TLS) | Observability via Prometheus, distributed tracing | 2.x is focused on Kubernetes-only deployments |
| Consul Connect | Open Source | HashiCorp ecosystem | Mixed workloads via Consul | Basic via embedded proxy, supports Envoy | Advanced (L7) | Advanced (L4), TLS/cert management | Advanced, uses Consul registry | Single binary, integrates with other HashiCorp tools (Nomad, Vault) |
| Azure Service Fabric Mesh | Proprietary | Azure Service Fabric only | Azure Service Fabric containers only | Detailed specs currently not published by Microsoft | | | | In preview; combination of container orchestration and service mesh features |
| Istio, Linkerd or Consul on Azure AKS | Open Source | Azure AKS | Azure AKS | Per vendor support (see above) | | | | |
| AWS App Mesh | Proprietary | AWS native | AWS compute (ECS, EKS, EC2, K8s on EC2) | Basic | Basic | Basic (mTLS support on the roadmap) | Advanced, uses AWS Cloud Map and X-Ray | |
A curious case of multi-mesh
For large enterprise compute environments that often combine on-premises and public cloud (and sometimes multi-cloud) deployments, it may be beneficial to set up multiple service meshes to optimally tune configuration to specific types of use-cases or non-functional requirements. While some meshes provide a limited capability in this space (e.g. Istio via the mesh expansion feature or Consul Connect via Consul’s broader orchestration capability), there is an emerging need for a true “meta” layer that would simplify management across various mesh implementations and enable rapid prototyping and experimentation.
This is the focus of SuperGloo, a “multi-mesh” open-source framework developed by Solo.io, aimed at simplifying and centralising orchestration and control plane management across most of the popular service meshes. Today, SuperGloo supports most of the popular open source options (Istio, Consul Connect, Linkerd 2) as well as AWS App Mesh, which is likely to cover 80% or more of the existing mesh market. While support for specific mesh functionality varies (SuperGloo was first released in 2018 and is yet to reach 1.0) and the framework is not likely to be ready for major production use-cases just yet, it is certainly an interesting option for companies looking to leverage multiple cloud vendors, set up diverse microservice environments or enable experimentation and easy migration between service meshes.
SuperGloo conceptual architecture.
Source: Solo.io
One of the intriguing capabilities of multi-mesh frameworks like SuperGloo is the ability to abstract the ingress controller and effectively connect any mesh implementation with an ingress controller of choice. This creates a great synergy with another Solo.io framework – Gloo – a universal ingress controller that can provide API gateway, routing and security across a wide variety of compute environments, including instances, containers and serverless. A combination of these frameworks effectively eliminates the need for dedicated API gateways and load balancers for modern microservices environments and provides nearly unlimited flexibility in backend deployment and operation of service components.
Oleksiy Volkov
Lead Architect
Oleksiy’s focus is centred on large-scale cloud and enterprise transformation projects. An experienced technical leader and architect, he has spent most of his career building and fixing large distributed systems in the financial services sector. He is also passionate about voice and home automation and has a vast amount of experience in annoying the members of his household with his Alexa and Google Home projects. Outside of work, Oleksiy is an avid outdoorsman and enjoys hiking, skiing and mountain biking.