<img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=4958233&amp;fmt=gif">
 
RSS Feed

DevOps Transformation | Oleksiy Volkov |
04 January 2020

Microservices are taking our industry by storm – the benefits of loose coupling, process isolation and independent deployments are becoming quite clear to both developers and broader enterprises. What is often less clear (and is frequently forgotten or outright ignored) is the impact that managing and keeping the lights on for hundreds of discrete microservices has on DevOps and site reliability teams. The traditional approach of using standardised code libraries for enabling network resilience or run-time monitoring typically doesn’t scale in polyglot environments, and often runs into maintainability and enforcement issues. Thus, the alternative – isolating and separating these concerns into an external, out of process runtime component centrally managed by DevOps teams – became the core concept behind modern service meshes.

The rough perimeter of the problem space tackled by modern service meshes is usually defined as follows:

(1) reliable service (both external and cross-service) communications
(2) dynamic traffic routing and shaping
(3) continuous and easy observability and monitoring
(4) configuration management and security.

Let’s break these down further so we can better understand some of the practical use-case scenarios and benefits from using a service mesh.

Reliable network communications

The key building block of all service meshes today is a service proxy, typically implemented using a "sidecar" pattern. A sidecar proxy is effectively a separate component that is deployed in its own process and/or container space and takes over communications between a microservice and the external world. This is usually achieved seamlessly via port redirection and typically requires zero code changes on the application side (one of the main reasons that service meshes today are more popular than the network resilience libraries of yore). A sample traffic flow for a data plane and sidecar proxy pattern is shown below.

[Figure: sample traffic flow with a sidecar proxy data plane]
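To make the sidecar idea more concrete, here is a minimal, illustrative sketch in Go of the forwarding skeleton such a proxy sits on: it listens on a local port that the service's traffic is redirected to and relays the bytes to the real destination. The port, service address and file name are assumptions for illustration only; production proxies such as Envoy or linkerd2-proxy add TLS, load balancing, retries and telemetry on top of this.

```go
// sidecar_forwarder.go - a minimal, illustrative sidecar-style TCP forwarder.
// A real mesh proxy adds TLS, retries, load balancing and telemetry; this
// only shows the interception/forwarding skeleton.
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	const listenAddr = ":15001"          // local port traffic is redirected to (assumed)
	const upstreamAddr = "payments:8080" // real destination service (assumed)

	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("sidecar listening on %s, forwarding to %s", listenAddr, upstreamAddr)

	for {
		client, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			upstream, err := net.Dial("tcp", upstreamAddr)
			if err != nil {
				// a real proxy would retry or trip a circuit breaker here
				log.Printf("upstream dial failed: %v", err)
				return
			}
			defer upstream.Close()
			go io.Copy(upstream, c) // client -> upstream
			io.Copy(c, upstream)    // upstream -> client
		}(client)
	}
}
```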


A service proxy effectively takes over both north/south and east/west communications and provides a data plane, which can apply the following patterns to ensure application resilience in the face of adverse network, compute or traffic conditions:

Circuit breaking - preventing global/cascading outages caused by local slowdowns in the stack by quickly disconnecting the affected services or components.

Automatic retries - ensuring that intermittent downstream issues do not cause massive functional failures in the consuming applications.

Health checks - verifying that services are up and operational, and detecting and removing bad services/nodes from load balancing rotations.

Backpressure/rate limits - dynamically responding to and controlling inbound traffic when downstream services are overloaded, or when the service proxy itself runs into performance or capacity issues.

The resilience features built into a modern-day proxy are essentially the typical best-practice architecture patterns, implemented externally to existing services, deployable with no code changes, and centrally configured and monitored. This provides low-friction peace of mind for distributed microservice deployments of most shapes and sizes.
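As a rough illustration of the kind of logic a proxy applies on the application's behalf, the Go sketch below combines a naive circuit breaker with bounded retries around an outbound call. The thresholds, backoff and breaker behaviour are simplified assumptions, not any particular mesh's implementation; in a real mesh they live in centrally managed proxy configuration rather than in application code.

```go
// resilience.go - illustrative circuit breaker with bounded retries.
// Thresholds and timings are arbitrary; a mesh proxy applies the same ideas
// per upstream cluster, configured centrally rather than in application code.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

const (
	maxFailures = 3               // consecutive failures before opening the circuit
	openWindow  = 5 * time.Second // how long to fail fast once open
	maxRetries  = 2               // retries per request for intermittent errors
)

type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time
}

func (b *breaker) call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return errors.New("circuit open: failing fast") // protect the struggling upstream
	}
	b.mu.Unlock()

	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			b.mu.Lock()
			b.failures = 0 // success resets the failure count
			b.mu.Unlock()
			return nil
		}
		time.Sleep(time.Duration(attempt+1) * 100 * time.Millisecond) // simple backoff
	}

	b.mu.Lock()
	b.failures++
	if b.failures >= maxFailures {
		b.openUntil = time.Now().Add(openWindow) // open the circuit
		b.failures = 0
	}
	b.mu.Unlock()
	return err
}

func main() {
	b := &breaker{}
	flaky := func() error { // stand-in for a downstream call
		if rand.Intn(3) == 0 {
			return nil
		}
		return errors.New("upstream timeout")
	}
	for i := 0; i < 10; i++ {
		fmt.Println(i, b.call(flaky))
	}
}
```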

Traffic routing and shaping

Dynamic traffic routing between services/clusters via API-driven settings, together with a high degree of customisation of traffic routing rules, allows service meshes to take on a lot of functionality previously only available in high-cost dedicated load balancers. Additionally, the ability to perform percentage-based and/or conditional traffic redirection enables more complex scenarios like Blue/Green/Canary deployments, incremental deployments/production testing and seamless migrations from legacy/monolithic applications.
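The mechanics behind percentage-based redirection are essentially weighted selection between backend versions. The Go sketch below illustrates a 90/10 canary split; the service names and weights are assumptions, and in a real mesh the weights are expressed as API-driven routing rules applied by the proxies rather than code.

```go
// canary.go - illustrative weighted routing between two service versions.
// In a mesh the weights are API-driven proxy configuration; this only shows
// the selection logic behind a percentage-based (e.g. canary) split.
package main

import (
	"fmt"
	"math/rand"
)

type backend struct {
	name   string
	weight int // relative traffic share
}

// pick selects a backend with probability proportional to its weight.
func pick(backends []backend) backend {
	total := 0
	for _, b := range backends {
		total += b.weight
	}
	n := rand.Intn(total)
	for _, b := range backends {
		if n < b.weight {
			return b
		}
		n -= b.weight
	}
	return backends[len(backends)-1]
}

func main() {
	routes := []backend{
		{name: "orders-v1", weight: 90}, // stable version gets ~90% of traffic
		{name: "orders-v2", weight: 10}, // canary gets ~10%
	}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(routes).name]++
	}
	fmt.Println(counts) // roughly 9000/1000
}
```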

Service Discovery

With potentially hundreds of microservices deployed and recycled in a medium-sized enterprise, the static IP approach used by legacy load balancers becomes unwieldy. Modern service meshes provide a significantly more dynamic approach that relies on an internal (or external) service registry and/or DNS to keep track of available services, their endpoints and load balancing configurations, and enables easy support for capacity scaling, health checking and graceful failure handling. These registries are typically API-enabled and can emit events to address various use cases not directly supported by out-of-the-box functionality.
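A minimal sketch of the registry concept, in Go with assumed names: services register their endpoints, and a lookup by logical service name returns only the currently healthy instances instead of a static IP list. Real registries (Consul, AWS Cloud Map, Kubernetes DNS) add health probes, TTLs, change events and an HTTP/DNS API on top of this core idea.

```go
// registry.go - illustrative in-memory service registry.
// Real registries add health checking, TTLs, change events and an HTTP/DNS
// API; this shows only the core register/resolve lookup.
package main

import (
	"fmt"
	"sync"
)

type instance struct {
	addr    string
	healthy bool
}

type registry struct {
	mu       sync.RWMutex
	services map[string][]instance
}

func newRegistry() *registry {
	return &registry{services: map[string][]instance{}}
}

// register adds an endpoint for a logical service name.
func (r *registry) register(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.services[service] = append(r.services[service], instance{addr: addr, healthy: true})
}

// resolve returns only healthy endpoints, so failed nodes drop out of rotation.
func (r *registry) resolve(service string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []string
	for _, in := range r.services[service] {
		if in.healthy {
			out = append(out, in.addr)
		}
	}
	return out
}

func main() {
	reg := newRegistry()
	reg.register("payments", "10.0.1.17:8080")
	reg.register("payments", "10.0.1.18:8080")
	fmt.Println(reg.resolve("payments")) // the proxy load-balances across these
}
```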

Observability

Contextualised and properly aggregated metrics gathering, tracing and logging across a large distributed footprint is a non-trivial exercise. Most service meshes provide an internal framework to address this or integrate with an external logging aggregator like AWS CloudWatch in a way that preserves context, which significantly speeds up code debugging and issue triage in production or lower environments.
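Much of that context preservation comes down to attaching a correlation identifier to every request and forwarding it on each hop so logs and traces can be stitched together. The Go sketch below shows the bare idea with an assumed X-Request-Id header; mesh sidecars and tracing systems do the equivalent (plus spans, timings and sampling) without application changes.

```go
// tracing.go - illustrative propagation of a request ID across service hops.
// A mesh sidecar attaches and forwards similar headers automatically; the
// header name and handler here are assumptions for illustration.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

const requestIDHeader = "X-Request-Id" // assumed correlation header

func newID() string {
	b := make([]byte, 8)
	rand.Read(b)
	return hex.EncodeToString(b)
}

// withRequestID ensures every inbound request carries a correlation ID and
// logs it, so entries from different services can be stitched together.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(requestIDHeader)
		if id == "" {
			id = newID() // first hop: mint a new ID
		}
		w.Header().Set(requestIDHeader, id)
		log.Printf("request_id=%s path=%s", id, r.URL.Path)
		next.ServeHTTP(w, r)
		// Outbound calls made while handling this request should copy the
		// same header so downstream logs share the correlation ID.
	})
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", withRequestID(hello)))
}
```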

Configuration management and security

All proxy and service configuration settings and policies, including Service Discovery, need to be tracked, stored and managed. This is typically handled by the control plane, which is distinct from the data plane (the sidecar proxies responsible for the actual traffic and communications to, from and between services). Additionally, some service meshes have recently started to include various security features, including access control, encryption and auditing, to provide a "single pane of glass" and simplify administration.
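One rough way to picture the control plane/data plane split: the control plane owns a versioned desired state (routes, policies, certificates) and the sidecars subscribe to it, so operators change configuration in one place and every proxy converges on it. The Go sketch below is a conceptual model with assumed types, not any mesh's actual API; Envoy-based meshes implement this kind of distribution with the xDS protocol.

```go
// controlplane.go - conceptual model of a control plane pushing desired
// configuration to sidecar proxies. The types and channel-based "push" are
// assumptions for illustration; real meshes use APIs such as Envoy xDS.
package main

import "fmt"

type routeRule struct {
	service string
	version string
	weight  int
}

type proxyConfig struct {
	revision int
	routes   []routeRule
}

func main() {
	updates := make(chan proxyConfig, 1) // control plane -> sidecar channel

	// Control plane: an operator shifts 10% of traffic to v2 in one place.
	updates <- proxyConfig{
		revision: 42,
		routes: []routeRule{
			{service: "orders", version: "v1", weight: 90},
			{service: "orders", version: "v2", weight: 10},
		},
	}
	close(updates)

	// Data plane: each sidecar applies whatever revision it receives.
	for cfg := range updates {
		fmt.Printf("sidecar applied revision %d: %+v\n", cfg.revision, cfg.routes)
	}
}
```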

Service meshes vs API gateways and load balancers

Looking at the problem space defined by the core aspects above, one cannot miss a healthy overlap with two other types of solutions which address similar concerns: load balancers (health checks, traffic routing and, when combined with scaling sets/groups, capacity management) and API gateways (ingress management, mapping edge APIs to internal services, security). While clearly not the same, the biggest distinction often comes not in terms of functions and features, but rather in terms of solution context and primary focus. Load balancers are typically deployed to manage traffic between more traditional instance-based or low-density containerised services. API gateways are more applicable to edge situations where there is a need to front-end a set of internal services and enable additional features like traffic throttling, API remapping and/or access controls. That being said, there is a steady expansion of the service mesh circle in the diagram below, as vendors continue to add new features. With the increased sophistication of ingress controllers and more complex dynamic routing features, one cannot help but imagine future service mesh/task orchestration frameworks that completely avoid the need for traditional edge or traffic management layers.

[Figure: overlap between service meshes, load balancers and API gateways]


Major Service Mesh products on the market

There is a significant amount of activity in the service mesh space today, with a competitive line-up that is quite diverse in terms of focus and capabilities. Let's take a look at the major players on the market and review their sweet spots and emerging trends.

HashiCorp's Consul/Consul Connect is one of the more well-known mesh solutions; it supports the Envoy proxy and can be deployed both on-premises and in the cloud. The original Consul product was developed to provide robust service discovery/service configuration functionality (control plane) and became quite popular as it was lightweight and easy to deploy and manage. Consul Connect (released in 2018) added network connectivity, security and observability, with the ability to connect meshes across data centres. One of the unique capabilities of Consul is its support for services across diverse types of compute (containers, VMs, bare metal hardware, etc.). Consul also advertises a pluggable data plane, which means it is possible to connect other sidecar proxies (such as Linkerd) should one require a more specialised configuration. Another highlight of Consul is its ability to integrate with other HashiCorp products, including Nomad (a workload orchestration platform for any type of compute configuration) and Vault (a secrets management tool), to provide complete data/control plane and configuration management functionality for large distributed microservices deployments. These can be based on a hybrid instance- or container-based topology, which is one of the more attractive features of HashiCorp's ecosystem.

While the inclusion of the Envoy proxy is a relatively recent development for the Consul family of products, HashiCorp has been adding mesh features at a steady pace, bringing the Consul Connect component on a par with more advanced offerings on the market.

Istio is an open source service mesh developed by a consortium of IBM, Lyft and Google in 2017, and it is currently part of Google Cloud's Anthos service offering. It is also based on the Envoy proxy and provides one of the more complete mesh feature sets, covering most of the core pillars described above. This comes at the cost of increased complexity: Istio's control plane includes a total of four internal components and requires an external service catalog (Kubernetes, Consul, etc.) and a state store (etcd) for complete functionality. In addition to an Envoy-based data plane, Istio provides a pluggable policy enforcement and telemetry collection module (Mixer), a mixed-compute compatible Service Discovery component (Pilot), authentication and identity management (Citadel) and a service configuration abstraction layer (Galley).

[Figure: Istio architecture. Source: Istio.io]

While Istio is generally focused on Kubernetes, it is possible to leverage some of its features, like Service Discovery and traffic routing, with more traditional VM-based services via a mesh expansion feature (this does, however, require a deployed Kubernetes cluster).

Istio is one of the three key components of GCP's Anthos offering, which is fast becoming one of the key differentiators for Google Cloud in terms of providing quick application modernisation capabilities for legacy clients via a combination of Migrate for Anthos (a migration service that transforms on-premises legacy applications into containerised workloads), Google Kubernetes Engine (GKE), Knative (a Kubernetes simplification/management layer) and Istio. Service mesh and service discovery capabilities play a key role here in providing seamless hybrid and multi-cloud capabilities that work across datacentre and public cloud boundaries.

Linkerd, developed in Rust/Scala by Buoyant and now an open source project under the Cloud Native Computing Foundation, was one of the original competitors to Envoy. It is marketed as an ultralight/ultrafast service mesh alternative and supports both Kubernetes-based (2.x) and mixed (1.x) workloads, with both versions in active development. Linkerd eschews some of the more advanced features (such as an internal ingress controller or advanced routing), or relies on external components for them, in order to focus on the core concerns of observability, reliability and security while providing high performance and easy configuration. In some independent testing it has been shown to be significantly faster than Istio, and it is a good candidate for enterprises looking for a simple and flexible mesh implementation.

[Figure: Linkerd control plane]


AWS App Mesh was introduced during the AWS re:Invent conference in late 2018 and is specific to the AWS public cloud. Similar to Istio and Consul Connect, it is also based on the Envoy proxy and is compatible with service workloads running on all types of AWS-supported containerised and VM-based systems using a sidecar Envoy container. While somewhat late to the game, AWS App Mesh has been steadily gaining features and is likely to be a significant recipient of AWS development investment in the near future, given the budding popularity of meshes in the industry. Its immediate capabilities include service discovery, basic routing and observability, with support for more advanced routing scenarios and network resilience coming in the near future. AWS App Mesh provides out-of-the-box integration with AWS Cloud Map (Service Discovery), deployment automation (CloudFormation) and standard AWS telemetry and tracing components (CloudWatch and X-Ray).

While both AWS and GCP provide well-defined mesh service offerings (along with the ability to self-deploy any of the open source options described above), Microsoft’s direction with Azure is somewhat more complicated.

While Microsoft did roll out Azure Service Fabric Mesh in 2018, it is really a broader offering that combines proprietary container orchestration and service mesh functionality. The service is currently in public preview, with limited detailed documentation available for the mesh component; what is known is that it is another Envoy-based implementation providing core network connectivity, resilience and observability features. In parallel, Microsoft is developing a strong mesh presence in the Kubernetes ecosystem (which it directly supports via Azure's AKS service) and has partnered with all three major open source meshes (Istio, Linkerd and Consul) to provide mesh functionality for AKS. Additionally, Microsoft has played a pivotal role in the SMI (Service Mesh Interface) initiative, aimed at creating a standard interface and basic feature set for service meshes on Kubernetes. Lastly, it is developing the open source Dapr project, which uses the sidecar proxy concept to provide a rich, run-time injectable interface for composing various aspects of microservice functionality on the fly.

Summary of key service mesh frameworks.

Istio (Open Source)
- Soundbite: feature rich, complex
- Mixed compute support: requires Kubernetes; can be extended to support VM workloads
- Network resilience: advanced
- Traffic routing: advanced (L7)
- Security/access control: advanced (L7), TLS/cert management
- Service discovery and observability: advanced; requires an external registry
- Other notes: requires etcd for state management

Linkerd (Open Source)
- Soundbite: lightweight, easy to deploy, high performance
- Mixed compute support: 1.x supports mixed workloads
- Network resilience: advanced
- Traffic routing: basic (relies on an external ingress controller)
- Security/access control: basic (automatic TLS)
- Service discovery and observability: observability via Prometheus, distributed tracing
- Other notes: 2.x is focused on Kubernetes-only deployments

Consul Connect (Open Source)
- Soundbite: HashiCorp ecosystem
- Mixed compute support: mixed workloads via Consul
- Network resilience: basic via embedded proxy; supports Envoy
- Traffic routing: advanced (L7)
- Security/access control: advanced (L4), TLS/cert management
- Service discovery and observability: advanced; uses the Consul registry
- Other notes: single binary; integrates with other HashiCorp tools (Nomad, Vault)

Azure Service Fabric Mesh (Proprietary)
- Soundbite: Azure Service Fabric only
- Mixed compute support: Azure Service Fabric containers only
- Network resilience, traffic routing, security: detailed specs currently not published by Microsoft; in preview mode
- Other notes: combination of container orchestration and service mesh features

Istio/Linkerd/Consul on Azure AKS (Open Source)
- Mixed compute support: Azure AKS only
- Other columns: per vendor support (see above)

AWS App Mesh (Proprietary)
- Soundbite: AWS native
- Mixed compute support: AWS compute (ECS, EKS, EC2, Kubernetes on EC2)
- Network resilience: basic
- Traffic routing: basic
- Security/access control: basic (mTLS support on the roadmap)
- Service discovery and observability: advanced; uses AWS Cloud Map and X-Ray

A curious case of multi-mesh

For large enterprise compute environments that often combine on-premises and public cloud (and sometimes multi-cloud) deployments, it may be beneficial to set up multiple service meshes to optimally tune the configuration to specific types of use cases or non-functional requirements. While some meshes provide limited capability in this space (e.g. Istio via the mesh expansion feature, or Consul Connect via Consul's broader orchestration capability), there is an emerging need for a true "meta" layer that would simplify management across various mesh implementations and enable rapid prototyping and experimentation.

This is the focus of SuperGloo, a "multi-mesh" open-source framework developed by Solo.io, aimed at simplifying and centralising orchestration and control plane management across most of the popular service meshes. Today, SuperGloo supports most of the popular open source options (Istio, Consul Connect, Linkerd 2) as well as AWS App Mesh, which is likely to cover 80% or more of the existing mesh market. While the support for specific mesh functionality varies (SuperGloo was first released in 2018 and is yet to reach 1.0) and the framework is not likely to be ready for major production use cases just yet, it is certainly an interesting option for companies looking to leverage multiple cloud vendors, set up diverse microservice environments or enable experimentation and easy migration between service meshes.

[Figure: SuperGloo conceptual architecture. Source: Solo.io]

One of the intriguing capabilities of multi-mesh frameworks like SuperGloo is the ability to abstract the ingress controller and effectively connect any mesh implementation with an ingress controller of choice. This creates a great synergy with another Solo.io framework – Gloo – a universal ingress controller that can provide API gateway, routing and security across a wide variety of compute environments, including instances, containers and serverless. A combination of these frameworks effectively eliminates the need for dedicated API gateways and load balancers for modern microservices environments and provides nearly unlimited flexibility in backend deployment and operation of service components.

Oleksiy Volkov

Lead Architect

Oleksiy’s focus is centred on large-scale cloud and enterprise transformation projects. An experienced technical leader and architect, he has spent most of his career building and fixing large distributed systems in the financial services sector. He is also passionate about voice and home automation and has a vast amount of experience in annoying the members of his household with his Alexa and Google Home projects. Outside of work, Oleksiy is an avid outdoorsman and enjoys hiking, skiing and mountain biking.

 
