The Importance of Anomaly Detection for Service Mesh Monitoring

By Yuval Dror

Many companies today have adopted the new norm of rapid iteration in software development and now live by Mark Zuckerberg’s famous motto, “Move fast and break things.” This mentality has led to the growth in popularity of a service-oriented architecture (SOA) approach to software design. In particular, we’ve seen the rise of microservices, which are an SOA-style approach to software development where companies deploy business logic in small, independent services.

While the microservices approach has several advantages, such as reduced risk, faster deployment, and better scalability, it also brings its own set of unique challenges.

As software development teams often deploy tens, hundreds, or even thousands of features each day, one of the main operational challenges with microservices is making sure that new features don’t break anything within a microservice and, more importantly, that a change to one microservice doesn’t break other, dependent microservices.

In this article, we’ll discuss one of the technologies used to address this complexity: anomaly detection for service mesh.

What is Service Mesh?

Service-oriented architectures require dedicated tools that control service-to-service communication. In particular, as network communication between microservices grows in scale and complexity, it becomes impossible to manually manage deployments, troubleshoot issues, and maintain cluster security. Service mesh technologies give you an additional layer of insight and improve observability, traffic management, and deployment management, as well as enhancing security within the mesh. Many tools and standards have been created to address service mesh complexity; these are summarized on the Layer5 website. CNCF projects such as OpenTelemetry, Envoy, and Prometheus have become especially popular.

  • OpenTelemetry: OpenTelemetry describes itself as an open-source observability framework. In particular, it provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application.
  • Envoy Proxy: Originally built at Lyft, Envoy is an open-source edge and service proxy designed specifically for cloud-native applications. It was created to address two of the main microservices issues we’ve discussed: networking and observability.
  • Prometheus: Prometheus is another open-source solution for event monitoring and alerting. It collects real-time metrics from configured targets, evaluates rule expressions, displays results, and can trigger alerts. (A minimal instrumentation sketch follows this list.)
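
To make this concrete, here is a minimal sketch of how a microservice might expose request metrics for Prometheus to scrape, using the prometheus_client Python library. The metric names, labels, and port are illustrative choices, not a prescribed convention:

    # Sketch: expose request metrics from a microservice so Prometheus can scrape them.
    # Metric names, labels, and the port are illustrative assumptions.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter(
        "orders_requests_total",            # hypothetical metric name
        "Total requests handled by the orders service",
        ["status"],
    )
    LATENCY = Histogram(
        "orders_request_duration_seconds",  # hypothetical metric name
        "Request latency of the orders service",
    )

    def handle_request() -> None:
        """Simulate handling one request and record its latency and status."""
        with LATENCY.time():                # records the request duration
            time.sleep(random.uniform(0.01, 0.1))
        status = "200" if random.random() > 0.05 else "500"
        REQUESTS.labels(status=status).inc()

    if __name__ == "__main__":
        start_http_server(8000)             # metrics served at :8000/metrics
        while True:
            handle_request()

A Prometheus server configured to scrape this endpoint would then collect the counters and histograms that dashboards and alerting rules build on.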

Drawbacks of the Service Mesh Monitoring Paradigm

One of the main issues with service mesh monitoring tools is that once you have a large number of microservices, purely manual observability becomes unrealistic and impractical.

In the current paradigm of service mesh monitoring, the tools provide components responsible for verifying that services meet their service-level agreements. For example, the service mesh Istio collects the following types of measurement in order to provide overall service mesh observability:

  • Metrics: These are generated based on the Envoy Proxy statistics. Some are defined by Istio as the “golden signals” of monitoring (latency, traffic, errors, and saturation); a query sketch for these metrics follows this list.
  • Distributed Traces: Istio also generates distributed trace spans for each service.
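
To show what consuming these golden-signal metrics can look like, the sketch below queries Istio’s standard istio_requests_total counter through the Prometheus HTTP API and computes a mesh-wide 5xx error rate. The Prometheus address and the five-minute window are assumptions:

    # Sketch: compute an error rate from Istio's standard metrics via the
    # Prometheus HTTP API. The Prometheus URL and the 5-minute window are assumed.
    import requests

    PROMETHEUS = "http://prometheus.istio-system:9090"   # assumed address
    QUERY = (
        'sum(rate(istio_requests_total{response_code=~"5.."}[5m])) '
        "/ sum(rate(istio_requests_total[5m]))"
    )

    def error_rate() -> float:
        """Return the mesh-wide share of requests answered with a 5xx code."""
        resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY})
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        # An empty result means no matching traffic in the window.
        return float(result[0]["value"][1]) if result else 0.0

    if __name__ == "__main__":
        print(f"mesh-wide 5xx error rate: {error_rate():.2%}")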

Open-source projects like Istio are very useful for collecting metrics that allow developers to create dashboards. This process works well if you’re dealing with a smaller application and there’s a dedicated team monitoring and adjusting alerts. If you’re working on a project with a large-scale deployment, however, these manual processes are much less effective.

Without the ability to visually monitor multiple clusters, service mesh technologies need to go beyond “observing” and move towards automated anomaly detection.

Anomaly Detection for Service Mesh

Anomaly detection that employs machine learning has many benefits over traditional monitoring methods, such as automatically learning the behavioral patterns of each new microservice and automatically sending alerts when significant changes are detected. These capabilities lower the time it takes to detect anomalies and help prevent the problem from spreading further.
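
The underlying idea can be sketched in a few lines of Python: learn a rolling baseline for each metric and flag values that deviate sharply from it. Production systems use far more sophisticated, learned models; the window size and three-sigma threshold here are arbitrary choices for illustration:

    # Sketch of baseline-based anomaly detection: keep a rolling window of
    # recent values per metric and flag points that deviate strongly from it.
    # The window size and 3-sigma threshold are illustrative choices.
    from collections import defaultdict, deque
    from statistics import mean, stdev

    WINDOW = 60          # number of recent samples forming the baseline
    THRESHOLD = 3.0      # deviations beyond this many sigmas are anomalous

    history = defaultdict(lambda: deque(maxlen=WINDOW))

    def is_anomalous(metric: str, value: float) -> bool:
        """Return True if `value` deviates sharply from the metric's baseline."""
        window = history[metric]
        anomalous = False
        if len(window) >= 10:                      # need some history first
            mu, sigma = mean(window), stdev(window)
            if sigma > 0 and abs(value - mu) > THRESHOLD * sigma:
                anomalous = True
        window.append(value)
        return anomalous

    # Example: a latency series that suddenly jumps.
    for i, latency_ms in enumerate([20, 22, 21, 19, 23, 20, 21, 22, 20, 21, 95]):
        if is_anomalous("checkout.latency_ms", latency_ms):
            print(f"sample {i}: {latency_ms} ms looks anomalous")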

AI-based anomaly detection integrates with the service mesh as a whole in order to track high-level KPIs as well as the most granular signals from each microservice.

Anomaly detection for service mesh monitoring is still an emerging field, but if you’re reviewing the available solutions, here are a few considerations to keep in mind:

  • Fully Autonomous: As mentioned, the service mesh of a large-scale deployment is impossible to monitor manually, so the first consideration is to ensure that the solution can independently track and learn from data in real time.
  • False Positive Rate: Next, you want to look for a solution with a low false-positive rate; otherwise, it can produce unnecessary noise and create alert fatigue.
  • Correlation: Finally, an AI-based anomaly detection solution should be able to automatically learn the topology of the mesh and connect the dots (a correlation sketch follows this list).
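
As a rough illustration of the correlation point, the sketch below groups simultaneous anomalies using the mesh’s call topology and treats the most downstream anomalous service as the likely root cause. The service names and dependency graph are invented:

    # Sketch: correlate concurrent anomalies using the mesh's call topology.
    # Anomalous services with no anomalous downstream dependency are reported
    # as likely root causes. The graph and service names are made up.
    DEPENDENCIES = {                 # caller -> callees (hypothetical topology)
        "frontend": {"checkout", "catalog"},
        "checkout": {"payments", "inventory"},
        "catalog": {"inventory"},
    }

    def likely_root_causes(anomalous: set) -> set:
        """Keep only anomalous services whose dependencies all look healthy."""
        roots = set()
        for service in anomalous:
            callees = DEPENDENCIES.get(service, set())
            if not callees & anomalous:      # nothing it calls is also broken
                roots.add(service)
        return roots

    # frontend, checkout, and payments all alert at once; payments is the culprit.
    print(likely_root_causes({"frontend", "checkout", "payments"}))
    # -> {'payments'}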

With an anomaly detection solution, you not only get alerted about critical incidents but can also see a chronological list of correlated anomalies. This means you can easily trace back to the root of the anomaly and ensure it doesn’t happen again.

As we’ve discussed, service mesh monitoring has become an essential part of managing microservices because it provides insight into service-to-service communication. As the deployment of microservices grows, however, observability becomes increasingly impractical.

Pairing service mesh technologies with an AI-based anomaly detection solution solves this challenge by enabling you to detect incidents in real time and reliably reduce your time to resolution.
