Monitoring Multiple Kubernetes Clusters with Prometheus Federation

In today’s world of microservices and container orchestration, monitoring is vital to ensuring the health and performance of your applications. When you have multiple Kubernetes clusters in your infrastructure, each hosting different workloads, having a centralized monitoring solution becomes crucial. Prometheus, an open-source monitoring and alerting toolkit, is a popular choice due to its flexibility and powerful querying language, PromQL.

This article explores how to effectively monitor and federate data from three Kubernetes clusters: one central Prometheus stack that stores all the data, and two other clusters where Prometheus is deployed using Helm charts with custom configurations.

The Three Kubernetes Clusters

In our setup, we have three Kubernetes clusters. One cluster serves as the central Prometheus stack, while the other two clusters host applications that need monitoring. We’ll refer to these two clusters as “Cluster A” and “Cluster B.”

Configuration of Prometheus Helm Charts

In “Cluster A” and “Cluster B,” we’ve deployed Prometheus using Helm charts with some customizations. One critical customization is disabling persistent volume storage with --set server.persistentVolume.enabled=false. This prevents the Prometheus instances in these clusters from persisting data to disk; data is instead stored centrally in the main Prometheus stack.
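For reference, the same customization can be expressed in a values file instead of a --set flag. This is a sketch assuming the layout of the prometheus-community/prometheus chart:

```yaml
# values.yaml — equivalent of --set server.persistentVolume.enabled=false
# (assumes the prometheus-community/prometheus chart's value structure)
server:
  persistentVolume:
    enabled: false
```

Passing this file with helm install -f values.yaml keeps the customization under version control rather than buried in a shell command.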

Central Prometheus Stack Configuration

The central Prometheus stack is where the magic happens. The configuration in its prometheus.yml file is the key to federating data effectively. Let's take a closer look at the relevant configuration:

  # Global configuration options, such as evaluation_interval, scrape_interval, etc.
  global:
    scrape_interval: 15s
    evaluation_interval: 15s

  scrape_configs:
    - job_name: 'federate'
      scrape_interval: 15s
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          - '{job="prometheus"}'
          - '{__name__=~"job:.*"}'
      static_configs:
        - targets:
          - 'source-prometheus-1:9090'
          - 'source-prometheus-2:9090'

  remote_write:
    - url: 'http://main-prometheus-federation:9091/write'

In the prometheus.yml file, we have a scrape_configs section with a job named 'federate'. This job scrapes the /federate endpoint of two source Prometheus instances, source-prometheus-1 and source-prometheus-2. The params section passes 'match[]' selectors that filter which series are federated: metrics carrying the label job="prometheus" and metrics whose names match the regular expression job:.*. The honor_labels: true setting preserves the original labels of the federated series rather than overwriting them with the scraper's own. Finally, the remote_write section forwards the collected samples on to the main federation endpoint for storage.
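To see what this scrape config does on the wire, here is a small sketch that builds the URL the central Prometheus requests from each source instance. Each selector in the params section becomes a repeated match[] query parameter (the base URL and matchers are taken from the config above):

```python
from urllib.parse import urlencode


def federate_url(base: str, matchers: list[str]) -> str:
    """Build the /federate URL a federating Prometheus scrapes.

    Each matcher becomes one repeated 'match[]' query parameter,
    which is how the params section is encoded in the HTTP request.
    """
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base}/federate?{query}"


url = federate_url(
    "http://source-prometheus-1:9090",
    ['{job="prometheus"}', '{__name__=~"job:.*"}'],
)
print(url)
```

Requesting this URL (for example with curl) returns the matching series in the Prometheus text exposition format, which is exactly what the federate job ingests.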

The params section in the prometheus.yml file is a powerful tool to filter and select the specific metrics you want to federate. In addition to the example provided above, here are a few more matchers (some using regular expressions) you can use to filter metrics based on various criteria:

    - '{job="prometheus"}'  # Match metrics with the "job" label equal to "prometheus"
    - '{__name__=~"job:.*"}'  # Match metrics with metric names that start with "job:"
    - '{job=~"web|app"}'  # Match metrics where the "job" label is either "web" or "app"
    - '{environment!="production"}'  # Match metrics where the "environment" label is not "production"
    - '{status_code=~"2..|3.."}'  # Match metrics where the "status_code" label is a 2xx or 3xx code

These examples demonstrate how you can use regular expressions and labels to filter metrics based on various conditions. For example:

  • The first line matches metrics with the “job” label equal to “prometheus.”

  • The second line matches metrics with metric names that start with “job:” using the =~ operator to specify a regular expression.

  • The third line matches metrics where the “job” label is either “web” or “app” using the | (pipe) symbol for logical OR.

  • The fourth line matches metrics where the “environment” label is not “production” using the != (not equal) operator.

  • The fifth line matches metrics where the “status_code” label is a 2xx or 3xx code, a common pattern for matching HTTP success and redirect responses.
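One detail worth knowing is that Prometheus regex matchers are fully anchored: the expression must match the entire label value, not just a substring. This sketch emulates that behavior with Python's re.fullmatch to show how the matchers above behave:

```python
import re


def label_matches(pattern: str, value: str) -> bool:
    # Prometheus =~ matchers are fully anchored: the regex must match
    # the whole label value, like re.fullmatch in Python.
    return re.fullmatch(pattern, value) is not None


# status_code matcher from the examples above
assert label_matches(r"2..|3..", "200")
assert label_matches(r"2..|3..", "301")
assert not label_matches(r"2..|3..", "404")

# job matcher with alternation
assert label_matches(r"web|app", "web")
assert not label_matches(r"web|app", "webserver")  # anchored: no partial match
```

The last line is the practical consequence: job=~"web|app" will not match a job named "webserver", so add an explicit .* (for example "web.*") if you want prefix matching.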

Federating Data with Prometheus

The term “federation” in Prometheus refers to one Prometheus server scraping selected time series from another Prometheus server. In our case, the central Prometheus stack federates data from “Cluster A” and “Cluster B” by querying their /federate endpoints.
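The __name__=~"job:.*" matcher in our config points at a common convention: source clusters pre-aggregate their series with recording rules whose names begin with job:, and only those compact aggregates are federated. A sketch of such a rule on a source cluster (the metric names here are illustrative):

```yaml
# rules.yml on a source cluster (illustrative names): aggregate
# per-instance series into job-level series that the 'job:.*'
# matcher on the central Prometheus will pick up.
groups:
  - name: federation-aggregates
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Federating aggregates instead of raw series keeps the central stack's ingestion volume low while still giving it the job-level numbers that matter for dashboards and alerts.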

Monitoring in Action

This setup offers numerous benefits, including:

  • Centralized Alerting: You can set up alerting rules in the central Prometheus stack to ensure you are promptly notified of issues in any of the monitored clusters.

  • Dashboarding: You can create dashboards that consolidate metrics from all three clusters, providing a comprehensive view of your infrastructure.

  • Aggregated Metrics: With federation, you can easily aggregate and query metrics from multiple clusters, simplifying troubleshooting and analysis.

  • Scalability and Management: This approach is scalable and easy to manage, making it suitable for larger, more complex Kubernetes environments.

Best Practices and Considerations

When monitoring multiple Kubernetes clusters, it’s important to consider best practices, including:

  • Security: Ensure that your setup includes proper security measures such as network policies, access control, and authentication to protect your monitoring infrastructure.
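In practice, securing the federation scrape itself usually means TLS plus authentication on the /federate endpoint. Prometheus scrape configs support tls_config and basic_auth for this; the paths and credentials below are placeholders:

```yaml
# Sketch: securing the federation scrape (placeholder paths and secrets)
- job_name: 'federate'
  metrics_path: '/federate'
  scheme: https
  tls_config:
    ca_file: /etc/prometheus/certs/ca.crt
  basic_auth:
    username: federate
    password_file: /etc/prometheus/secrets/federate-password
```

On the source clusters, the matching credentials would typically be enforced by an ingress or reverse proxy in front of Prometheus, since Prometheus itself delegates authentication to its web configuration or a proxy layer.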


Conclusion

In the world of Kubernetes, monitoring is a critical part of maintaining the health and performance of your applications. With a central Prometheus stack and federation, you can efficiently monitor and aggregate data from multiple clusters, even when deploying Prometheus with custom Helm chart configurations. This setup provides a powerful solution for managing and monitoring complex, multi-cluster Kubernetes environments.


This article is also published on Medium.