Observability Configuration

Observability for the Vald cluster lets you check the status of each Vald component by monitoring metrics, logs, and traces. With observability enabled, you can monitor and visualize the number of indexes, index creation timing, the number of RPCs, CPU and memory usage, events, and more.

This page describes the recommended way to enable observability: build an observability environment, then configure the Vald Helm chart.

Architecture

Vald conforms to the OpenTelemetry Protocol (OTLP) and does NOT depend on any commercial data format.

The OpenTelemetry Collector receives, processes, and exports telemetry data in a vendor-neutral way. All Vald components can send OTLP-compliant telemetry data, such as metrics, traces, or logs, to the OpenTelemetry Collector, which exports it to the observability tools used for monitoring and visualization.
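
The receive, process, and export flow can be pictured as a Collector pipeline. The following is a minimal sketch of a Collector configuration, not the manifest shipped with Vald: the Jaeger service address, the Prometheus scrape port, and the plaintext TLS setting are assumptions chosen for illustration.

```yaml
# Illustrative OpenTelemetry Collector configuration (NOT Vald's shipped manifest).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # Vald components send OTLP over gRPC to this port
processors:
  batch: {}                      # group telemetry into batches before exporting
exporters:
  otlp/jaeger:
    # assumed address of a Jaeger collector that accepts OTLP
    endpoint: jaeger-collector.default.svc.cluster.local:4317
    tls:
      insecure: true             # assumption: plaintext traffic inside the cluster
  prometheus:
    endpoint: 0.0.0.0:8889       # assumed port that Prometheus scrapes
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Each pipeline names its receivers, processors, and exporters, so swapping a backend only means changing the exporter entry; the Vald components keep sending OTLP to the same endpoint.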

Build an observing environment

The recommended observability environment consists of the following:

  • Cert-Manager
  • Jaeger Operator and Jaeger
  • Prometheus Operator and Prometheus
  • Grafana
  • OpenTelemetry Operator and OpenTelemetry Collector

Vald provides the default manifests and the make commands for deploying those components. Please refer to the following sections to deploy each component.

You can change parameters such as the export host by editing the manifests to suit your needs.

Cert-Manager

Cert-Manager is required to use the operators on the Kubernetes cluster.

make k8s/external/cert-manager/deploy

Jaeger Operator and Jaeger

Jaeger is helpful for monitoring trace data. Deploy it by running the following command:

make k8s/metrics/jaeger/deploy

Prometheus Operator and Prometheus

Vald recommends using Prometheus as the backend service for monitoring metrics data. Deploy it with the following command:

make k8s/metrics/prometheus/operator/deploy

Grafana

Vald recommends using Grafana to visualize metrics data.

make k8s/metrics/grafana/deploy

OpenTelemetry Operator and OpenTelemetry Collector

Vald uses the OpenTelemetry Collector to receive telemetry data and export it to the monitoring backends. The following command deploys the OpenTelemetry Collector via the OpenTelemetry Operator.

Before executing the following command, please ensure that the Prometheus Operator is running and healthy.

make k8s/otel/operator/deploy k8s/otel/collector/deploy

Deploy Observability components with a single command

If you would like to deploy all observability components with a single command, please use the following command.

make k8s/external/cert-manager/deploy k8s/monitoring/deploy

Configure Helm chart

This section shows how to set values.yaml to enable each Vald component to send its own telemetry data.

The settings to configure are the following:

  1. Enable observability feature
  2. Enable sending system metrics
  3. Enable sending trace data
  4. Set OpenTelemetry parameters

The general settings are described for convenience, but individual settings are possible for each component by editing `[component].observability`.
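
As a sketch of a per-component override, the following fragment keeps the general setting under defaults and turns on tracing for the agent only. It assumes that the agent key accepts the same observability fields as defaults; the choice of component is only illustrative.

```yaml
defaults:
  observability:
    # general setting shared by all components
    enabled: true
agent:
  observability:
    trace:
      # assumed per-component override: send trace data from the agent only
      enabled: true
```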

Enable observability feature

To enable the observability feature, set defaults.observability.enabled to true.

defaults:
  observability:
    # enable observability
    enabled: true

Enable sending system metrics

Each Vald component sends system metrics according to defaults.observability.metrics. All metrics are enabled by default. Vald recommends using the default values unless there is a specific reason to change them.

defaults:
  observability:
    metrics:
      # enable version info metrics
      enable_version_info: true
      # If enable_version_info is true, the following fields are added as labels of the version info metrics.
      version_info_labels:
        - "vald_version"
        - "server_name"
        - "git_commit"
        - "build_time"
        - "go_version"
        - "go_os"
        - "go_arch"
        - "algorithm_info"
      # enable memory metrics
      enable_memory: true
      # enable goroutine metrics
      enable_goroutine: true
      # enable cgo metrics
      enable_cgo: true

Enable sending trace data

To enable sending trace data, set defaults.observability.trace.enabled to true. The default value is false.

defaults:
  observability:
    trace:
      # enable sending trace data
      enabled: true

OpenTelemetry settings

This section shows the detailed settings for sending telemetry data.

Specify OpenTelemetry Collector endpoint

To send telemetry data, set defaults.observability.otlp.collector_endpoint to the OpenTelemetry Collector's endpoint. No default value is set.

This setting is required for sending telemetry data.

defaults:
  observability:
    otlp:
      collector_endpoint: "opentelemetry-collector-collector.default.svc.cluster.local:4317"

Specify the Telemetry attribute

You can add component information to the attributes of telemetry data by editing defaults.observability.otlp.attribute. For example, setting agent.observability.otlp.attribute.pod_name to vald-agent-ngt-0 adds target_pod: vald-agent-ngt-0 to the attributes. These attributes are set automatically from environment variables, so Vald recommends using the default values unless there is a specific reason to change them.

defaults:
  observability:
    otlp:
      attribute:
        # deployed namespace
        namespace: vald
        # pod name
        pod_name: vald-agent-ngt-0
        # deployed node name
        node_name: kube-worker01
        # service name
        service_name: vald-agent-ngt

Customize send configuration

You can modify how telemetry data is sent by changing the defaults.observability.otlp parameters.

defaults:
  observability:
    otlp:
      # Maximum duration for constructing a batch from the queue. The Processor forcefully sends available spans when timeout is reached.
      trace_batch_timeout: "1s"
      # Maximum duration for exporting trace spans
      trace_export_timeout: "1m"
      # Maximum batch size of trace spans.
      trace_max_export_batch_size: 1024
      # Maximum queue size to buffer trace spans for delayed processing.
      trace_max_queue_size: 256
      # Export interval for metrics
      metrics_export_interval: "1s"
      # Maximum duration for exporting metrics
      metrics_export_timeout: "1m"

gRPC Configuration

The interceptor configuration is required to send the metrics and trace data related to gRPC. You can add interceptors on the server side by editing defaults.server_config.servers.grpc.server.grpc.interceptors and on the client side by editing defaults.grpc.client.dial_option.interceptors.

defaults:
  server_config:
    servers:
      grpc:
        server:
          grpc:
            # gRPC Server interceptor.
            interceptors:
              - TraceInterceptor
              - MetricInterceptor
  grpc:
    # gRPC Client interceptor.
    client:
      dial_option:
        interceptors:
          - TraceInterceptor
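
Putting the settings from this page together, a minimal values.yaml might look like the following sketch. It only combines the fragments shown above and leaves everything else at its default; the collector endpoint assumes the Collector runs in the default namespace, so adjust it to your environment.

```yaml
defaults:
  observability:
    # 1. enable the observability feature
    enabled: true
    # 2. system metrics are enabled by default, so no metrics block is needed
    trace:
      # 3. enable sending trace data
      enabled: true
    otlp:
      # 4. required: where telemetry data is sent (assumed namespace: default)
      collector_endpoint: "opentelemetry-collector-collector.default.svc.cluster.local:4317"
  server_config:
    servers:
      grpc:
        server:
          grpc:
            # gRPC server interceptors for trace and metrics data
            interceptors:
              - TraceInterceptor
              - MetricInterceptor
  grpc:
    client:
      dial_option:
        # gRPC client interceptor for trace data
        interceptors:
          - TraceInterceptor
```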

Monitoring telemetry data

Telemetry data can be monitored using Grafana, Jaeger, etc. Vald publishes a sample manifest that enables Grafana and Jaeger.

You can apply it after creating an observability environment.

The default manifests do not set an ingress host.
You can access the dashboards from a browser by port forwarding, or define the ingress host yourself.
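
If you define the ingress host yourself, a Kubernetes Ingress along these lines could expose Grafana. This is a sketch only: the host name, the Service name grafana, and port 3000 are assumptions and must match what is actually deployed in your cluster.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana                    # hypothetical resource name
spec:
  rules:
    - host: grafana.example.com    # hypothetical host; replace with your own
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana      # assumed Service name of the Grafana deployment
                port:
                  number: 3000     # assumed Grafana HTTP port
```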

Cleanup

Finally, remove the Vald cluster by executing the following command.

helm uninstall vald

Then remove all observability components at once by executing the following command.

make k8s/monitoring/delete k8s/external/cert-manager/delete

Alternatively, remove the observability components individually by executing the following command.

make k8s/otel/collector/delete \
  k8s/otel/operator/delete \
  k8s/metrics/grafana/delete \
  k8s/metrics/jaeger/delete \
  k8s/metrics/prometheus/operator/delete \
  k8s/external/cert-manager/delete

See also