This page introduces best practices for operating a Vald cluster.
Since Vald agents store vector data in memory, unexpected disruption or eviction of agents may cause loss of indices. Disruption or deletion of worker nodes that host Vald agents has the same effect. To limit the loss of accuracy caused by lost indices, it is better to increase the number of nodes and pods.
However, to maximize the efficiency of search operations, it is better to have a certain amount of vectors in each NGT vector space.
We recommend having more than 3 worker nodes with enough memory for the workload.
It is better to deploy 2 or 3 Vald agent pods to each worker node.
For example, to store 100 million vectors with 128 dimensions, the required memory is
8 bytes (64-bit float) x 128 (dimensions) x 100 million x N replicas, which is roughly 100 GB x N in total.
If the index has three replicas (N=3), the whole cluster needs at least 300 GB of memory. Example cluster layouts:
- 10 worker nodes with 32 GB RAM and 3 Vald agents on each worker node (total: 320 GB RAM, 30 Vald agents)
- 20 worker nodes with 16 GB RAM and 2 Vald agents on each worker node (total: 320 GB RAM, 40 Vald agents)
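The sizing arithmetic above can be sketched in a few lines of Python. This is only a rough estimate under the same assumptions as the text (8 bytes per value, raw vector payload only); the function names are hypothetical, and the result ignores index overhead and memory used by other processes.

```python
import math


def required_memory_gb(num_vectors, dimension, replicas, bytes_per_value=8):
    """Raw vector payload in GB (10^9 bytes); index overhead is NOT included."""
    return num_vectors * dimension * bytes_per_value * replicas / 1e9


def min_worker_nodes(required_gb, node_ram_gb):
    """Smallest number of nodes whose combined RAM covers required_gb."""
    return math.ceil(required_gb / node_ram_gb)


# 100 million vectors, 128 dimensions, 3 replicas (N=3)
need = required_memory_gb(100_000_000, 128, replicas=3)
print(f"required: {need:.1f} GB")  # required: 307.2 GB
print(f"16 GB nodes needed: {min_worker_nodes(need, 16)}")  # 16 GB nodes needed: 20
```

The exact product is about 307 GB, slightly above the rounded 300 GB figure, so it is worth leaving headroom when choosing node sizes.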
On multi-tenant cluster
If you’re going to deploy Vald on a multi-tenant cluster, please keep the following in mind.
- Define PriorityClasses for the agents so that they are not evicted.
- Define separate namespaces for Vald and for the other apps.
- Then, define ResourceQuotas on the other apps’ namespace to limit their memory usage.
- For more information, see the Resource Quotas page.
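As an illustration of the recommendations above, the following Python sketch prints a Kubernetes PriorityClass and a memory ResourceQuota as a single JSON List manifest. All names and values (vald-agent-priority, other-apps, 64Gi, the priority value) are hypothetical placeholders, not settings taken from Vald; adjust them to your cluster.

```python
import json


def priority_class(name, value):
    """Cluster-scoped PriorityClass; higher values are evicted later."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": "Priority for Vald agents (hypothetical example).",
    }


def memory_quota(name, namespace, memory):
    """ResourceQuota capping total memory requests and limits in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"hard": {"requests.memory": memory, "limits.memory": memory}},
    }


# Hypothetical names and sizes, chosen only for illustration.
bundle = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        priority_class("vald-agent-priority", 1000000),
        memory_quota("other-apps-memory", "other-apps", "64Gi"),
    ],
}
print(json.dumps(bundle, indent=2))
```

The output can be saved to a file and applied with kubectl apply -f. Note that the PriorityClass only takes effect once the agent pods reference it through priorityClassName; how that is wired into your Vald deployment is not shown here.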
The logging level of Vald components can be configured by using the
[component].logging.level field in Helm Chart values.
The level must be one of “debug”, “info”, “warn”, “error”, and “fatal”.
The levels are defined in the Coding Style document.
Observability features of Vald
The observability features are useful for monitoring Vald components. Vald has various types of exporters, such as Prometheus, Jaeger, and Stackdriver. With these features, you can observe and visualize internal stats and events, such as the number of NGT indices, when indices are created, or the number of RPCs.
Enabling observability feature
When [component].observability.enabled in the Helm Chart values is set to
true, the observability features are enabled.
If observability features are enabled, the metrics will be collected periodically.
The duration can be set on
If you’d like to use the tracing feature, enable it by setting
observability.trace.enabled to
true. The sampling rate can be configured with
Monitoring Vald cluster using Prometheus and Grafana
To use the Prometheus exporter, you should enable it by setting both
server_config.metrics.prometheus.enabled set to
The exporter port and endpoint are specified in each
Now it’s ready to scrape Vald metrics. Please deploy Prometheus and Grafana to your cluster.
Prometheus can be installed using one of the following.
If you use Prometheus Operator, you need to configure it properly by following the Prometheus Configuration page. It is recommended to use the endpoints role for service discovery.
Grafana can be installed using one of the following.
You need to add your Prometheus instance as a data source.
Now you can construct your own Grafana dashboard to monitor Vald metrics. This is an example of a custom dashboard. It is based on our standard dashboard settings.
In case of manual deploy
For a manual deployment, you generally need to update your ConfigMaps first. After that, please update the image tags of the Vald components in your deployments.
In case of using Helm
If you use Helm with Vald’s chart, please update the
defaults.image.tag field and install the chart.
In case of using Vald-Helm-Operator
If you use Vald-Helm-Operator, please update the operator first.
If you’re using a
vhor resource, please update
its spec.image.tag field.
Otherwise, please update the operator’s deployment manually.
After that, please update the
image.tag field in your valdrelease resource.
The operator will automatically detect the changes and update the deployed Vald cluster.