Continuous Benchmark Tool

What is the Continuous Benchmark Tool?

Continuous Benchmark Tool allows you to continuously benchmark the Vald cluster.

Assumed use case is:

Verification with workload close to the production environment
Verification before service installation when Vald version up

Architecture

Continuous Benchmark Tool has following 2 components:

Benchmark Operator: Manages benchmark jobs
Benchmark Job: Executes CRUD request to the target Vald cluster

Benchmark component and its feature

Benchmark Operator

Manages benchmark jobs according to applied manifest.
Apply method:
- Scenario method: one manifest with multiple benchmark jobs
- Job method: one manifest with one benchmark job

Benchmark Job

Executes CRUD request to the target Vald cluster based on defined config.
Execute steps are:
1. Load dataset (valid only for HDF5 format)
2. Execute request with load dataset

Benchmark CRD

Benchmark workload can be set by applying the Kubernetes Custom Resources(CRDs), ValdBenchmarkScenarioResource or ValdBenchmarkJobResource. Benchmark Operator manages benchmark job according to the applied manifest.

ValdBenchmarkJob

ValdBenchmarkJob is used for executing single benchmark job.

And, Benchmark Operator also applies it to the Kubernetes cluster based on ValdBenchmarkScenarioResource.

main properties

Name	mandatory	Description	type	sample
target	*	target Vald cluster	object	ref: target
dataset	*	dataset information	object	ref: dataset
job_type	*	execute job type	string enum: [insert, update, upsert, remove, search, getobject, exists]	search
repetition		the number of job repetitions default: `1`	integer	1
replica		the number of job concurrent job executions default: `1`	integer	2
rps		designed request per sec to the target cluster default: `1000`	integer	1000
concurrency_limit		goroutine count limit for rps adjustment default: `200`	integer	20
ttl_seconds_after_finished		time until deletion of Pod after job end default: `600`	integer	120
insert_config		request config for insert job	object	ref: config
update_config		request config for update job	object	ref: config
upsert_config		request config for upsert job	object	ref: config
search_config		request config for search job	object	ref: config
remove_config		request config for remove job	object	ref: config
object_config		request config for object job	object	ref: config
client_config		gRPC client config for running benchmark job Tune if can not getting the expected performance with default config.	object	ref: defaults.grpc
server_config		server config for benchmark job pod Tune if can not getting the expected performance with default config.	object	ref: defaults.server_config

target

target Vald cluster information
type: object

property	mandatory	description	type	sample
host	*	target cluster’s host	string	localhost
port	*	target cluster’s port	integer	8081

dataset

dataset which is used for executing job operation
type: object

property	mandatory	description	type	sample
name	*	dataset name	string enum: [fashion-mnist, original]	fashion-mnist
group	*	group name	string enum: [train, test, neighbors]	train
indexes	*	amount of index size	integer	1000000
range	*	range of indexes to be used (if there are many indexes, the range will be corrected on the job side)	object	-
range.start	*	start of range	integer	1
range.end	*	end of range	integer	1000000
url		the dataset url. It should be set when set `name` as `original`	string

insert_config

rpc config for insert request
type: object

property	mandatory	description	type	sample
skip_strict_exist_check		Check whether the same vector is already inserted or not. The ID should be unique if the value is `true`.	bool	false
timestamp		The timestamp of the vector inserted. If it is N/A, the current time will be used.	string	1707272658

update_config

rpc config for update request
type: object

property	description	type	sample
skip_strict_exist_check	Check whether the same vector is already inserted or not. The ID should be unique if the value is `true`.	bool	false
timestamp	The timestamp of the vector inserted. If it is N/A, the current time will be used.	string	1707272658
disable_balanced_update	A flag to disable balanced update (split remove -> insert operation) during update operation.	bool	false

upsert_config

rpc config for upsert request
type: object

property	description	type	sample
skip_strict_exist_check	Check whether the same vector is already inserted or not. The ID should be unique if the value is `true`.	bool	false
timestamp	The timestamp of the vector inserted. If it is N/A, the current time will be used.	string	1707272658
disable_balanced_update	A flag to disable balanced update (split remove -> insert operation) during update operation.	bool	false

upsert_config

rpc config for search request
type: object

property	mandatory	description	type	sample
radius		The search radius. default: `-1`	number	-1
epsilon		The search coefficient. default: `0.05`	number	0.05
num	*	The maximum number of results to be returned.	integer	10
min_num		The minimum number of results to be returned.	integer	5
timeout		Search timeout in nanoseconds default: `10s`	string	3s
enable_linear_search		A flag to enable linear search operation for estimating search recall. If it is `true`, search operation with linear operation will execute.	bool	false
aggregation_algorithm		The search aggregation algorithm option. default: `Unknown`	string enum: [“Unknown”, “ConcurrentQueue”, “SortSlice”, “SortPoolSlice”, “PairingHeap”]

remove_config

rpc config for remove request
type: object

property	mandatory	description	type	sample
skip_strict_exist_check		Check whether the same vector is already inserted or not. The ID should be unique if the value is `true`.	bool	false
timestamp		The timestamp of the vector inserted. If it is N/A, the current time will be used.	string	1707272658

object_config

rpc config for get object request
type: object

property	mandatory	description	type	sample
filter_config.targets		filter target host and port for bypassing filter component.	[]object

ValdBenchmarkScenario

ValdBenchmarkScenario is used for executing single or multiple benchmark job.

Benchmark Operator decomposes manifest and creates benchmark resources one by one. The target and dataset property are the global config for scenario, they can be overwritten when each job has own config.

main properties

property	mandatory	description	type	sample
target	*	target Vald cluster information It will be overwritten when each job has own config	object	ref: target
dataset	*	dataset information It will be overwritten when each job has own config	object	ref: dataset
jobs	*	benchmark job config The jobs written above will be executed in order.	object	ref: benchmark job

Deploy Benchmark Operator

Continuous benchmark operator can be applied with Helm same as Vald cluster.

It requires ValdBenchmarkOperatorRelease for deploying vald-benchmark-operator.

It is not must to apply, so please edit and apply as necessary.

`job.client_config` is used as the default gRPC client configuration for all benchmark jobs.
Although it is possible to override using `ValdBenchmarkScenarioRelease` or `ValdBenchmarkJobRelease`, we recommend configuring common setting items here.

Sample ValdBenchmarkOperatorRelease YAML

# @schema {"name": "name", "type": "string"}
# name -- name of the deployment
name: vald-benchmark-operator
# @schema {"name": "time_zone", "type": "string"}
# time_zone -- time_zone
time_zone: ""
# @schema {"name": "image", "type": "object"}
image:
  # @schema {"name": "image.repository", "type": "string"}
  # image.repository -- image repository
  repository: vdaas/vald-benchmark-operator
  # @schema {"name": "image.tag", "type": "string"}
  # image.tag -- image tag
  tag: v1.7.5
  # @schema {"name": "image.pullPolicy", "type": "string", "enum": ["Always", "Never", "IfNotPresent"]}
  # image.pullPolicy -- image pull policy
  pullPolicy: Always
# @schema {"name": "job", "type": "object"}
job:
  # @schema {"name": "job.image", "type": "object"}
  image:
    # @schema {"name": "job.image.repository", "type": "string"}
    # image.repository -- job image repository
    repository: vdaas/vald-benchmark-job
    # @schema {"name": "job.image.tag", "type": "string"}
    # image.tag -- image tag for job docker image
    tag: v1.7.12
    # @schema {"name": "job.image.pullPolicy", "type": "string", "enum": ["Always", "Never", "IfNotPresent"]}
    # image.pullPolicy -- image pull policy
    pullPolicy: Always
  # @schema {"name": "job.client_config", "type": "object"}
  # client_config -- gRPC client config for request to the Vald cluster
  client_config:
    # @schema {"name": "job.client_config.addrs", "type": "array", "items": {"type": "string"}}
    # job.client_config.addrs -- gRPC client addresses
    addrs: []
    # @schema {"name": "job.client_config.health_check_duration", "type": "string"}
    # job.client_config.health_check_duration -- gRPC client health check duration
    health_check_duration: "1s"
    # @schema {"name": "job.client_config.connection_pool", "type": "object"}
    connection_pool:
      # @schema {"name": "job.client_config.connection_pool.enable_dns_resolver", "type": "boolean"}
      # job.client_config.connection_pool.enable_dns_resolver -- enables gRPC client connection pool dns resolver, when enabled vald uses ip handshake exclude dns discovery which improves network performance
      enable_dns_resolver: true
      # @schema {"name": "job.client_config.connection_pool.enable_rebalance", "type": "boolean"}
      # job.client_config.connection_pool.enable_rebalance -- enables gRPC client connection pool rebalance
      enable_rebalance: true
      # @schema {"name": "job.client_config.connection_pool.rebalance_duration", "type": "string"}
      # job.client_config.connection_pool.rebalance_duration -- gRPC client connection pool rebalance duration
      rebalance_duration: 30m
      # @schema {"name": "job.client_config.connection_pool.size", "type": "integer"}
      # job.client_config.connection_pool.size -- gRPC client connection pool size
      size: 3
      # @schema {"name": "job.client_config.connection_pool.old_conn_close_duration", "type": "string"}
      # job.client_config.connection_pool.old_conn_close_duration -- makes delay before gRPC client connection closing during connection pool rebalance
      old_conn_close_duration: "2m"
    # @schema {"name": "job.client_config.backoff", "type": "object", "anchor": "backoff"}
    backoff:
      # @schema {"name": "job.client_config.backoff.initial_duration", "type": "string"}
      # job.client_config.backoff.initial_duration -- gRPC client backoff initial duration
      initial_duration: 5ms
      # @schema {"name": "job.client_config.backoff.backoff_time_limit", "type": "string"}
      # job.client_config.backoff.backoff_time_limit -- gRPC client backoff time limit
      backoff_time_limit: 5s
      # @schema {"name": "job.client_config.backoff.maximum_duration", "type": "string"}
      # job.client_config.backoff.maximum_duration -- gRPC client backoff maximum duration
      maximum_duration: 5s
      # @schema {"name": "job.client_config.backoff.jitter_limit", "type": "string"}
      # job.client_config.backoff.jitter_limit -- gRPC client backoff jitter limit
      jitter_limit: 100ms
      # @schema {"name": "job.client_config.backoff.backoff_factor", "type": "number"}
      # job.client_config.backoff.backoff_factor -- gRPC client backoff factor
      backoff_factor: 1.1
      # @schema {"name": "job.client_config.backoff.retry_count", "type": "integer"}
      # job.client_config.backoff.retry_count -- gRPC client backoff retry count
      retry_count: 100
      # @schema {"name": "job.client_config.backoff.enable_error_log", "type": "boolean"}
      # job.client_config.backoff.enable_error_log -- gRPC client backoff log enabled
      enable_error_log: true
    # @schema {"name": "job.client_config.circuit_breaker", "type": "object"}
    circuit_breaker:
      # @schema {"name": "job.client_config.circuit_breaker.closed_error_rate", "type": "number"}
      # job.client_config.circuit_breaker.closed_error_rate -- gRPC client circuitbreaker closed error rate
      closed_error_rate: 0.7
      # @schema {"name": "job.client_config.circuit_breaker.half_open_error_rate", "type": "number"}
      # job.client_config.circuit_breaker.half_open_error_rate -- gRPC client circuitbreaker half-open error rate
      half_open_error_rate: 0.5
      # @schema {"name": "job.client_config.circuit_breaker.min_samples", "type": "integer"}
      # job.client_config.circuit_breaker.min_samples -- gRPC client circuitbreaker minimum sampling count
      min_samples: 1000
      # @schema {"name": "job.client_config.circuit_breaker.open_timeout", "type": "string"}
      # job.client_config.circuit_breaker.open_timeout -- gRPC client circuitbreaker open timeout
      open_timeout: "1s"
      # @schema {"name": "job.client_config.circuit_breaker.closed_refresh_timeout", "type": "string"}
      # job.client_config.circuit_breaker.closed_refresh_timeout -- gRPC client circuitbreaker closed refresh timeout
      closed_refresh_timeout: "10s"
    # @schema {"name": "job.client_config.call_option", "type": "object"}
    call_option:
      # @schema {"name": "job.client_config.wait_for_ready", "type": "boolean"}
      # job.client_config.call_option.wait_for_ready -- gRPC client call option wait for ready
      wait_for_ready: true
      # @schema {"name": "job.client_config.max_retry_rpc_buffer_size", "type": "integer"}
      # job.client_config.call_option.max_retry_rpc_buffer_size -- gRPC client call option max retry rpc buffer size
      max_retry_rpc_buffer_size: 0
      # @schema {"name": "job.client_config.max_recv_msg_size", "type": "integer"}
      # job.client_config.call_option.max_recv_msg_size -- gRPC client call option max receive message size
      max_recv_msg_size: 0
      # @schema {"name": "job.client_config.max_send_msg_size", "type": "integer"}
      # job.client_config.call_option.max_send_msg_size -- gRPC client call option max send message size
      max_send_msg_size: 0
    # @schema {"name": "job.client_config.dial_option", "type": "object"}
    dial_option:
      # @schema {"name": "job.client_config.dial_option.write_buffer_size", "type": "integer"}
      # job.client_config.dial_option.write_buffer_size -- gRPC client dial option write buffer size
      write_buffer_size: 0
      # @schema {"name": "job.client_config.dial_option.read_buffer_size", "type": "integer"}
      # job.client_config.dial_option.read_buffer_size -- gRPC client dial option read buffer size
      read_buffer_size: 0
      # @schema {"name": "job.client_config.dial_option.initial_window_size", "type": "integer"}
      # job.client_config.dial_option.initial_window_size -- gRPC client dial option initial window size
      initial_window_size: 0
      # @schema {"name": "job.client_config.dial_option.initial_connection_window_size", "type": "integer"}
      # job.client_config.dial_option.initial_connection_window_size -- gRPC client dial option initial connection window size
      initial_connection_window_size: 0
      # @schema {"name": "job.client_config.dial_option.max_msg_size", "type": "integer"}
      # job.client_config.dial_option.max_msg_size -- gRPC client dial option max message size
      max_msg_size: 0
      # @schema {"name": "job.client_config.dial_option.backoff_max_delay", "type": "string"}
      # job.client_config.dial_option.backoff_max_delay -- gRPC client dial option max backoff delay
      backoff_max_delay: "120s"
      # @schema {"name": "job.client_config.dial_option.backoff_base_delay", "type": "string"}
      # job.client_config.dial_option.backoff_base_delay -- gRPC client dial option base backoff delay
      backoff_base_delay: "1s"
      # @schema {"name": "job.client_config.dial_option.backoff_multiplier", "type": "number"}
      # job.client_config.dial_option.backoff_multiplier -- gRPC client dial option base backoff delay
      backoff_multiplier: 1.6
      # @schema {"name": "job.client_config.dial_option.backoff_jitter", "type": "number"}
      # job.client_config.dial_option.backoff_jitter -- gRPC client dial option base backoff delay
      backoff_jitter: 0.2
      # @schema {"name": "job.client_config.dial_option.min_connection_timeout", "type": "string"}
      # job.client_config.dial_option.min_connection_timeout -- gRPC client dial option minimum connection timeout
      min_connection_timeout: "20s"
      # @schema {"name": "job.client_config.dial_option.enable_backoff", "type": "boolean"}
      # job.client_config.dial_option.enable_backoff -- gRPC client dial option backoff enabled
      enable_backoff: false
      # @schema {"name": "job.client_config.dial_option.insecure", "type": "boolean"}
      # job.client_config.dial_option.insecure -- gRPC client dial option insecure enabled
      insecure: true
      # @schema {"name": "job.client_config.dial_option.timeout", "type": "string"}
      # job.client_config.dial_option.timeout -- gRPC client dial option timeout
      timeout: ""
      # @schema {"name": "job.client_config.dial_option.interceptors", "type": "array", "items": {"type": "string", "enum": ["TraceInterceptor"]}}
      # job.client_config.dial_option.interceptors -- gRPC client interceptors
      interceptors: []
      # @schema {"name": "job.client_config.dial_option.net", "type": "object", "anchor": "net"}
      net:
        # @schema {"name": "job.client_config.dial_option.net.dns", "type": "object"}
        dns:
          # @schema {"name": "job.client_config.dial_option.net.dns.cache_enabled", "type": "boolean"}
          # job.client_config.dial_option.net.dns.cache_enabled -- gRPC client TCP DNS cache enabled
          cache_enabled: true
          # @schema {"name": "job.client_config.dial_option.net.dns.refresh_duration", "type": "string"}
          # job.client_config.dial_option.net.dns.refresh_duration -- gRPC client TCP DNS cache refresh duration
          refresh_duration: 30m
          # @schema {"name": "job.client_config.dial_option.net.dns.cache_expiration", "type": "string"}
          # job.client_config.dial_option.net.dns.cache_expiration -- gRPC client TCP DNS cache expiration
          cache_expiration: 1h
        # @schema {"name": "job.client_config.dial_option.net.dialer", "type": "object"}
        dialer:
          # @schema {"name": "job.client_config.dial_option.net.dialer.timeout", "type": "string"}
          # job.client_config.dial_option.net.dialer.timeout -- gRPC client TCP dialer timeout
          timeout: ""
          # @schema {"name": "job.client_config.dial_option.net.dialer.keepalive", "type": "string"}
          # job.client_config.dial_option.net.dialer.keepalive -- gRPC client TCP dialer keep alive
          keepalive: ""
          # @schema {"name": "job.client_config.dial_option.net.dialer.dual_stack_enabled", "type": "boolean"}
          # job.client_config.dial_option.net.dialer.dual_stack_enabled -- gRPC client TCP dialer dual stack enabled
          dual_stack_enabled: true
        # @schema {"name": "job.client_config.dial_option.net.socket_option", "type": "object"}
        socket_option:
          # @schema {"name": "job.client_config.dial_option.net.socket_option.reuse_port", "type": "boolean"}
          # job.client_config.dial_option.net.socket_option.reuse_port -- server listen socket option for reuse_port functionality
          reuse_port: true
          # @schema {"name": "job.client_config.dial_option.net.socket_option.reuse_addr", "type": "boolean"}
          # job.client_config.dial_option.net.socket_option.reuse_addr -- server listen socket option for reuse_addr functionality
          reuse_addr: true
          # @schema {"name": "job.client_config.dial_option.net.socket_option.tcp_fast_open", "type": "boolean"}
          # job.client_config.dial_option.net.socket_option.tcp_fast_open -- server listen socket option for tcp_fast_open functionality
          tcp_fast_open: true
          # @schema {"name": "job.client_config.dial_option.net.socket_option.tcp_no_delay", "type": "boolean"}
          # job.client_config.dial_option.net.socket_option.tcp_no_delay -- server listen socket option for tcp_no_delay functionality
          tcp_no_delay: true
          # @schema {"name": "job.client_config.dial_option.net.socket_option.tcp_cork", "type": "boolean"}
          # job.client_config.dial_option.net.socket_option.tcp_cork -- server listen socket option for tcp_cork functionality
          tcp_cork: false
          # @schema {"name": "job.client_config.dial_option.net.socket_option.tcp_quick_ack", "type": "boolean"}
          # job.client_config.dial_option.net.socket_option.tcp_quick_ack -- server listen socket option for tcp_quick_ack functionality
          tcp_quick_ack: true
          # @schema {"name": "job.client_config.dial_option.net.socket_option.tcp_defer_accept", "type": "boolean"}
          # job.client_config.dial_option.net.socket_option.tcp_defer_accept -- server listen socket option for tcp_defer_accept functionality
          tcp_defer_accept: true
          # @schema {"name": "job.client_config.dial_option.net.socket_option.ip_transparent", "type": "boolean"}
          # job.client_config.dial_option.net.socket_option.ip_transparent -- server listen socket option for ip_transparent functionality
          ip_transparent: false
          # @schema {"name": "job.client_config.dial_option.net.socket_option.ip_recover_destination_addr", "type": "boolean"}
          # job.client_config.dial_option.net.socket_option.ip_recover_destination_addr -- server listen socket option for ip_recover_destination_addr functionality
          ip_recover_destination_addr: false
      # @schema {"name": "job.client_config.dial_option.keepalive", "type": "object"}
      keepalive:
        # @schema {"name": "job.client_config.dial_option.keepalive.time", "type": "string"}
        # job.client_config.dial_option.keepalive.time -- gRPC client keep alive time
        time: "120s"
        # @schema {"name": "job.client_config.dial_option.keepalive.timeout", "type": "string"}
        # job.client_config.dial_option.keepalive.timeout -- gRPC client keep alive timeout
        timeout: "30s"
        # @schema {"name": "job.client_config.dial_option.keepalive.permit_without_stream", "type": "boolean"}
        # job.client_config.dial_option.keepalive.permit_without_stream -- gRPC client keep alive permit without stream
        permit_without_stream: true
# @schema {"name": "resources", "type": "object"}
# resources -- kubernetes resources of pod
resources:
  # @schema {"name": "resources.limits", "type": "object"}
  limits:
    cpu: 300m
    memory: 300Mi
  # @schema {"name": "resources.requests", "type": "object"}
  requests:
    cpu: 200m
    memory: 200Mi
# @schema {"name": "logging", "type": "object"}
logging:
  # @schema {"name": "logging.logger", "type": "string", "enum": ["glg", "zap"]}
  # logging.logger -- logger name.
  logger: glg
  # @schema {"name": "logging.level", "type": "string", "enum": ["debug", "info", "warn", "error", "fatal"]}
  # logging.level -- logging level.
  level: debug
  # @schema {"name": "logging.format", "type": "string", "enum": ["raw", "json"]}
  # logging.format -- logging format.
  format: raw

For more details of the configuration of vald-benchmark-operator-release, please refer to here

Add Vald repo into the helm repo

helm repo add vald https://vdaas.vald.org

Deploy vald-benchmark-operator-release

helm install vald-benchmark-operator-release vald/vald-benchmark-operator

Apply vbor.yaml (optional)
```
kubectl apply -f vbor.yaml
```

Running Continuous Benchmarks

After deploy the benchmark operator, you can execute continuous benchmark by applying ValdBenchmarkScenarioRelease or ValdBenchmarkJobRelease.

Please configure designed benchmark and apply by kubectl command.

The sample manifests are here.

Monitoring Benchmark Job Metrics

Metrics monitoring can be set in the same way as Vald cluster. For information on building a monitoring environment, please refer to Observability Configuration.

To monitor metrics about continuous benchmarking, please edit ValdBenchmarkOperatorRelease as follows:

---
# @schema {"name": "observability", "type": "object"}
observability:
  # @schema {"name": "observability.enabled", "type": "boolean"}
  enabled: true
  # @schema {"name": "observability.otlp", "type": "object"}
  otlp:
    # @schema {"name": "observability.otlp.collector_endpoint", "type": "string"}
    # Please confirm correct collector_endpoint
    collector_endpoint: "opentelemetry-collector-collector.default.svc.cluster.local:4317"
  trace:
    # @schema {"name": "observability.trace.enabled", "type": "boolean"}
    enabled: true

After apply it, the metrics can be shown on the Grafana dashboard like as below.

Grafana dashboard showing benchmark metrics

Continuous Benchmark Tool

What is the Continuous Benchmark Tool?

Architecture

Benchmark component and its feature

Benchmark Operator

Benchmark Job

Benchmark CRD

ValdBenchmarkJob

ValdBenchmarkScenario

Deploy Benchmark Operator

Running Continuous Benchmarks

Monitoring Benchmark Job Metrics

See also