renovate/docs/usage/examples/opentelemetry.md

230 lines
9.3 KiB
Markdown

# OpenTelemetry
Requirements:
- docker-compose
## Prepare setup
Create a `docker-compose.yaml` and `otel-collector-config.yml` file as seen below in a folder.
```yaml title="docker-compose.yaml"
version: '3'
services:
# Jaeger
jaeger:
image: jaegertracing/all-in-one:1.64.0
ports:
- '16686:16686'
- '4317'
otel-collector:
image: otel/opentelemetry-collector-contrib:0.116.1
command: ['--config=/etc/otel-collector-config.yml']
volumes:
- ./otel-collector-config.yml:/etc/otel-collector-config.yml
ports:
- '1888:1888' # pprof extension
- '13133:13133' # health_check extension
- '55679:55679' # zpages extension
- '4318:4318' # OTLP HTTP
- '4317:4317' # OTLP GRPC
- '9123:9123' # Prometheus exporter
depends_on:
- jaeger
```
```yaml title="otel-collector-config.yml"
receivers:
otlp:
protocols:
grpc:
http:
exporters:
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
logging:
prometheus:
endpoint: '0.0.0.0:9123'
processors:
batch:
spanmetrics:
metrics_exporter: prometheus
latency_histogram_buckets: [10ms, 100ms, 250ms, 1s, 30s, 1m, 5m]
dimensions:
- name: http.method
- name: http.status_code
- name: http.host
dimensions_cache_size: 1000
aggregation_temporality: 'AGGREGATION_TEMPORALITY_CUMULATIVE'
extensions:
health_check:
pprof:
zpages:
service:
extensions: [pprof, zpages, health_check]
pipelines:
traces:
receivers: [otlp]
exporters: [otlp/jaeger, logging]
processors: [spanmetrics, batch]
metrics:
receivers: [otlp]
exporters: [prometheus]
```
Start setup using this command inside the folder containing the files created in the earlier steps:
```
docker-compose up
```
This command will start an [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector-contrib) and an instance of [Jaeger](https://www.jaegertracing.io/).
Jaeger will be now reachable under [http://localhost:16686](http://localhost:16686).
## Run Renovate with OpenTelemetry
To start Renovate with OpenTelemetry enabled run following command, after pointing to your `config.js` config file:
```
docker run \
--rm \
-e OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
-v "/path/to/your/config.js:/usr/src/app/config.js" \
renovate/renovate:latest
```
You should now see `trace_id` and `span_id` fields in the logs.
```
INFO: Repository finished (repository=org/example)
"durationMs": 5574,
"trace_id": "f9a4c33852333fc2a0fbdc163100c987",
"span_id": "4ac1323eeaee
```
### Traces
Open now Jaeger under [http://localhost:16686](http://localhost:16686).
You should now be able to pick `renovate` under in the field `service` field.
![service picker](../assets/images/opentelemetry_pick_service.png)
Select `Find Traces` to search for all Renovate traces and then select one of the found traces to open the trace view.
![pick trace](../assets/images/opentelemetry_choose_trace.png)
You should be able to see now the full trace view which shows each HTTP request and internal spans.
![trace view](../assets/images/opentelemetry_trace_viewer.png)
### Metrics
Additional to the received traces some metrics are calculated.
This is achieved using the [spanmetricsprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/spanmetricsprocessor).
The previous implemented setup will produce following metrics, which are exposed under [http://localhost:9123/metrics](http://localhost:9123/metrics):
```
# HELP calls_total
# TYPE calls_total counter
### Example of internal spans
calls_total{operation="renovate repository",service_name="renovate",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET"} 3
calls_total{operation="run",service_name="renovate",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET"} 1
### Example of http calls from Renovate to external services
calls_total{http_host="api.github.com:443",http_method="POST",http_status_code="200",operation="HTTPS POST",service_name="renovate",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 9
...
# HELP latency
# TYPE latency histogram
### Example of internal spans
latency_bucket{operation="renovate repository",service_name="renovate",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET",le="0.1"} 0
...
latency_bucket{operation="renovate repository",service_name="renovate",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET",le="9.223372036854775e+12"} 3
latency_bucket{operation="renovate repository",service_name="renovate",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET",le="+Inf"} 3
latency_sum{operation="renovate repository",service_name="renovate",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET"} 30947.4689
latency_count{operation="renovate repository",service_name="renovate",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET"} 3
...
### Example of http calls from Renovate to external services
latency_bucket{http_host="api.github.com:443",http_method="POST",http_status_code="200",operation="HTTPS POST",service_name="renovate",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET",le="0.1"} 0
...
latency_bucket{http_host="api.github.com:443",http_method="POST",http_status_code="200",operation="HTTPS POST",service_name="renovate",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET",le="250"} 3
latency_bucket{http_host="api.github.com:443",http_method="POST",http_status_code="200",operation="HTTPS POST",service_name="renovate",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET",le="9.223372036854775e+12"} 9
latency_bucket{http_host="api.github.com:443",http_method="POST",http_status_code="200",operation="HTTPS POST",service_name="renovate",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET",le="+Inf"} 9
latency_sum{http_host="api.github.com:443",http_method="POST",http_status_code="200",operation="HTTPS POST",service_name="renovate",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 2306.1385999999998
latency_count{http_host="api.github.com:443",http_method="POST",http_status_code="200",operation="HTTPS POST",service_name="renovate",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 9
```
The [spanmetricsprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/spanmetricsprocessor) creates two sets of metrics.
#### Calls metric
At first there are the `calls_total` metrics which display how often specific trace spans have been observed.
For example:
`calls_total{operation="renovate repository",service_name="renovate",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET"} 3` signals that 3 repositories have been renovated.
`calls_total{operation="run",service_name="renovate",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET"} 1` represents how often Renovate has been run.
If we combine this using the PrometheusQueryLanguage ( PromQL ), we can calculate the average count of repositories each Renovate run handles.
```
calls_total{operation="renovate repository",service_name="renovate"} / calls_total{operation="run",service_name="renovate"}
```
This metrics is also for spans generated by http calls:
```yaml
calls_total{http_host="registry.terraform.io:443",http_method="GET",http_status_code="200",operation="HTTPS GET",service_name="renovate",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 5
```
#### Latency buckets
The second class of metrics exposed are the latency focused latency buckets which allow to create [heatmaps](https://grafana.com/docs/grafana/latest/basics/intro-histograms/#heatmaps).
A request is added to a backed if the latency is bigger than the bucket value (`le`). `request_duration => le`
As an example if we receive a request which need `1.533s` to complete get following metrics:
```
latency_bucket{http_host="api.github.com:443",le="0.1"} 0
latency_bucket{http_host="api.github.com:443",le="1"} 0
latency_bucket{http_host="api.github.com:443",le="2"} 1
latency_bucket{http_host="api.github.com:443",le="6"} 1
latency_bucket{http_host="api.github.com:443",le="10"} 1
latency_bucket{http_host="api.github.com:443",le="100"} 1
latency_bucket{http_host="api.github.com:443",le="250"} 1
latency_bucket{http_host="api.github.com:443",le="9.223372036854775e+12"} 1
latency_bucket{http_host="api.github.com:443",le="+Inf"} 1
latency_sum{http_host="api.github.com:443"} 1.533
latency_count{http_host="api.github.com:443"} 1
```
Now we have another request which this time takes 10s to complete:
```
latency_bucket{http_host="api.github.com:443",le="0.1"} 0
latency_bucket{http_host="api.github.com:443",le="1"} 0
latency_bucket{http_host="api.github.com:443",le="2"} 1
latency_bucket{http_host="api.github.com:443",le="6"} 1
latency_bucket{http_host="api.github.com:443",le="10"} 2
latency_bucket{http_host="api.github.com:443",le="100"} 2
latency_bucket{http_host="api.github.com:443",le="250"} 2
latency_bucket{http_host="api.github.com:443",le="9.223372036854775e+12"} 2
latency_bucket{http_host="api.github.com:443",le="+Inf"} 2
latency_sum{http_host="api.github.com:443"} 11.533
latency_count{http_host="api.github.com:443"} 2
```
More about the functionality can be found on the Prometheus page for [metric types](https://prometheus.io/docs/concepts/metric_types/#histogram).