OpenTelemetry: the observability standard your team should adopt now

Week 10 of 52 · Pillar: Observability · Estimated read: 17 min

May 04, 2026

Last week, this series argued that the gap between telemetry and observability is closed by correlation infrastructure — the ability to move from a metric alert to a relevant trace to the specific log entries in a single workflow, in seconds. That correlation depends on consistent, standards-compliant field naming across every pillar of your observability stack.

Which raises an obvious question. Whose standards?

For most of the last fifteen years, the answer was: your observability vendor’s standards. You instrumented your applications using the client libraries they provided. You used the agent they shipped. You structured your logs according to their conventions. And in exchange for that vendor lock-in, you got a product that worked out of the box — until you decided to switch vendors, at which point the cost of migration came due in full. Every integration rewritten. Every dashboard rebuilt. Every alert recalibrated. Every custom metric re-instrumented. An observability vendor migration was, until recently, one of the most painful infrastructure projects an engineering organisation could undertake.

OpenTelemetry changes that calculus permanently. And it is the most important observability investment your team can make in 2026 — not because of what it does today, but because of what it prevents tomorrow.

What OpenTelemetry actually is

OpenTelemetry — OTel, for short — is a Cloud Native Computing Foundation project that provides a vendor-neutral specification, API, and SDK for generating, processing, and exporting telemetry data. It merged two earlier efforts (OpenTracing and OpenCensus) into a single unified standard, and it has become the de facto industry standard for application instrumentation across every major programming language and every major observability vendor.

The word that matters most in that definition is “vendor-neutral.” OpenTelemetry is not a product. It is not an observability platform. It is not competing with Dynatrace, Splunk, Datadog, Grafana, or New Relic. It is the open specification that all of them now support as an ingestion format. Your application instruments using OTel APIs, emits telemetry in OTel’s protocol (OTLP), and that telemetry can be sent to any OTel-compatible backend — including, critically, more than one backend at once.

The three things OpenTelemetry provides:

API → language-specific interfaces your application code calls to emit spans, metrics, and log records. The API is stable and vendor-independent — code instrumented with OTel APIs does not change when you switch observability backends.
SDK → the implementation that sits behind the API, handling context propagation, sampling, batching, and export. You configure the SDK with which backend to send telemetry to. You do not reinstrument your code to change backends.
Collector → a separately deployed agent (or gateway) that receives telemetry from applications, processes it (filtering, enrichment, sampling), and forwards it to one or more backends. This is the piece that makes multi-backend export possible and that handles the gnarly integration concerns between your applications and your observability platforms.

All three are stable, production-grade, and supported by every major observability vendor. Tracing reached general availability first, metrics followed, and logging — the last pillar to stabilise — is now on production-ready footing across all major language SDKs. The standard is no longer experimental. It is the default.

The one-sentence case for OpenTelemetry: it decouples the cost of instrumenting your applications from the cost of choosing your observability platform, and once your instrumentation is in OTel, switching backends becomes a configuration change rather than an engineering project.
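That decoupling can be shown with a toy model. This is plain Python, not the real OTel packages, and every name in it (Exporter, InMemoryExporter, CountingExporter, Tracer, charge_card) is hypothetical. The only point is that the application function never changes when the backend does.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Span:
    name: str
    attributes: dict[str, str] = field(default_factory=dict)


class Exporter(Protocol):
    """Anything that can ship a finished span somewhere (hypothetical)."""
    def export(self, span: Span) -> None: ...


class InMemoryExporter:
    """Stand-in for backend A: keeps every span in a list."""
    def __init__(self) -> None:
        self.spans: list[Span] = []

    def export(self, span: Span) -> None:
        self.spans.append(span)


class CountingExporter:
    """Stand-in for backend B: only counts spans per name."""
    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def export(self, span: Span) -> None:
        self.counts[span.name] = self.counts.get(span.name, 0) + 1


class Tracer:
    """The 'API' that application code sees. The exporter plays the role
    of SDK configuration and is injected from outside."""
    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    def record(self, name: str, **attributes: str) -> None:
        self._exporter.export(Span(name, dict(attributes)))


def charge_card(tracer: Tracer, order_id: str) -> None:
    """Application code: knows the Tracer, never the backend."""
    tracer.record("charge_card", order_id=order_id)


backend_a = InMemoryExporter()
backend_b = CountingExporter()
charge_card(Tracer(backend_a), "o-1")  # configured for backend A
charge_card(Tracer(backend_b), "o-1")  # same code, backend B
```

Swapping backend_a for backend_b touched only the line that builds the Tracer. In a real deployment that line is SDK or Collector configuration, which is why the switch stops being an application change.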

Why “now” — the strategic timing case

Teams often react to OpenTelemetry with the same logic they apply to any new standard: “it sounds good, but we will adopt it when it is more mature / when we have more time / when our next observability contract is up for renewal.” This is the wrong framing. There are three reasons why adoption now is strategically different from adoption later.

Reason 1 — your vendor already supports it

Every major observability platform now accepts OTLP as a native ingestion format: Splunk Observability Cloud, Dynatrace, Datadog, Grafana Cloud, New Relic, Honeycomb, and Elastic among them. This was not true three years ago. Vendor compatibility is no longer a barrier. There is no longer a technical reason to use a vendor-specific agent for new instrumentation if OTel instrumentation will work equally well with your current vendor and every plausible replacement.

Reason 2 — the cost of migration compounds over time

Every new service your team ships that is instrumented with vendor-specific libraries adds to the migration debt you will eventually pay. A fifty-service environment fully instrumented with vendor-specific code is a six-month migration project. For the same fifty services instrumented with OTel, the instrumentation side of the move is a configuration change, and what remains is dashboards, alerts, and validation. The costs are asymmetric: the team that adopts OTel late pays a migration cost that grows with every service shipped in the interim, while the team that adopts OTel early pays a small upfront cost and never pays to re-instrument again.

Reason 3 — your next vendor decision is already partly made

For teams considering an observability platform migration — the Splunk Observability Cloud to Dynatrace evaluation that many organisations are running right now is a specific example — the OTel adoption decision is effectively part of the migration. Committing to OTel before or during the migration means the instrumentation work is done once and then portable forever. Committing to the new vendor’s native agent during the migration locks you in for another cycle and sets up the same migration pain the next time you evaluate.

The observability vendor migration cost, with and without OTel:

Without OTel (vendor-native instrumentation):

  Phase                          Engineering effort
  ─────────────────────────────  ──────────────────────
  Rewrite application            4–8 weeks per language
  instrumentation                × number of languages

  Rebuild dashboards             1–2 weeks per critical
  in new vendor                  service

  Recalibrate alert              2–4 weeks for SLO/burn
  thresholds in new tool         rate alerts

  Re-instrument custom           1–3 weeks per service
  metrics

  Parallel-run both              2–3 months of duplicate
  platforms for validation       vendor cost

Total for 50-service environment: 4–6 months, 2–4 engineers

────────────────────────────────────────────────────────────

With OTel already in place:

  Phase                          Engineering effort
  ─────────────────────────────  ──────────────────────
  Change OTLP export             1 day per Collector
  endpoint in Collector          instance (or IaC
                                 change rolled across)

  Validate data parity           1–2 weeks total

  Rebuild dashboards             1–2 weeks per critical
  (still manual)                 service

  Calibrate alerts in            1–2 weeks for SLO/burn
  new vendor UI                  rate alerts

  Parallel-run both              Data dual-writes to
  platforms for validation       both backends from
                                 Collector — 2–3 weeks

Total for 50-service environment: 4–6 weeks, 1–2 engineers

The difference is not marginal. By these estimates, a team with OTel already in place completes an observability vendor migration roughly four times faster than a team without it, because the integration code is written once and kept.
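The "roughly four times" figure can be checked against the two totals above. The sketch below uses only the ranges from the "Total for 50-service environment" lines. Converting months to weeks at 52/12 and pairing low with low and high with high are assumptions of the sketch, not part of the estimates.

```python
WEEKS_PER_MONTH = 52 / 12  # about 4.33; conversion assumed for this sketch

# Ranges copied from the two "Total for 50-service environment" lines.
without_otel = {
    "weeks": (4 * WEEKS_PER_MONTH, 6 * WEEKS_PER_MONTH),  # 4-6 months
    "engineers": (2, 4),
}
with_otel = {
    "weeks": (4, 6),
    "engineers": (1, 2),
}


def engineer_weeks(plan):
    """Low and high total effort: duration multiplied by headcount."""
    (low_w, high_w), (low_e, high_e) = plan["weeks"], plan["engineers"]
    return (low_w * low_e, high_w * high_e)


def ratio(slow, fast):
    """Element-wise ratio of two (low, high) ranges."""
    return tuple(s / f for s, f in zip(slow, fast))


calendar_speedup = ratio(without_otel["weeks"], with_otel["weeks"])
effort_ratio = ratio(engineer_weeks(without_otel), engineer_weeks(with_otel))

print(f"calendar speedup: {calendar_speedup[0]:.1f}x to {calendar_speedup[1]:.1f}x")
print(f"engineer-week ratio: {effort_ratio[0]:.1f}x to {effort_ratio[1]:.1f}x")
```

On calendar time the ratio comes out at about 4.3 at both ends of the range. Counted in engineer-weeks it is about 8.7, because the OTel path also needs half the people.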

The OpenTelemetry architecture — what you are actually deploying

OpenTelemetry has three distinct architectural layers, and understanding which is which is the prerequisite for any serious adoption discussion. The API and SDK described earlier together form the first layer. Most confusion about OTel in practice comes from conflating these three layers.

OpenTelemetry architectural layers:

────────────────────────────────────────────────────────────────

Layer 1 — APPLICATION INSTRUMENTATION
  What it is      → code changes in your application to emit
                    spans, metrics, and log records
  Where it lives  → inside your application binary
  What it needs   → language-specific OTel SDK (e.g. OTel
                    Java, Python, Go, .NET, Node.js)
  Two flavours:
    Auto-instrum. → agent or library that instruments common
                    frameworks (HTTP servers, database
                    clients, gRPC, etc.) with no code changes
    Manual instrum. → explicit API calls in your code to
                    create spans and record metrics for
                    business-specific events

────────────────────────────────────────────────────────────────

Layer 2 — THE COLLECTOR
  What it is      → a separate service that receives, processes,
                    and exports telemetry
  Where it lives  → deployed alongside your applications,
                    typically as a DaemonSet in Kubernetes
  Why it exists   → decouples application code from backend
                    choice; handles batching, retry, filtering,
                    enrichment; enables multi-backend export

  Two deployment patterns:
    Agent         → one Collector per node (DaemonSet), apps
                    send telemetry to the node-local Collector
                    on port 4317 (via the node IP)
    Gateway       → a cluster-wide Collector tier that agents
                    forward to; does the heavy processing

────────────────────────────────────────────────────────────────

Layer 3 — SEMANTIC CONVENTIONS
  What it is      → a set of standardised attribute names for
                    common telemetry dimensions
  Where it lives  → applied at both instrumentation and
                    Collector layers
  Why it matters  → ensures that http.response.status_code
                    means the same thing in every service;
                    enables automatic correlation between signals

────────────────────────────────────────────────────────────────

Protocol: OTLP (OpenTelemetry Protocol)
  → the wire format used between every layer
  → gRPC-based (preferred) or HTTP-based (fallback)
  → supported as native ingestion by every major backend

The Collector is the piece that matters most

Of the three layers, the OpenTelemetry Collector is the one that does the most work and provides the most strategic flexibility. The Collector is where you implement:

Multi-backend export → send the same telemetry to both Splunk Observability Cloud and Dynatrace simultaneously during a migration, or send metrics to one backend and logs to another permanently
Tail-based sampling → the complex sampling strategy from Week 9 runs in the Collector, not the application — so changes to sampling policy do not require application redeploys
Attribute enrichment → add cluster name, region, environment tags, or Kubernetes metadata automatically to every telemetry item without touching application code
Data filtering and redaction → strip personally identifiable information, filter out health check noise, or drop high-cardinality dimensions before they hit your backend (and your bill)
Protocol translation → receive telemetry in legacy formats (Jaeger, Zipkin, Prometheus) and export to OTLP — a critical capability for migrating instrumentation incrementally

Think of the Collector as the observability equivalent of an Envoy sidecar or an API gateway. It sits at a choke point in your telemetry flow where cross-cutting concerns can be applied consistently, without scattering that logic across hundreds of application codebases. Changes to sampling, enrichment, filtering, or routing happen in one place — usually in GitOps-managed configuration — rather than in every application that emits telemetry.
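The choke-point idea is easier to see in miniature. The following is a sketch of the pattern only, not the Collector's implementation. The function names (enrich, drop_if, redact, run_pipeline), the sample records, and the cluster name are all hypothetical.

```python
from typing import Callable, Optional

Record = dict[str, str]
Processor = Callable[[Record], Optional[Record]]  # None means "drop"


def enrich(static: Record) -> Processor:
    """Add attributes without overwriting ones the application set."""
    def step(rec: Record) -> Optional[Record]:
        return {**static, **rec}
    return step


def drop_if(key: str, value: str) -> Processor:
    """Filter out records where rec[key] == value (e.g. health checks)."""
    def step(rec: Record) -> Optional[Record]:
        return None if rec.get(key) == value else rec
    return step


def redact(*keys: str) -> Processor:
    """Remove sensitive or high-cardinality attributes."""
    def step(rec: Record) -> Optional[Record]:
        return {k: v for k, v in rec.items() if k not in keys}
    return step


def run_pipeline(records, processors, exporters):
    """Apply processors in order, then fan each survivor out to every exporter."""
    for rec in records:
        current: Optional[Record] = rec
        for proc in processors:
            current = proc(current)
            if current is None:
                break
        if current is not None:
            for export in exporters:
                export(dict(current))  # copy so backends stay independent


primary: list[Record] = []
secondary: list[Record] = []
run_pipeline(
    records=[
        {"http.route": "/healthz", "user.email": "a@example.com"},
        {"http.route": "/api/v1/payments", "user.email": "b@example.com"},
    ],
    processors=[
        enrich({"k8s.cluster.name": "prod-eu-1", "deployment.environment": "prod"}),
        drop_if("http.route", "/healthz"),
        redact("user.email"),
    ],
    exporters=[primary.append, secondary.append],
)
```

Both backends receive the same single record: enriched, with the health check filtered out and the email removed. None of that logic lives in the services that emitted the telemetry.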

Semantic conventions — the layer that makes correlation automatic

Week 9 ended with a list of OpenTelemetry semantic conventions and a strong claim: consistent field naming across your logs, metrics, and traces is what makes automatic correlation possible. That claim is worth unpacking further, because the power of semantic conventions is not obvious until you have seen what their absence costs.

Semantic conventions are a published, versioned set of standard attribute names for common telemetry dimensions. They define that the service name is always service.name, never svc or service or app_name. The HTTP status code is always http.response.status_code, never http_status or status_code or response_code. The W3C trace ID is always trace_id, with the same format and propagation rules in every language’s SDK.

# The OTel semantic convention subset you should adopt immediately
# — these are the highest-leverage attributes for SRE workflows

# Resource attributes (emitted on every signal)
service.name          → logical service identifier
service.version       → deployed version (links to deployments)
service.namespace     → logical grouping (e.g. "payments")
deployment.environment → prod / staging / dev
k8s.cluster.name      → Kubernetes cluster identifier
k8s.namespace.name    → Kubernetes namespace
k8s.pod.name          → pod name (for per-pod troubleshooting)
k8s.node.name         → node the pod runs on

# HTTP attributes (on HTTP request spans and metrics)
http.request.method   → GET, POST, etc.
http.response.status_code → 200, 404, 500, etc.
http.route            → /api/v1/payments (NOT the full URL;
                        full URL would cause cardinality blowup)
url.scheme            → http or https

# Database attributes (on DB client spans)
db.system.name        → postgresql, redis, mysql
db.operation.name     → SELECT, INSERT, etc.
db.namespace          → logical database name

# Messaging attributes (on queue/stream spans)
messaging.system      → kafka, rabbitmq
messaging.destination.name → topic or queue name
messaging.operation   → publish, receive, process

# Error attributes (on error events and logs)
exception.type        → the exception class name
exception.message     → the exception message
exception.stacktrace  → the full stack trace

When every service in your system emits these consistently — in logs, in metric labels, and in trace span attributes — your observability backend does the correlation work for you. A burn rate alert fires with service.name=payment-svc. The dashboard links to traces filtered by that attribute. The traces link to logs carrying the same trace_id. The logs carry the same service.name, which links to the service catalog entry. The four pillars from Week 9 stitch together automatically, not through manual correlation queries at 3 AM.
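A minimal sketch of that alert-to-trace-to-log hop, using in-memory lists in place of a real backend. The records, the status field, and the helper names are made up for illustration. The attribute names service.name and trace_id are the ones discussed above.

```python
alert = {"service.name": "payment-svc", "slo": "availability"}

traces = [
    {"trace_id": "t1", "service.name": "payment-svc", "status": "ERROR"},
    {"trace_id": "t2", "service.name": "cart-svc", "status": "ERROR"},
    {"trace_id": "t3", "service.name": "payment-svc", "status": "OK"},
]

logs = [
    {"trace_id": "t1", "service.name": "payment-svc", "body": "card declined: timeout"},
    {"trace_id": "t2", "service.name": "cart-svc", "body": "cart not found"},
    {"trace_id": "t3", "service.name": "payment-svc", "body": "charge ok"},
]


def failing_traces_for(alert, traces):
    """Hop 1: alert to traces, joined on the shared service.name attribute."""
    return [
        t for t in traces
        if t["service.name"] == alert["service.name"] and t["status"] == "ERROR"
    ]


def logs_for(selected_traces, logs):
    """Hop 2: traces to logs, joined on the shared trace_id."""
    wanted = {t["trace_id"] for t in selected_traces}
    return [entry for entry in logs if entry["trace_id"] in wanted]


suspects = failing_traces_for(alert, traces)
evidence = logs_for(suspects, logs)
print([entry["body"] for entry in evidence])
```

Each hop is a plain equality join. If one service had emitted svc instead of service.name, the first hop would silently miss it, which is the failure mode consistent naming removes.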

Why custom conventions are the wrong answer

Many teams, particularly those with existing instrumentation, react to semantic conventions with some version of: “we already use svc everywhere, it would be easier to keep our own names.” This is wrong for three reasons, all of which compound over time.

Your observability backend has built-in dashboards, alerts, and correlation rules that expect the OTel names. Dynatrace, Grafana, and every other modern platform use the OTel conventions as defaults. Using custom names means disabling every out-of-the-box feature and rebuilding equivalent logic manually.
Your auto-instrumentation libraries emit the OTel names. Every OTel auto-instrumentation library — the Java agent, the Python instrumentations, the Go instrumentations — emits conventions-compliant attributes. If your manual instrumentation uses custom names, you have two naming schemes in the same service and every query has to handle both.
Every new engineer who joins has to learn your custom convention. OTel conventions are documented, searchable, and shared across the industry. Custom conventions are tribal knowledge that burns onboarding time and creates inconsistency every time someone forgets the rule.

Adopt the OTel conventions as they are. If you have existing instrumentation with different names, use the Collector’s attributes processor (or the resource processor, for resource attributes such as service.name) to map old names to convention-compliant names during migration, so applications can be updated over time without breaking queries immediately.
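The mapping itself is simple, which is part of the argument for doing it centrally. Here is a sketch of the renaming logic in plain Python. It is not Collector configuration, and the LEGACY_TO_OTEL table and normalise function are hypothetical names. The legacy keys are the ones quoted earlier as examples of what teams tend to use.

```python
LEGACY_TO_OTEL = {
    "svc": "service.name",
    "app_name": "service.name",
    "http_status": "http.response.status_code",
    "status_code": "http.response.status_code",
    "response_code": "http.response.status_code",
}


def normalise(attributes: dict) -> dict:
    """Return a copy with legacy keys renamed to convention names.

    If a record already carries the convention-compliant key, that value
    wins and the legacy duplicate is dropped.
    """
    out = {k: v for k, v in attributes.items() if k not in LEGACY_TO_OTEL}
    for key, value in attributes.items():
        if key in LEGACY_TO_OTEL:
            out.setdefault(LEGACY_TO_OTEL[key], value)
    return out


legacy_record = {"svc": "payment-svc", "http_status": 500, "region": "eu"}
print(normalise(legacy_record))
```

Letting the compliant key win means a half-migrated service that emits both names produces one consistent value rather than two competing ones.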

OpenTelemetry in a service mesh environment

For teams running Istio with STRICT mTLS — which Week 9 touched on briefly — the relationship between the service mesh and OpenTelemetry deserves specific attention. Three layers of telemetry exist in this environment, and all three need to fit together cleanly.

Telemetry layers in an Istio mTLS environment:

────────────────────────────────────────────────────────────────

Layer A — Envoy sidecar telemetry (network-level)
  What it sees   → request arrival, response departure, status
                   code at the proxy boundary, TLS events,
                   connection pool state
  What emits it  → Envoy, configured via Istio Telemetry API
  Where it goes  → can emit OTel spans and metrics directly
                   via the OpenTelemetry tracing provider
  Strength       → zero application-code changes required
  Limitation     → no business context, no app-internal timing,
                   no database or downstream call granularity

Layer B — Application telemetry (business-level)
  What it sees   → everything inside the application:
                   business logic, database calls, cache hits,
                   feature flag evaluations, error context
  What emits it  → OTel SDK in the application
  Where it goes  → OTLP to local Collector → backends
  Strength       → rich business context, full call granularity
  Limitation     → requires trace context propagation from
                   inbound to outbound calls inside the
                   application (the sidecar cannot link the
                   two for you, with or without mTLS)

Layer C — Kubernetes and infrastructure telemetry
  What it sees   → pod lifecycle, resource utilisation,
                   scheduling events, node health
  What emits it  → Kubernetes, kubelet, cAdvisor, node exporters
  Where it goes  → scraped by Collector or Prometheus,
                   forwarded to backends
  Strength       → infrastructure saturation and health
  Limitation     → no connection to business context

────────────────────────────────────────────────────────────────

The integration that makes them one system:

  All three layers emit with consistent service.name,
  k8s.pod.name, and k8s.namespace.name attributes.
  Envoy spans become parents of application spans via
  trace context propagation in the request headers.
  Kubernetes events correlate with spans via pod name
  and timestamp, surfaced in the observability UI.

  When done correctly, an engineer investigating a
  slow request can see:
    → Envoy span showing 1.2s inbound latency
    → Application span showing 50ms of app logic
    → Application span showing 1.1s database call
    → Pod metrics showing memory pressure at the same time
    → Kubernetes event showing pod rescheduling
  — all automatically correlated by shared attributes.
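The parent-child link between Envoy spans and application spans rides on the W3C traceparent header. OTel SDK propagators and auto-instrumentation normally handle this for you. The sketch below only shows what "propagating context" means at the header level, for version 00 of the format. The helper names parse_traceparent and outbound_headers are hypothetical.

```python
import re
import secrets

# version "00" - 32 hex trace id - 16 hex parent span id - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, flags), or None if malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero identifiers are invalid in the W3C Trace Context format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, flags


def outbound_headers(inbound: dict[str, str]) -> dict[str, str]:
    """Headers for a downstream call that continue the inbound trace.

    Keeps the trace id and flags, and substitutes this service's own
    span id as the new parent. Starts a fresh sampled trace if the
    inbound header is missing or invalid.
    """
    parsed = parse_traceparent(inbound.get("traceparent", ""))
    if parsed is None:
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _parent_span_id, flags = parsed
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outbound_headers(inbound))
```

An application that drops this header between its inbound handler and its outbound client produces a new, unrelated trace for every downstream call. The proxy cannot repair that, because it has no way to tell which outbound request belongs to which inbound one.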

Istio Telemetry API and OpenTelemetry

Istio’s Telemetry API can be configured to emit OTel-native spans and metrics directly, rather than the legacy Zipkin or Jaeger formats. This is the correct configuration for any new Istio deployment and the target for any existing deployment to migrate toward. The configuration lives in a Telemetry CRD that is straightforward to manage via GitOps:

apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: otel-tracing
  namespace: istio-system
spec:
  # Applied to all workloads in the mesh
  tracing:
  - providers:
    # Must match an OpenTelemetry extension provider defined
    # under meshConfig.extensionProviders
    - name: otel
    randomSamplingPercentage: 10.0
    customTags:
      # Ensure resource attributes propagate from Envoy
      environment:
        literal:
          value: production
      cluster:
        environment:
          name: K8S_CLUSTER_NAME
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
      tagOverrides:
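        # Override a tag on Envoy-emitted metrics. The value is
        # an attribute expression evaluated by the proxy; check
        # the syntax your Istio version accepts before applying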
        destination_service:
          operation: UPSERT
          value: "%{DESTINATION_SERVICE_NAME}"

Automation-first principle for mesh telemetry: the Istio Telemetry configuration and the OpenTelemetry Collector configuration should both live in Git and be deployed via Argo CD, not configured by hand through kubectl. Changes to sampling policy, attribute enrichment, or backend routing become pull requests with review and rollback — the same engineering discipline you apply to application code. Mesh telemetry configuration that is edited directly in-cluster is operational toil that will drift between environments and be forgotten during incident postmortems.

The migration path — how to adopt OTel without a rewrite

The realistic constraint for most teams is that they have existing instrumentation — often years of it — using vendor-specific agents, Prometheus client libraries, or legacy tracing SDKs. “Rewrite everything to OTel” is not a credible migration plan. The good news is that it is also not necessary. OTel is explicitly designed to allow incremental adoption, and the Collector is the piece that makes that adoption path work.

The four-phase adoption roadmap

Phase 1 — Deploy the Collector (no application changes)

Week 1–2
  → Deploy OTel Collector as a DaemonSet in each cluster
  → Configure OTLP, Prometheus, and Jaeger/Zipkin receivers
    (accept telemetry in every current format your apps emit)
  → Export to your existing observability backend(s)
  → Validate data parity: metrics in the backend are identical
    to what was sent before the Collector was introduced

At end of Phase 1: no app changes, Collector is in the path,
                   observability unchanged from the
                   user's perspective

────────────────────────────────────────────────────────────────

Phase 2 — Enable OTel auto-instrumentation for new services

Week 3–6
  → Update service scaffold templates to include OTel
    auto-instrumentation libraries by default
  → New services emit OTel-native telemetry from day one
  → Collector receives OTLP directly from new services,
    legacy formats from existing services — both exported
    consistently to backend

At end of Phase 2: all NEW services are OTel-native;
                   existing services untouched

────────────────────────────────────────────────────────────────

Phase 3 — Migrate existing services opportunistically

Month 2–6
  → When a service is actively being modified for any
    reason (feature work, dependency upgrade, refactor),
    migrate its instrumentation to OTel as part of the
    change
  → Priority services for migration: those involved in
    frequent incidents, those with custom instrumentation
    that breaks during upgrades, those with the highest
    operational burden
  → No "instrumentation migration sprint" — this is
    opportunistic work that rides on other development

At end of Phase 3: 60–80% of services are OTel-native,
                   depending on development velocity

────────────────────────────────────────────────────────────────

Phase 4 — Complete the migration

Month 6–12
  → Remaining services are migrated as deliberate
    reliability investments, justified by specific
    operational pain their legacy instrumentation causes
  → Legacy instrumentation libraries deprecated
  → Collector receivers for legacy formats removed
  → OTel becomes the only instrumentation standard

At end of Phase 4: 100% OTel adoption; vendor migration
                   option is now available as a
                   configuration change

This roadmap is aggressive but achievable, and it is specifically designed to avoid the “big bang migration” that kills most instrumentation standardisation efforts. Rewrite-everything observability migrations tend to stall long before they finish. The adoptions that succeed usually follow some version of the pattern above: deploy the compatibility layer first, default new work to the new standard, migrate existing work opportunistically.

The Collector configuration — a production-grade starting point

Most OTel Collector examples online show minimal configurations that are suitable for getting started but insufficient for production use. Here is a closer-to-production Collector configuration for a Kubernetes environment, with the pieces that matter for the SRE framework this series is building.

# otel-collector-config.yaml
# Production-grade Collector configuration for a Kubernetes
# environment with Istio mTLS and SLO-based alerting

receivers:
  # Accept OTLP from applications on standard ports
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Scrape Prometheus metrics from legacy services
  # during the migration period
  prometheus:
    config:
      scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true

  # Host metrics (node-level CPU, memory, disk, network)
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

processors:
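  # Enrich all telemetry with Kubernetes metadata.
  # k8s.cluster.name cannot be read from pod metadata;
  # set it separately (for example with the resource or
  # resourcedetection processor)
  k8sattributes:
    auth_type: "serviceAccount"
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.node.name
        - k8s.deployment.name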
      labels:
        - tag_name: app
          key: app.kubernetes.io/name
          from: pod

  # Batch telemetry for efficient export
  batch:
    send_batch_size: 10000
    timeout: 10s

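  # Tail-based sampling for traces (keep all errors and
  # slow traces, sample the rest at 10%). Every span of a
  # trace must reach the same Collector instance, so run
  # this in the gateway tier rather than in per-node agents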
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

  # Drop high-cardinality attributes that would blow
  # up metric cardinality
  attributes/drop-high-cardinality:
    actions:
      - key: user.id
        action: delete
      - key: request.id
        action: delete

  # Memory limit to prevent Collector OOM
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 25

exporters:
  # Primary backend — current observability platform
  otlp/primary:
    endpoint: "splunk-otel-collector.observability:4317"
    tls:
      insecure: false

  # Secondary backend: parallel-run during migration.
  # Outside the migration window, remove it here and from
  # every pipeline's exporters list
  otlp/secondary:
    endpoint: "dynatrace-collector.observability:4317"
    tls:
      insecure: false

  # Prometheus remote-write for services that still
  # query Prometheus directly
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [memory_limiter, k8sattributes,
                   tail_sampling, batch]
      exporters:  [otlp/primary, otlp/secondary]
    metrics:
      receivers:  [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, k8sattributes,
                   attributes/drop-high-cardinality, batch]
      exporters:  [otlp/primary, otlp/secondary,
                   prometheusremotewrite]
    logs:
      receivers:  [otlp]
      processors: [memory_limiter, k8sattributes,
                   attributes/drop-high-cardinality, batch]
      exporters:  [otlp/primary, otlp/secondary]

Three properties of this configuration deserve specific attention. First, the dual-export pattern (otlp/primary and otlp/secondary) is what makes observability vendor migrations painless — the same telemetry goes to both vendors for the duration of the parallel-run period, and the cutover is a configuration change rather than a rewrite. Second, the tail-based sampling configuration implements exactly the policy Week 9 recommended: keep all error traces and all slow traces unconditionally, sample everything else at 10%. Third, the attribute enrichment via k8sattributes processor adds Kubernetes context to every telemetry item automatically — no application-code changes required.
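The sampling policy reads as three rules evaluated per trace, and a trace is kept if any one of them says keep. A sketch of that decision in plain Python follows. This is not the Collector's code: the Span class and keep_trace function are hypothetical, the slowest span stands in for whole-trace duration, and using a hash of the trace ID for the 10% bucket is an assumption made so the result is repeatable.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms=1000.0, sampling_percentage=10):
    """Decide once per trace, after all of its spans have arrived."""
    if not spans:
        return False
    # Rule 1: keep every trace that contains an error.
    if any(s.is_error for s in spans):
        return True
    # Rule 2: keep every slow trace (simplified to the slowest span).
    if max(s.duration_ms for s in spans) > latency_threshold_ms:
        return True
    # Rule 3: keep a fixed share of the rest, chosen by trace id so the
    # same trace always gets the same answer.
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sampling_percentage


print(keep_trace([Span("t-err", 20.0, is_error=True)]))
print(keep_trace([Span("t-slow", 1500.0)]))
kept = sum(keep_trace([Span(f"trace-{i}", 50.0)]) for i in range(10_000))
print(f"kept {kept} of 10000 fast, successful traces")
```

The function needs the complete list of spans for a trace before it can answer, which is why this kind of sampling has to run where whole traces are visible rather than inside individual services.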

OpenTelemetry and the observability migration decision

For teams currently evaluating or executing an observability platform migration — the Splunk Observability Cloud to Dynatrace evaluation is a representative example — OTel adoption is not a separate decision from the vendor migration. It is part of the migration. And the sequencing matters.

The correct sequencing:

Adopt OTel first, at least for the Collector layer. Deploy the Collector in your current environment. Point it at your current backend. Validate data parity. This is Phase 1 of the roadmap above and it should happen before the vendor decision is finalised.
Evaluate the new vendor using OTel-emitted telemetry. Your proof-of-concept should demonstrate that OTel-native telemetry — the exact same format your production applications will eventually emit — works correctly end-to-end with the new vendor. This is a stronger POC than one that uses the vendor’s native agent, because it validates the production migration path, not just the product’s features.
Execute the vendor migration as a Collector configuration change. With OTel in place, the migration from Vendor A to Vendor B becomes: change the Collector’s export endpoint, add the new backend as a secondary export during the parallel-run period, cut over primary export when confident, remove the old backend.

Teams that sequence the OTel adoption after the vendor migration pay the migration cost twice: once for the current migration, once for the next one. Teams that sequence it before — or concurrent with the vendor evaluation — pay it once.

The strategic framing for leadership: OpenTelemetry adoption is not a technical choice; it is a strategic optionality purchase. The value of OTel is not what it does today — it is the cost it prevents in every future observability platform decision. Presented this way, OTel is one of the highest-ROI engineering investments available, because the cost of adoption is small and fixed, while the cost of not adopting it compounds with every service your team ships.

Five concrete starts for this week

Deploy the OTel Collector in a non-production cluster. Use the Helm chart or the Kubernetes Operator. Configure it to receive OTLP and to export to your current backend. This is a half-day exercise and it is the foundation everything else in the roadmap builds on.
Enable OTel auto-instrumentation for one service. Pick a service that is not critical enough for the exercise to be scary, but large enough that the instrumentation output is meaningful. Deploy the language-specific auto-instrumentation library (Java agent, Python instrumentation, etc.) with zero code changes. Observe what it emits.
Audit your existing telemetry against semantic conventions. For your most critical service, catalogue the attribute names currently used in logs, metrics, and traces. Map them to the OpenTelemetry conventions. The gap analysis is your first-quarter instrumentation backlog.
Add OTel to your service scaffold template. If your team uses Backstage or a similar internal developer platform, modify the service creation template to include OTel auto-instrumentation by default. From this point forward, every new service is OTel-native without requiring a conscious decision — which is how standards actually get adopted at scale.
Configure Istio Telemetry API for OTel output. If you run a service mesh, the mesh-layer telemetry should flow through OTel rather than through legacy Zipkin/Jaeger paths. This is a GitOps configuration change that takes hours, not weeks, and immediately benefits every service in the mesh.

Next week: Structured logging done right — from printf to queryable events. We will cover the log-to-metric transition, the field conventions that enable log-trace correlation, and the instrumentation patterns that make your log store a first-class participant in the observability framework rather than a parallel, disconnected data silo.

#SRE #OpenTelemetry #OTel #Observability #GoogleSRE #CNCF #DistributedTracing #Reliability #SiteReliabilityEngineering #DevOps

Part of the 52-Week SRE Blog Series · Week 10 of 52

Chronicles of a SRE
