Logs and metrics

First, deploy the monitoring components. Two different setups are available:

  1. everything: This configuration collects logs & metrics from user containers, the build controller, and Istio requests.
    bazel run config/monitoring:everything.apply
  2. everything-dev: This configuration collects everything in (1) plus Elafros controller logs.
    bazel run config/monitoring:everything-dev.apply

Accessing logs

Run:

kubectl proxy

Then open the Kibana UI at this link (it might take a couple of minutes for the proxy to start serving it). When Kibana is opened for the first time, it will ask you to create an index; accept the default options as they are. As logs get ingested, new fields will be discovered. To have them indexed, go to Management -> Index Patterns -> Refresh button (top right) -> Refresh fields.

Accessing metrics

Run:

kubectl port-forward -n monitoring $(kubectl get pods -n monitoring --selector=app=grafana --output=jsonpath="{.items[0].metadata.name}") 3000

Then open the Grafana UI at http://localhost:3000.

Accessing per-request traces

First, open the Kibana UI as shown above. Browse to Management -> Index Patterns -> +Create Index Pattern, type "zipkin*" (without the quotes) into the "Index pattern" text field, and click the "Create" button. This creates a new index pattern for the per-request traces captured by Zipkin. This is a one-time step, needed only for fresh installations.

Next, start the proxy if it is not already running:

kubectl proxy

Then open the Zipkin UI at this link. Click "Find Traces" to see the latest traces. You can search for a trace ID or look at the traces of a specific application within this UI. Click on a trace for a detailed view of a specific call.

To see a demo of distributed tracing, deploy the Telemetry sample, send some traffic to it, and explore the traces it generates from the Zipkin UI.

Default metrics

The following metrics are collected by default:

  • Elafros controller metrics
  • Istio metrics (Mixer, Envoy, and Pilot)
  • Node and pod metrics

Several other collectors are pre-configured but not enabled. To see the full list, browse to the config/monitoring/prometheus-exporter and config/monitoring/prometheus-servicemonitor folders; deploy any of them with kubectl apply -f.

Default logs

The deployment above enables collection of the following logs:

  • stdout & stderr from all ela-containers
  • stdout & stderr from build-controller

To enable log collection from other containers and destinations, edit fluentd-es-configmap.yaml (search for "fluentd-containers.log" as a starting point). Then run the following:

kubectl replace -f config/monitoring/fluentd/fluentd-es-configmap.yaml
kubectl replace -f config/monitoring/fluentd/fluentd-es-ds.yaml

Note: This step is a workaround; a plugin mechanism for defining additional logs to collect will be added later.

Metrics troubleshooting

You can use the Prometheus web UI to troubleshoot publishing and service-discovery issues for metrics. To access the web UI, forward the Prometheus server to your machine:

kubectl port-forward -n monitoring $(kubectl get pods -n monitoring --selector=app=prometheus --output=jsonpath="{.items[0].metadata.name}") 9090

Then browse to http://localhost:9090 to access the UI:

  • To see the targets that are being scraped, go to Status -> Targets
  • To see what Prometheus service discovery is picking up vs. dropping, go to Status -> Service Discovery

Generating metrics

See the Telemetry sample for deploying a dedicated instance of Prometheus and emitting metrics to it.

If you want to generate metrics within Elafros services and send them to the shared instance of Prometheus, follow the steps below to create a counter metric (a complete runnable sketch combining the code steps appears after the list):

  1. Go through the Prometheus Documentation and read the Data model and Metric types sections.
  2. Create a top-level variable in your Go file and initialize it in init(). For example:
    import "github.com/prometheus/client_golang/prometheus"

    var myCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "elafros",
        Name:      "mycomponent_something_count",
        Help:      "Counter to keep track of something in my component",
    }, []string{"status"})

    func init() {
        prometheus.MustRegister(myCounter)
    }
  3. In the code you want to instrument, increment the counter with the appropriate label values. For example:
    err := doSomething()
    if err == nil {
        myCounter.With(prometheus.Labels{"status": "success"}).Inc()
    } else {
        myCounter.With(prometheus.Labels{"status": "failure"}).Inc()
    }
  4. Start an HTTP listener to serve as the metrics endpoint for Prometheus scraping. This step and the ones after it are needed only once per service; ela-controller is already set up for metrics scraping, so you can skip the rest of these steps if you are targeting ela-controller:
    import "github.com/prometheus/client_golang/prometheus/promhttp"

    // In your main() func
    srv := &http.Server{Addr: ":9090"}
    http.Handle("/metrics", promhttp.Handler())
    go func() {
        if err := srv.ListenAndServe(); err != nil {
            glog.Infof("Httpserver: ListenAndServe() finished with error: %s", err)
        }
    }()

    // Wait for the service to shutdown
    <-stopCh

    // Close the http server gracefully
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    srv.Shutdown(ctx)

  5. Add a Service for the metrics HTTP endpoint:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: myappname
    prometheus: myappname
  name: myappname
  namespace: mynamespace
spec:
  ports:
  - name: metrics
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: myappname # put the appropriate value here to select your application
  6. Add a ServiceMonitor to tell Prometheus to discover the pods and scrape the Service defined above:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myappname
  namespace: monitoring
  labels:
    monitor-category: ela-system # Shared Prometheus instance only targets 'k8s', 'istio', 'node',
                                 # 'prometheus' or 'ela-system' - if you pick something else,
                                 # you need to deploy your own Prometheus instance or edit shared
                                 # instance to target the new category
spec:
  selector:
    matchLabels:
      app: myappname
      prometheus: myappname
  namespaceSelector:
    matchNames:
    - mynamespace
  endpoints:
  - port: metrics
    interval: 30s
  7. Add a dashboard for your metrics. You can find examples under the config/grafana/dashboard-definition folder. An easy way to generate JSON definitions is to build the dashboard in the Grafana UI (make sure to log in as the admin user) and export the JSON from it.

  8. Add the YAML files to BUILD files.

  9. Deploy the changes with bazel.

  10. Validate the metrics flow through either the Grafana UI or the Prometheus UI (see the Metrics troubleshooting section above to enable the Prometheus UI).
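
For reference, the pieces above fit together roughly as shown below. This is a minimal, self-contained sketch rather than actual Elafros code: the metric name comes from the examples above, and wiring shutdown to an interrupt signal is an illustrative stand-in for your service's own stop channel.

package main

import (
    "context"
    "flag"
    "net/http"
    "os"
    "os/signal"
    "time"

    "github.com/golang/glog"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var myCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
    Namespace: "elafros",
    Name:      "mycomponent_something_count",
    Help:      "Counter to keep track of something in my component",
}, []string{"status"})

func init() {
    prometheus.MustRegister(myCounter)
}

func main() {
    flag.Parse() // glog reads its flags (e.g. -logtostderr) during flag parsing

    // Serve the /metrics endpoint that Prometheus scrapes.
    srv := &http.Server{Addr: ":9090"}
    http.Handle("/metrics", promhttp.Handler())
    go func() {
        if err := srv.ListenAndServe(); err != nil {
            glog.Infof("Httpserver: ListenAndServe() finished with error: %s", err)
        }
    }()

    // Increment the counter wherever the instrumented event happens.
    myCounter.With(prometheus.Labels{"status": "success"}).Inc()

    // Wait for an interrupt (stand-in for your service's stop channel).
    stopCh := make(chan os.Signal, 1)
    signal.Notify(stopCh, os.Interrupt)
    <-stopCh

    // Close the http server gracefully.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    srv.Shutdown(ctx)
}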

Generating logs

Use glog to write logs in your code. In your container spec, add the following args to redirect the logs to stderr:

args:
- "-logtostderr=true"
- "-stderrthreshold=INFO"

See the helloworld sample's configuration file for an example.
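
On the code side, a minimal sketch of glog usage looks like the following (function names here are illustrative, not from the Elafros codebase). glog registers its flags on the default FlagSet, so main should call flag.Parse() early for the container args above to take effect.

package main

import (
    "flag"

    "github.com/golang/glog"
)

// doSomething is a hypothetical stand-in for real work.
func doSomething() error { return nil }

func main() {
    flag.Parse()       // picks up -logtostderr and -stderrthreshold from the container args
    defer glog.Flush() // flush any buffered log entries on exit

    glog.Info("service starting")
    if err := doSomething(); err != nil {
        glog.Errorf("doSomething failed: %v", err)
    }
    glog.V(2).Info("verbose detail, emitted only when run with -v=2 or higher")
}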

Distributed tracing with Zipkin

See the Telemetry sample for an example usage of OpenZipkin's Go client library.
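
Assuming the client library in question is openzipkin/zipkin-go, basic usage looks roughly like the sketch below; the collector URL and service name are assumptions, so consult the Telemetry sample for the actual wiring.

package main

import (
    "github.com/openzipkin/zipkin-go"
    zipkinhttp "github.com/openzipkin/zipkin-go/reporter/http"
)

func main() {
    // Report finished spans to the Zipkin collector. This in-cluster URL is
    // an assumption; point it at wherever your Zipkin instance is reachable.
    reporter := zipkinhttp.NewReporter("http://zipkin.monitoring:9411/api/v2/spans")
    defer reporter.Close()

    // The local endpoint identifies this service in recorded traces.
    endpoint, err := zipkin.NewEndpoint("myservice", "localhost:0")
    if err != nil {
        panic(err)
    }
    tracer, err := zipkin.NewTracer(reporter, zipkin.WithLocalEndpoint(endpoint))
    if err != nil {
        panic(err)
    }

    // Wrap a unit of work in a span; Finish records its duration.
    span := tracer.StartSpan("do-something")
    // ... the work being traced goes here ...
    span.Finish()
}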