---
title: Troubleshooting Guide
description: Advice on tackling common problems with Istio
weight: 40
force_inline_toc: true
draft: true
---
Below is a list of solutions to common problems.
## 503 errors while reconfiguring service routes
When setting route rules to direct traffic to specific versions (subsets) of a service, care must be taken to ensure
that the subsets are available before they are used in the routes. Otherwise, calls to the service may return
503 errors during a reconfiguration period.
Creating both the `VirtualServices` and `DestinationRules` that define the corresponding subsets using a single `istioctl`
call (e.g., `istioctl create -f myVirtualServiceAndDestinationRule.yaml`) is not sufficient because the
resources propagate (from the configuration server, i.e., the Kubernetes API server) to the Pilot instances in an eventually consistent manner. If the
`VirtualService` using the subsets arrives before the `DestinationRule` where the subsets are defined, the Envoy configuration generated by Pilot will refer to non-existent upstream pools, resulting in HTTP 503 errors until all configuration objects are available to Pilot.
To make sure services will have zero down-time when configuring routes with subsets, follow a "make-before-break" process as described below:
* When adding new subsets:

    1. Update `DestinationRules` to add a new subset first, before updating any `VirtualServices` that use it (see the sketch after this list). Apply the rule using `istioctl` or any platform-specific tooling.
    1. Wait a few seconds for the `DestinationRule` configuration to propagate to the Envoys.
    1. Update the `VirtualService` to refer to the newly added subsets.

* When removing subsets:

    1. Update `VirtualServices` to remove any references to a subset, before removing the subset from a `DestinationRule`.
    1. Wait a few seconds for the `VirtualService` configuration to propagate to the Envoys.
    1. Update the `DestinationRule` to remove the unused subsets.
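For example, if you were introducing a new `v2` subset of a hypothetical `helloworld` service, a minimal `DestinationRule` sketch like the following (the service name and `version` labels are illustrative) would be applied first and given a few seconds to propagate before any `VirtualService` sets `subset: v2`:

{{< text yaml >}}
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: helloworld
spec:
  host: helloworld.default.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2   # added before any VirtualService refers to subset v2
    labels:
      version: v2
{{< /text >}}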
## Route rules have no effect on ingress gateway requests
Let's assume you are using an ingress `Gateway` and corresponding `VirtualService` to access an internal service.
For example, your `VirtualService` looks something like this:
{{< text yaml >}}
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - "myapp.com" # or maybe "*" if you are testing without DNS using the ingress-gateway IP (e.g., http://1.2.3.4/hello)
  gateways:
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /hello
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
  - match:
    ...
{{< /text >}}
You also have a `VirtualService` which routes traffic for the helloworld service to a particular subset:
{{< text yaml >}}
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
  - helloworld.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1
{{< /text >}}
In this situation you will notice that requests to the helloworld service via the ingress gateway will
not be directed to subset v1 but instead will continue to use default round-robin routing.
The ingress requests are using the gateway host (e.g., `myapp.com`)
which will activate the rules in the myapp `VirtualService` that routes to any endpoint in the helloworld service.
On the other hand, internal requests with host `helloworld.default.svc.cluster.local` will use the
helloworld `VirtualService` which directs traffic exclusively to subset v1.
To control the traffic from the gateway, you need to include the subset rule in the myapp `VirtualService`:
{{< text yaml >}}
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - "myapp.com" # or maybe "*" if you are testing without DNS using the ingress-gateway IP (e.g., http://1.2.3.4/hello)
  gateways:
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /hello
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1
  - match:
    ...
{{< /text >}}
Alternatively, you can combine both `VirtualServices` into one unit if possible:
{{< text yaml >}}
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp.com # cannot use "*" here since this is being combined with the mesh services
  - helloworld.default.svc.cluster.local
  gateways:
  - mesh # applies internally as well as externally
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /hello
      gateways:
      - myapp-gateway #restricts this rule to apply only to ingress gateway
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1
  - match:
    - gateways:
      - mesh # applies to all services inside the mesh
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1
{{< /text >}}
## Route rules have no effect on my application
If route rules are working perfectly for the [Bookinfo](/docs/examples/bookinfo/) sample,
but similar version routing rules have no effect on your own application, it may be that
your Kubernetes services need to be changed slightly.
Kubernetes services must adhere to certain restrictions in order to take advantage of
Istio's L7 routing features.
Refer to the [sidecar injection documentation](/docs/setup/kubernetes/sidecar-injection/#pod-spec-requirements)
for details.
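One requirement that is commonly missed is that service ports must be named with a protocol prefix (such as `http`) for Istio to apply its L7 routing. A minimal sketch of a conforming `Service` follows; the `myapp` names and port number are illustrative:

{{< text yaml >}}
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp   # pods should also carry an `app` label
  ports:
  - name: http   # the protocol-prefixed port name lets Istio treat this traffic as HTTP
    port: 8080
{{< /text >}}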
## Verifying connectivity to Istio Pilot
Verifying connectivity to Pilot is a useful troubleshooting step. Every proxy container in the service mesh should be able to communicate with Pilot. This can be accomplished in a few simple steps:
1. Get the name of the Istio Ingress pod:
{{< text bash >}}
$ INGRESS_POD_NAME=$(kubectl get po -n istio-system | grep ingressgateway\- | awk '{print$1}'); echo ${INGRESS_POD_NAME};
{{< /text >}}
1. Exec into the Istio Ingress pod:
{{< text bash >}}
$ kubectl exec -it $INGRESS_POD_NAME -n istio-system /bin/bash
{{< /text >}}
1. Test connectivity to Pilot using cURL. The following example calls the v1 registration API using the default Pilot configuration parameters, with mutual TLS enabled:
{{< text bash >}}
$ curl -k --cert /etc/certs/cert-chain.pem --cacert /etc/certs/root-cert.pem --key /etc/certs/key.pem https://istio-pilot:15003/v1/registration
{{< /text >}}
If mutual TLS is disabled:
{{< text bash >}}
$ curl http://istio-pilot:15003/v1/registration
{{< /text >}}
You should receive a response listing the "service-key" and "hosts" for each service in the mesh.
## No traces appearing in Zipkin when running Istio locally on Mac
Istio is installed and everything seems to be working except there are no traces showing up in Zipkin when there
should be.
This may be caused by a known [Docker issue](https://github.com/docker/for-mac/issues/1260) where the time inside
containers may skew significantly from the time on the host machine. If this is the case,
when you select a very long date range in Zipkin you will see the traces appearing as much as several days too early.
You can also confirm this problem by comparing the date inside a docker container to outside:
{{< text bash >}}
$ docker run --entrypoint date gcr.io/istio-testing/ubuntu-16-04-slave:latest
Sun Jun 11 11:44:18 UTC 2017
{{< /text >}}
{{< text bash >}}
$ date -u
Thu Jun 15 02:25:42 UTC 2017
{{< /text >}}
To fix the problem, you'll need to shut down and then restart Docker before reinstalling Istio.
## Envoy won't connect to my HTTP/1.0 service
Envoy requires HTTP/1.1 or HTTP/2 traffic for upstream services. For example, when using [NGINX](https://www.nginx.com/) to serve traffic behind Envoy, you
will need to set the [proxy_http_version](https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_http_version) directive in your NGINX config to "1.1", since the NGINX default is 1.0.
Example config:
{{< text plain >}}
upstream http_backend {
    server 127.0.0.1:8080;
    keepalive 16;
}

server {
    ...

    location /http/ {
        proxy_pass http://http_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        ...
    }
}
{{< /text >}}
## No Grafana output when connecting from a local web client to Istio remotely hosted
Validate the client and server date and time match.
The time of the web client (e.g. Chrome) affects the output from Grafana. A simple solution
to this problem is to verify that a time synchronization service is running correctly within the
Kubernetes cluster and that the web client machine is also correctly using a time synchronization
service. Some common time synchronization systems are NTP and Chrony. This is especially
problematic in engineering labs with firewalls. In these scenarios, NTP may not be configured
properly to point at the lab-based NTP services.
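A quick way to compare the two clocks is to print the UTC time on the client machine and inside a pod in the cluster. The `app=grafana` label below is an assumption based on the default add-on installation:

{{< text bash >}}
$ date -u
$ kubectl -n istio-system exec $(kubectl -n istio-system get pod -l app=grafana -o jsonpath='{.items[0].metadata.name}') -- date -u
{{< /text >}}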
## Where are the metrics for my service?
The expected flow of metrics is:
1. Envoy reports attributes to Mixer in batches (asynchronously from requests).
1. Mixer translates the attributes into instances based on
operator-provided configuration.
1. The instances are handed to Mixer adapters for processing and backend storage.
1. The backend storage systems record metrics data.
The default installations of Mixer ship with a [Prometheus](https://prometheus.io/)
adapter, as well as configuration for generating a basic set of metric
values and sending them to the Prometheus adapter. The
[Prometheus add-on](/docs/tasks/telemetry/querying-metrics/#about-the-prometheus-add-on)
also supplies configuration for an instance of Prometheus to scrape
Mixer for metrics.
If you do not see the expected metrics in the Istio Dashboard and/or via
Prometheus queries, there may be an issue at any of the steps in the flow
listed above. Below is a set of instructions to troubleshoot each of
those steps.
### Verify Mixer is receiving Report calls
Mixer generates metrics for monitoring the behavior of Mixer itself.
Check these metrics.
1. Establish a connection to the Mixer self-monitoring endpoint.
In Kubernetes environments, execute the following command:
{{< text bash >}}
$ kubectl -n istio-system port-forward <mixer pod> 9093 &
{{< /text >}}
1. Verify successful report calls.
On the [Mixer self-monitoring endpoint](http://localhost:9093/metrics),
search for `grpc_server_handled_total`.
You should see something like:
{{< text plain >}}
grpc_server_handled_total{grpc_code="OK",grpc_method="Report",grpc_service="istio.mixer.v1.Mixer",grpc_type="unary"} 68
{{< /text >}}
If you do not see any data for `grpc_server_handled_total` with a
`grpc_method="Report"`, then Mixer is not being called by Envoy to report
telemetry. In this case, ensure that the services have been properly
integrated into the mesh (either via
[automatic](/docs/setup/kubernetes/sidecar-injection/#automatic-sidecar-injection)
or [manual](/docs/setup/kubernetes/sidecar-injection/#manual-sidecar-injection) sidecar injection).
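If you need to find the `<mixer pod>` name used in the port-forward command above, something like the following may help; the `istio=mixer,istio-mixer-type=telemetry` labels are an assumption based on the default installation:

{{< text bash >}}
$ kubectl -n istio-system get pod -l istio=mixer,istio-mixer-type=telemetry -o jsonpath='{.items[0].metadata.name}'
{{< /text >}}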
### Verify Mixer metrics configuration exists
1. Verify Mixer rules exist.
In Kubernetes environments, issue the following command:
{{< text bash >}}
$ kubectl get rules --all-namespaces
NAMESPACE NAME KIND
istio-system promhttp rule.v1alpha2.config.istio.io
istio-system promtcp rule.v1alpha2.config.istio.io
istio-system stdio rule.v1alpha2.config.istio.io
{{< /text >}}
If you do not see anything named `promhttp` or `promtcp`, then there is
no Mixer configuration for sending metric instances to a Prometheus adapter.
You will need to supply configuration for rules that connect Mixer metric
instances to a Prometheus handler (a combined sketch of this configuration appears at the end of this section).
<!-- todo replace ([example](https://github.com/istio/istio/blob/master/install/kubernetes/istio.yaml#L892)). -->
1. Verify Prometheus handler config exists.
In Kubernetes environments, issue the following command:
{{< text bash >}}
$ kubectl get prometheuses.config.istio.io --all-namespaces
NAMESPACE NAME KIND
istio-system handler prometheus.v1alpha2.config.istio.io
{{< /text >}}
If there are no prometheus handlers configured, you will need to reconfigure
Mixer with the appropriate handler configuration.
<!-- todo replace ([example](https://github.com/istio/istio/blob/master/install/kubernetes/istio.yaml#L819)) -->
1. Verify Mixer metric instances config exists.
In Kubernetes environments, issue the following command:
{{< text bash >}}
$ kubectl get metrics.config.istio.io --all-namespaces
NAMESPACE NAME KIND
istio-system requestcount metric.v1alpha2.config.istio.io
istio-system requestduration metric.v1alpha2.config.istio.io
istio-system requestsize metric.v1alpha2.config.istio.io
istio-system responsesize metric.v1alpha2.config.istio.io
istio-system stackdriverrequestcount metric.v1alpha2.config.istio.io
istio-system stackdriverrequestduration metric.v1alpha2.config.istio.io
istio-system stackdriverrequestsize metric.v1alpha2.config.istio.io
istio-system stackdriverresponsesize metric.v1alpha2.config.istio.io
istio-system tcpbytereceived metric.v1alpha2.config.istio.io
istio-system tcpbytesent metric.v1alpha2.config.istio.io
{{< /text >}}
If there are no metric instances configured, you will need to reconfigure
Mixer with the appropriate instance configuration.
<!-- todo replace ([example](https://github.com/istio/istio/blob/master/install/kubernetes/istio.yaml#L727)) -->
1. Verify Mixer configuration resolution is working for your service.
1. Establish a connection to the Mixer self-monitoring endpoint.
Setup a `port-forward` to the Mixer self-monitoring port as described in
[Verify Mixer is receiving Report calls](#verify-mixer-is-receiving-report-calls).
1. On the [Mixer self-monitoring port](http://localhost:9093/metrics), search
for `mixer_config_resolve_count`.
You should find something like:
{{< text plain >}}
mixer_config_resolve_count{error="false",target="details.default.svc.cluster.local"} 56
mixer_config_resolve_count{error="false",target="ingress.istio-system.svc.cluster.local"} 67
mixer_config_resolve_count{error="false",target="mongodb.default.svc.cluster.local"} 18
mixer_config_resolve_count{error="false",target="productpage.default.svc.cluster.local"} 59
mixer_config_resolve_count{error="false",target="ratings.default.svc.cluster.local"} 26
mixer_config_resolve_count{error="false",target="reviews.default.svc.cluster.local"} 54
{{< /text >}}
1. Validate that there are values for `mixer_config_resolve_count` where
`target="<your service>"` and `error="false"`.
If there are only instances where `error="true"` and `target="<your service>"`,
there is likely an issue with the Mixer configuration for your service. Log
information is needed to further debug.
In Kubernetes environments, retrieve the Mixer logs via:
{{< text bash >}}
$ kubectl -n istio-system logs <mixer pod> -c mixer
{{< /text >}}
Look for errors related to your configuration or your service in the
returned logs.
More on viewing Mixer configuration can be found [here](/help/faq/mixer/#mixer-self-monitoring).
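If any of the rule, handler, or instance configuration above is missing, the following minimal sketch shows how the three pieces fit together. It is modeled on the default installation; the names, dimensions, and labels are illustrative and should be adapted to your configuration:

{{< text yaml >}}
apiVersion: config.istio.io/v1alpha2
kind: metric
metadata:
  name: requestcount
  namespace: istio-system
spec:
  value: "1"
  dimensions:
    destination_service: destination.service | "unknown"
    response_code: response.code | 200
  monitored_resource_type: '"UNSPECIFIED"'
---
apiVersion: config.istio.io/v1alpha2
kind: prometheus
metadata:
  name: handler
  namespace: istio-system
spec:
  metrics:
  - name: request_count   # exported Prometheus metric name
    instance_name: requestcount.metric.istio-system
    kind: COUNTER
    label_names:
    - destination_service
    - response_code
---
apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: promhttp
  namespace: istio-system
spec:
  match: context.protocol == "http"
  actions:
  - handler: handler.prometheus
    instances:
    - requestcount.metric
{{< /text >}}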
### Verify Mixer is sending metric instances to the Prometheus adapter
1. Establish a connection to the Mixer self-monitoring endpoint.
Setup a `port-forward` to the Mixer self-monitoring port as described in
[Verify Mixer is receiving Report calls](#verify-mixer-is-receiving-report-calls).
1. On the [Mixer self-monitoring port](http://localhost:9093/metrics), search
for `mixer_adapter_dispatch_count`.
You should find something like:
{{< text plain >}}
mixer_adapter_dispatch_count{adapter="prometheus",error="false",handler="handler.prometheus.istio-system",meshFunction="metric",response_code="OK"} 114
mixer_adapter_dispatch_count{adapter="prometheus",error="true",handler="handler.prometheus.default",meshFunction="metric",response_code="INTERNAL"} 4
mixer_adapter_dispatch_count{adapter="stdio",error="false",handler="handler.stdio.istio-system",meshFunction="logentry",response_code="OK"} 104
{{< /text >}}
1. Validate that there are values for `mixer_adapter_dispatch_count` where
`adapter="prometheus"` and `error="false"`.
If there are no recorded dispatches to the Prometheus adapter, there
is likely a configuration issue. Please see
[Verify Mixer metrics configuration exists](#verify-mixer-metrics-configuration-exists).
If dispatches to the Prometheus adapter are reporting errors, check the
Mixer logs to determine the source of the error. Most likely, there is a
configuration issue for the handler listed in `mixer_adapter_dispatch_count`.
In Kubernetes environments, check the Mixer logs via:
{{< text bash >}}
$ kubectl -n istio-system logs <mixer pod> -c mixer
{{< /text >}}
Filter for lines including something like `Report 0 returned with: INTERNAL
(1 error occurred:` (with some surrounding context) to find more information
regarding Report dispatch failures.
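One way to filter the logs for these lines, with a few lines of surrounding context, is:

{{< text bash >}}
$ kubectl -n istio-system logs <mixer pod> -c mixer | grep -A 5 'Report 0 returned with: INTERNAL'
{{< /text >}}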
### Verify Prometheus configuration
1. Connect to the Prometheus UI and verify that it can successfully
scrape Mixer.
In Kubernetes environments, setup port-forwarding as follows:
{{< text bash >}}
$ kubectl -n istio-system port-forward $(kubectl -n istio-system get pod -l app=prometheus -o jsonpath='{.items[0].metadata.name}') 9090:9090 &
{{< /text >}}
1. Visit [http://localhost:9090/targets](http://localhost:9090/targets) and confirm that the target `istio-mesh` has a status of **UP**.
1. Visit [http://localhost:9090/config](http://localhost:9090/config) and confirm that an entry exists that looks like:
{{< text yaml >}}
- job_name: istio-mesh
  scrape_interval: 5s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: istio-system;istio-telemetry;prometheus
    replacement: $1
    action: keep
{{< /text >}}
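Once the target is up, you can also confirm that scraped data is present by querying the Prometheus HTTP API directly. The metric name `istio_request_count` below is an assumption based on the default metric configuration of this release; a non-empty `result` array indicates data is flowing:

{{< text bash >}}
$ curl -s 'http://localhost:9090/api/v1/query?query=istio_request_count'
{{< /text >}}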
## How can I debug issues with the service mesh?
### With istioctl
`istioctl` allows you to inspect the current xDS configuration of a given Envoy, either from its admin interface (locally) or from Pilot, using the `proxy-config` or `pc` command.
For example, to retrieve the configured clusters in an Envoy via the admin interface run the following command:
{{< text bash >}}
$ istioctl proxy-config endpoint <pod-name> clusters
{{< /text >}}
To retrieve endpoints for a given pod in the application namespace from Pilot run the following command:
{{< text bash >}}
$ istioctl proxy-config pilot -n application <pod-name> eds
{{< /text >}}
The `proxy-config` command also allows you to retrieve the state of the entire mesh from Pilot using the following command:
{{< text bash >}}
$ istioctl proxy-config pilot mesh ads
{{< /text >}}
### With GDB
To debug Istio with `gdb`, you will need to run the debug images of Envoy / Mixer / Pilot. A recent `gdb` and the golang extensions (for Mixer/Pilot or other golang components) is required.
1. `kubectl exec -it PODNAME -c [proxy | mixer | pilot]`
1. Find the process ID: `ps ax`
1. `gdb -p PID binary`
1. For go: `info goroutines`, `goroutine x bt`
### With Tcpdump
Tcpdump doesn't work in the sidecar pod because the container doesn't run as root. However, any other container in the same pod will see all the packets, since the network namespace is shared. `iptables` will also see the pod-wide configuration.
Communication between Envoy and the app happens on 127.0.0.1, and is not encrypted.
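For example, a capture of the unencrypted loopback traffic between Envoy and the application could be started from another container in the same pod, assuming that container's image ships `tcpdump` and that your application listens on port 9080 (both are illustrative assumptions):

{{< text bash >}}
$ kubectl exec -it <pod-name> -c <other-container> -- tcpdump -i lo -A port 9080
{{< /text >}}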
## Envoy is crashing under load
Check your `ulimit -a`. Many systems have a 1024 open file descriptor limit by default which will cause Envoy to assert and crash with:
{{< text plain >}}
[2017-05-17 03:00:52.735][14236][critical][assert] assert failure: fd_ != -1: external/envoy/source/common/network/connection_impl.cc:58
{{< /text >}}
Make sure to raise your ulimit. Example: `ulimit -n 16384`
## Headless TCP services losing connection from Istiofied containers
If `istio-citadel` is deployed, Envoy is restarted every 15 minutes to refresh certificates.
This causes the disconnection of TCP streams or long-running connections between services.
You should build resilience into your application for this type of
disconnect, but if you still want to prevent the disconnects from
happening, you will need to disable mutual TLS and the `istio-citadel` deployment.
First, edit your `istio` configmap to disable mutual TLS, then restart Pilot to pick up the change:
{{< text bash >}}
$ kubectl edit configmap -n istio-system istio
$ kubectl delete pods -n istio-system -l istio=pilot
{{< /text >}}
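For reference, this typically means switching the authentication policy off in the mesh configuration inside the `istio` configmap. The `authPolicy` field name is an assumption based on pre-1.0 mesh configuration and may differ in your release:

{{< text yaml >}}
# excerpt of the mesh configuration in the istio configmap (field name assumed)
authPolicy: NONE
{{< /text >}}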
Next, scale down the `istio-citadel` deployment to disable Envoy restarts.
{{< text bash >}}
$ kubectl scale --replicas=0 deploy/istio-citadel -n istio-system
{{< /text >}}
This should stop Istio from restarting Envoy and disconnecting TCP connections.
## Envoy Process High CPU Usage
For larger clusters, the default configuration that comes with Istio
refreshes the Envoy configuration every 1 second. This can cause high
CPU usage, even when Envoy isn't doing anything. In order to bring the
CPU usage down for larger deployments, increase the refresh interval for
Envoy to something higher, like 30 seconds.
{{< text bash >}}
$ kubectl edit configmap -n istio-system istio
$ kubectl delete pods -n istio-system -l istio=pilot
{{< /text >}}
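In the configmap, the relevant settings are the refresh delays. The field names below (`rdsRefreshDelay`, `discoveryRefreshDelay`) are assumptions based on pre-1.0 mesh and proxy configuration and may differ in your release:

{{< text yaml >}}
# excerpt of the mesh configuration in the istio configmap (field names assumed; values illustrative)
rdsRefreshDelay: 30s
defaultConfig:
  discoveryRefreshDelay: 30s
{{< /text >}}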
Also make sure to reinject the sidecar into all of your pods, as
their configuration needs to be updated as well.
Afterwards, you should see CPU usage fall back to 0-1% while idling.
Make sure to tune these values for your specific deployment.
**Warning:** Changes created by routing rules will take up to 2x the refresh interval to propagate to the sidecars.
While the larger refresh interval will reduce CPU usage, updates caused by routing rules may cause a period
of HTTP 404s (up to 2x the refresh interval) until the Envoy sidecars get all relevant configuration.
## Automatic sidecar injection will fail if the kube-apiserver has proxy settings
When the kube-apiserver includes proxy settings such as:
{{< text yaml >}}
env:
- name: http_proxy
  value: http://proxy-wsa.esl.foo.com:80
- name: https_proxy
  value: http://proxy-wsa.esl.foo.com:80
- name: no_proxy
  value: 127.0.0.1,localhost,dockerhub.foo.com,devhub-docker.foo.com,10.84.100.125,10.84.100.126,10.84.100.127
{{< /text >}}
sidecar injection will fail. The only related failure log can be found in the kube-apiserver log:
{{< text plain >}}
W0227 21:51:03.156818 1 admission.go:257] Failed calling webhook, failing open sidecar-injector.istio.io: failed calling admission webhook "sidecar-injector.istio.io": Post https://istio-sidecar-injector.istio-system.svc:443/inject: Service Unavailable
{{< /text >}}
Make sure both pod and service CIDRs are not proxied according to the `*_proxy` variables. Check the kube-apiserver files and logs to verify the configuration and whether any requests are being proxied.
A workaround is to remove the proxy settings from the kube-apiserver manifest and restart the server or use a later version of Kubernetes.
A related issue was filed with Kubernetes and has since been closed:
[https://github.com/kubernetes/kubeadm/issues/666](https://github.com/kubernetes/kubeadm/issues/666),
[https://github.com/kubernetes/kubernetes/pull/58698#discussion_r163879443](https://github.com/kubernetes/kubernetes/pull/58698#discussion_r163879443)
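If you cannot remove the proxy settings entirely, one workaround (on Kubernetes versions that support CIDR notation in `no_proxy`, as discussed in the pull request linked above) is to add the cluster's service and pod CIDRs to the `no_proxy` value in the kube-apiserver manifest. The CIDRs below are illustrative and must match your cluster:

{{< text yaml >}}
- name: no_proxy
  value: 127.0.0.1,localhost,10.96.0.0/12,10.244.0.0/16   # append your service and pod CIDRs
{{< /text >}}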
## What Envoy version is Istio using?
To find out the Envoy version used in your deployment, follow these steps:
1. `kubectl exec -it PODNAME -c istio-proxy -n NAMESPACE /bin/bash`
1. `curl localhost:15000/server_info`
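The two steps can also be combined into a single command, for example:

{{< text bash >}}
$ kubectl exec -it PODNAME -c istio-proxy -n NAMESPACE -- curl localhost:15000/server_info
{{< /text >}}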