- The listener is immediately closed on receipt of a shutdown signal.
- All in-progress server connections are now counted, and the process will
not shut down until the connection count has dropped to zero (see the
sketch below).
- In the case of HTTP1, idle connections are closed. In the case of HTTP2,
the HTTP2 graceful shutdown steps are followed, sending the appropriate
GOAWAYs.
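A minimal sketch of the connection-counting idea; the type and method names (`ConnTracker`, `start`, `finish`, `wait_for_drain`) are illustrative, and the actual accept loop and HTTP1/HTTP2 handling are not shown:
```rust
use std::sync::{Arc, Condvar, Mutex};

/// Counts in-progress connections so shutdown can wait for the count to
/// reach zero after the listener has been closed.
#[derive(Clone, Default)]
struct ConnTracker {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl ConnTracker {
    /// Called when a connection is accepted.
    fn start(&self) {
        *self.inner.0.lock().unwrap() += 1;
    }

    /// Called when a connection completes.
    fn finish(&self) {
        let mut count = self.inner.0.lock().unwrap();
        *count -= 1;
        if *count == 0 {
            self.inner.1.notify_all();
        }
    }

    /// Called after the shutdown signal closes the listener; blocks until
    /// every in-progress connection has finished.
    fn wait_for_drain(&self) {
        let mut count = self.inner.0.lock().unwrap();
        while *count > 0 {
            count = self.inner.1.wait(count).unwrap();
        }
    }
}
```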
* Switch public API to use cached k8s resources
* Move shared informer code to separate goroutine
* Fix spelling issue
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
Previously, when the proxy could tell by parsing the request-target that
it was not in the cluster, it would not override the destination. That is,
load balancing would be disabled for such destinations.
With this change, the proxy will do L7 load balancing for all HTTP
services as long as the request-target has a DNS name.
Signed-off-by: Brian Smith <brian@briansmith.org>
The success rate calculation relies on the `classification` label, but
the query incorrectly specified `fail` rather than `failure`.
Fix the public API to specify `failure`. Also re-org the public API tests for
easier Kubernetes and Prometheus mocking.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
No change in behavior is intended here.
Split poll_destination() into two parts, one that operates locally
on the DestinationSet, and the other that operates on data that isn't
wholly local to the DestinationSet. This makes the code easier to
understand. This is being done in preparation for adding DNS fallback
polling to poll_destination().
Signed-off-by: Brian Smith <brian@briansmith.org>
Only the destination service needs normalized names (and even then,
that's just temporary). The rest of the code needs the name as it was
given, except case-normalized (lowercased). Because DNS fallback isn't
implemented in service discovery yet, Outbound still has a temporary
workaround using FullyQualifiedName to keep things working; that will
be removed once DNS fallback is implemented in service discovery.
Signed-off-by: Brian Smith <brian@briansmith.org>
The StatSummary logic was implemented as a method on http_server.
Move the StatSummary logic into grpc_server, for consistency with the
other endpoints.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The Grafana dashboards key off of deployment, but had no awareness of
namespaces, causing incorrect metrics aggregation and display.
This change makes the Grafana dashboards key off of namespaces, and also
modifies the Grafana links in the Conduit dashboard to link to
namespace+deployment.
Fixes #704
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* CLI: change conduit namespace shorthand flag to -c
All of the conduit CLI subcommands accept a --conduit-namespace flag,
indicating the namespace where conduit is running. Some of the
subcommands also provide a --namespace flag, indicating the kubernetes
namespace where a user's application code is running. To prevent
confusion, I'm changing the shorthand flag for the conduit namespace to
-c, and using the -n shorthand when referring to user namespaces.
As part of this change I've also standardized the capitalization of all
of our command line flags, removed the -r shorthand for the install
--registry flag, and made the global --kubeconfig and --api-addr flags
apply to all subcommands.
* Switch flag descriptions from lowercase to Capital
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
This PR changes the proxy's `control::Cache` module from a set to a key-value map.
This change is made in order to use the values in the map to store metadata from the Destination API, but allow evictions and insertions to be based only on the `SocketAddr` of the destination entry. This will make code in PR #661 much simpler, by removing the need to wrap `SocketAddr`s in the cache in a `Labeled` struct for storing metadata, and the need for custom `Borrow` implementations on that type.
Furthermore, I've changed from using a standard library `HashSet`/`HashMap` as the underlying collection to using `IndexMap`, as we suspect that this will result in performance improvements.
Currently, as `master` has no additional metadata to associate with cache entries, the type of the values in the map is `()`. When #661 merges, the values will actually contain metadata.
If we suspect that there are many other use-cases for `control::Cache` where it will be treated as a set rather than a map, we may want to provide a separate set of impls for `Cache<T, ()>` (like `std::HashSet`) to make the API more ergonomic in this case.
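As a rough sketch of the shape described above (names and signatures are illustrative, not the proxy's actual `control::Cache` API):
```rust
use indexmap::IndexMap;
use std::hash::Hash;

/// Key-value cache; in the proxy's case the key is a `SocketAddr` and the
/// value is `()` until destination metadata is attached.
struct Cache<K: Eq + Hash, V> {
    inner: IndexMap<K, V>,
}

impl<K: Eq + Hash, V> Cache<K, V> {
    fn new() -> Self {
        Cache { inner: IndexMap::new() }
    }

    /// Insertions are keyed on `K` alone; any existing value is replaced.
    fn insert(&mut self, key: K, value: V) -> Option<V> {
        self.inner.insert(key, value)
    }

    /// Evictions need only the key, so no wrapper type or custom `Borrow`
    /// implementation is required to ignore attached metadata.
    fn evict(&mut self, key: &K) -> Option<V> {
        self.inner.swap_remove(key)
    }

    fn contains(&self, key: &K) -> bool {
        self.inner.contains_key(key)
    }
}

// Used as a set today: Cache<std::net::SocketAddr, ()>; #661 swaps () for metadata.
```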
The public-api in the docker-compose environment is not configured to
talk to Prometheus or Kubernetes, which is now required with the new
telemetry pipeline.
Modify the public-api config in docker-compose to connect to k8s and
prom.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The new statsummary command accepted friendly k8s names, which worked
for k8s queries, but Prometheus requires a specific key.
Modify the statsummary query to map friendly k8s names to canonical k8s
names when constructing the query. Then during the query, map the
canonical k8s name to a specific Prometheus label.
Fixes #695
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* Link to Grafana from Conduit Dashboard
Previously the only way to access the Grafana dashboards was via direct
link, provided by the `conduit dashboard` command.
Add Grafana links throughout the Conduit Dashboard, next to all
Deployment objects. This change also modifies the behavior of the
ConduitLink helper, to enable linking to other deployments proxied by
the `conduit dashboard` command.
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* review feedback
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* review feedback, fix console, remove absolute
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
This PR adds the pretty-printing for durations I added in #676 to the panic message from the `assert_eventually!` macro added in #669.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Start implementing new conduit stat summary endpoint.
Changes the public-api to call prometheus directly instead of the
telemetry service. Wired through to `api/stat` on the web server,
as well as `conduit statsummary` on the CLI. Works for deployments only.
Current implementation just retrieves requests and mesh/total pod count
(so latency stats are always 0).
Uses API defined in #663
Example queries that the stat endpoint will eventually satisfy are listed in #627
This branch includes commits from @klingerf
* run ./bin/dep ensure
* run ./bin/update-go-deps-shas
No change in behavior is intended here.
Split poll_destination() into two parts, one that operates locally
on the DestinationSet, and the other that operates on data that isn't
wholly local to the DestinationSet. This makes the code easier to
understand. This is being done in preparation for adding DNS fallback
polling to poll_destination().
Signed-off-by: Brian Smith <brian@briansmith.org>
Proxy: Refactor DNS name parsing and normalization
Only the destination service needs normalized names (and even then,
that's just temporary). The rest of the code needs the name as it was
given, except case-normalized (lowercased). Because DNS fallback isn't
implemented in service discovery yet, Outbound still has a temporary
workaround using FullyQualifiedName to keep things working; that will
be removed once DNS fallback is implemented in service discovery.
Signed-off-by: Brian Smith <brian@briansmith.org>
The Destination service used slightly different labels than the
telemetry pipeline expected; specifically, they were prefixed with `k8s_*`.
Make all Prometheus labels consistent by dropping `k8s_*`. Also rename
`pod_name` to `pod` for consistency with `deployment`, etc. Also update
and reorganize `proxy-metrics.md` to reflect new labelling.
Fixes #655
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The master CI job executes a `docker-pull master` prior to building, to
bootstrap the Docker image cache. This command fails if the PR being
merged to master introduces a new Docker image, for example:
https://travis-ci.org/runconduit/conduit/jobs/362841328
This changes the master CI job to handle a `docker-pull master` failure
gracefully.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Using a vanilla Grafana Docker image as part of `conduit install`
avoided maintaining a conduit-specific Grafana Docker image, but made
packaging dashboard json files cumbersome.
Roll our own Grafana Docker image that includes conduit-specific
dashboard json files. This significantly decreases the `conduit install`
output size, and enables dashboard integration in the docker-compose
environment.
Fixes #567
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
This branch adds simple pretty-printing for durations in log timeout messages. If the duration is >= 1 second, it is printed in seconds with a fractional part. If it is less than 1 second, it is printed in milliseconds. This simple formatting may not be sufficient as a rule for all cases, but should be enough for printing our relatively small timeouts.
Log messages now look something like this:
```
ERROR 2018-04-04T20:05:49Z: conduit_proxy: turning operation timed out after 100 ms into 500
```
Previously, they looked like this:
```
ERROR 2018-04-04T20:07:26Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
```
I made this change partially because I wanted to make the panics from the `eventually!` macro added in #669 more readable.
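For illustration, a minimal version of that formatting rule (the helper name is made up, and the real code may differ):
```rust
use std::time::Duration;

/// Formats durations >= 1 second in seconds with a fractional part, and
/// shorter durations in whole milliseconds.
fn humanize(d: Duration) -> String {
    if d >= Duration::from_secs(1) {
        let secs = d.as_secs() as f64 + f64::from(d.subsec_nanos()) * 1e-9;
        format!("{} s", secs)
    } else {
        format!("{} ms", d.subsec_nanos() / 1_000_000)
    }
}

// humanize(Duration::from_millis(100))  => "100 ms"
// humanize(Duration::from_millis(2500)) => "2.5 s"
```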
The proxy's `telemetry/metrics/prometheus.rs` file was getting long and hard to navigate. I split the Prometheus labels code out into a separate submodule and made `RequestLabels` and `ResponseLabels` public. This seems like a reasonable division of the code, and the resultant files are much easier to read.
The proxy's control::discovery module is becoming a bit dense in terms
of what it implements.
In order to make this code more understandable, and to be able to use a
similar caching strategy in other parts of the controller, the
`control::cache` module now holds discovery's cache implementation.
This module is only visible within the `control` module, and it now
exposes two new public methods: `values()` and
`set_reset_on_next_modification()`.
* Define a new telemetry Stat API
Proposed definition of a new Stat API, for the purpose of satisfying the queries proposed in #627.
StatSummary will replace Stat once implemented, at which point the original Stat will be deleted.
* fix pod status and count display in control plane dashboard section:
- the control plane section would show terminated and stale deployments in the UI, which is confusing and might suggest errors
- this filters terminated and failed component deploys out of the UI
- note that pending deploys will still be counted and represented with a greyed-out status dot
- Fixes: #606
Signed-off-by: Franziska von der Goltz <franziska@vdgoltz.eu>
* Extracted logic from destination server
* Make tests follow style used elsewhere in the code
* Extract single interface for resolvers
* Add tests for k8s and ipv4 resolvers
* Fix small usability issues
* Update dep
* Act on feedback
* Add pod-based metric_labels to destinations response
* Add documentation on running control plane to BUILD.md
Signed-off-by: Phil Calcado <phil@buoyant.io>
* Fix mock controller in proxy tests (#656)
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
* Address review feedback
* Rename files in the destination package
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
The Grafana dashboards were displaying all proxy-enabled pods, including
conduit controller pods. In the old telemetry pipeline, filtering these
out required knowledge of the controller's namespace, which the
dashboards are agnostic to.
This change leverages the new `conduit_io_control_plane_component`
prometheus label to filter out proxy-enabled controller components.
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
- Adds environment variables to configure a set of ports; when an
incoming connection's SO_ORIGINAL_DST port matches one of them, protocol
detection is disabled for that connection and the connection is
immediately proxied as plain TCP (see the sketch below).
- Adds a default list of well-known ports: SMTP and MySQL.
Closes #339
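A rough sketch of how such a port list might be parsed and consulted; the environment variable name and the exact default ports here are placeholders, not the proxy's actual configuration:
```rust
use std::collections::HashSet;
use std::env;

/// Parses a comma-separated list of ports from an environment variable,
/// falling back to well-known defaults (SMTP 25, MySQL 3306).
fn ports_disabling_protocol_detection() -> HashSet<u16> {
    env::var("EXAMPLE_DISABLE_PROTOCOL_DETECTION_PORTS")
        .unwrap_or_else(|_| "25,3306".to_string())
        .split(',')
        .filter_map(|p| p.trim().parse().ok())
        .collect()
}

/// If the connection's SO_ORIGINAL_DST port is in the set, the proxy skips
/// protocol detection and forwards the connection as plain TCP.
fn should_skip_detection(orig_dst_port: u16, ports: &HashSet<u16>) -> bool {
    ports.contains(&orig_dst_port)
}
```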
simulate-proxy uses a deployment object from kubernetes to simulate
each proxy metrics endpoint.
Modify simulate-proxy to instead use a pod to simulate each proxy
metrics endpoint. This ensures that each metrics endpoint consistently
represents a pod in kubernetes, including its namespace, deployment,
and label information.
This change also adds support for:
- a new `metric-ports` flag, default is `10000-10009`.
- `classification`, `pod_name`, and `pod_template_hash` labels
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* Extracted logic from destination server
* Make tests follow style used elsewhere in the code
* Extract single interface for resolvers
* Add tests for k8s and ipv4 resolvers
* Fix small usability issues
* Update dep
* Act on feedback
Signed-off-by: Phil Calcado <phil@buoyant.io>
Previously, when the proxy was disconnected from the Destination
service and then reconnected, the proxy would not forget old, outdated
entries in its cache of endpoints. If those endpoints had been removed
while the proxy was disconnected then the proxy would never become
aware of that.
Instead, on the first message after a reconnection, replace the entire
set of cached entries with the new set, which may be empty.
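A sketch of the reset-on-reconnect idea; the field and method names are illustrative, not the proxy's actual cache API:
```rust
use indexmap::IndexMap;
use std::net::SocketAddr;

struct Endpoints {
    entries: IndexMap<SocketAddr, ()>,
    reset_on_next_modification: bool,
}

impl Endpoints {
    /// Called when the Destination service stream reconnects.
    fn on_reconnect(&mut self) {
        self.reset_on_next_modification = true;
    }

    /// Applies the first update after a reconnection by replacing the whole
    /// set of cached entries, so endpoints removed while disconnected are
    /// dropped; later updates are applied incrementally as before.
    fn update(&mut self, new_entries: IndexMap<SocketAddr, ()>) {
        if self.reset_on_next_modification {
            self.entries.clear();
            self.reset_on_next_modification = false;
        }
        self.entries.extend(new_entries);
    }
}
```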
Prior to this change, the new test
outbound_destinations_reset_on_reconnect_followed_by_no_endpoints_exists
already passed,
but outbound_destinations_reset_on_reconnect_followed_by_add_none
and outbound_destinations_reset_on_reconnect_followed_by_remove_none
failed. Now all of these tests pass.
Fixes #573
Signed-off-by: Brian Smith <brian@briansmith.org>
The Top-line and Deployment Grafana dashboards relied on the
soon-to-be-removed telemetry pipeline metrics.
Update the Grafana dashboards to query for the new, proxy-based metrics.
Grafana dashboard layouts have not changed.
Depends on #635 to render metrics.
Part of #420.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* Proxy: Factor out Destination service connection logic
Centralize the connection initiation logic for the Destination service
to make it easier to maintain. Clarify that the `rx` field isn't needed
prior to a (re)connect.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Rename `rx` to `query`.
Signed-off-by: Brian Smith <brian@briansmith.org>
* "recoonect" -> "reconnect"
Signed-off-by: Brian Smith <brian@briansmith.org>
Previously we were using the instance label to uniquely identify a pod.
This meant that getting stats by pod name would require extra queries to
Kubernetes to map pod name to instance.
This change adds a pod_name label to metrics at collection time. This
should not affect cardinality as pod_name is invariant with respect to
instance.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
This is a known issue with Grafana in k8s. grafana/grafana:5.0.4 was just released today; update the repo from 5.0.3 to 5.0.4.
Fixes #582
Signed-off-by: Deshi Xiao <xiaods@gmail.com>
Currently, the CLI docker image copies the entire `controller`
directory, though the CLI only requires a few of its subdirectories.
This causes the CLI's docker cache to be needlessly invalidated when,
for instance, a service implementation changes.
By restricting the copied directories to `controller/{api,public,util}`,
build caching is improved.
remove toggle sorting functionality from TableComponent:
- tables displaying metrics allowed toggling between being sorted and unsorted when clicking the same button, which was confusing behavior for the user
- this PR removes the toggle functionality and introduces a BaseTable component that extends antd's component, without the capability to toggle
- Fixes: #566
Signed-off-by: Franziska von der Goltz <franziska@vdgoltz.eu>
This PR adds a `classification` label to proxy response metrics, as @olix0r described in https://github.com/runconduit/conduit/issues/634#issuecomment-376964083. The label is either "success" or "failure", depending on the following rules:
+ **if** the response had a gRPC status code, *then*
- gRPC status code 0 is considered a success
- all others are considered failures
+ **else if** the response had an HTTP status code, *then*
- status codes < 500 are considered successes,
- status codes >= 500 are considered failures
+ **else if** the response stream failed, *then*
- the response is a failure.
I've also added end-to-end tests for the classification of HTTP responses (with some work towards classifying gRPC responses as well). Additionally, I've updated `doc/proxy_metrics.md` to reflect the added `classification` label.
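A sketch of these classification rules (the types and helper here are illustrative, not the proxy's actual implementation):
```rust
/// Hypothetical summary of a completed response, for illustration only.
struct ResponseEnd {
    grpc_status: Option<u32>,
    http_status: Option<u16>,
    stream_failed: bool,
}

/// Returns the value of the `classification` label according to the rules
/// described above.
fn classification(end: &ResponseEnd) -> &'static str {
    match (end.grpc_status, end.http_status) {
        // gRPC status code 0 (OK) is a success; any other gRPC status is a failure.
        (Some(grpc), _) => if grpc == 0 { "success" } else { "failure" },
        // HTTP status codes below 500 are successes; 500 and above are failures.
        (None, Some(http)) => if http < 500 { "success" } else { "failure" },
        // No status at all: a failed stream is a failure (treating the
        // remaining case as a success is an assumption, not in the rules).
        (None, None) => if end.stream_failed { "failure" } else { "success" },
    }
}
```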
Signed-off-by: Eliza Weisman <eliza@buoyant.io>