Added preliminary documentation for the zPages contrib module (#1455)
|
@ -0,0 +1,168 @@
|
|||
# OpenTelemetry SDK Contrib - /tracez and /traceconfigz Design Doc
|
||||
|
||||
This file contains information about the design choices for the OpenTelemetry /tracez and
|
||||
/traceconfigz zPages.
|
||||
|
||||
## Introduction
|
||||
|
||||
The OpenTelemetry zPages are a set of dynamically generated HTML pages that display trace and
metrics data for a running process. The /tracez zPage displays trace information about running
spans, sample span latencies, and sample error spans, while the /traceconfigz zPage is a web page
that lets users change tracing parameters, such as the sampling probability and the maximum number
of attributes.
|
||||
|
||||
### /tracez zPage
|
||||
|
||||
The /tracez zPage displays information on running spans, sample span latencies, and sample error
|
||||
spans. The data is aggregated into a summary-level table:
|
||||
|
||||

|
||||
|
||||
You can click on each of the counts in the table cells to access the corresponding span
|
||||
details. For example, here are the details of the `ChildSpan` latency sample (row 1, col 4):
|
||||
|
||||

|
||||
|
||||
### /traceconfigz zPage
|
||||
|
||||
The /traceconfigz zPage displays information about the currently active tracing configuration and
|
||||
provides an interface for users to modify relevant parameters:
|
||||
|
||||

|
||||
|
||||
## Motivation
|
||||
|
||||
We are building the Java zPages in order to create a lightweight application performance monitoring
|
||||
tool that allows users to troubleshoot OpenTelemetry instrumentation.
|
||||
|
||||
## Design
|
||||
|
||||
### Frontend
|
||||
|
||||
#### HttpHandler
|
||||
|
||||
The `HttpHandler` is responsible for rendering the HTML content of each page. An abstract base
class, `ZPageHandler` (following the OpenCensus implementation), standardizes the handlers for the
different zPages. Each page implements its own handler by extending `ZPageHandler` to generate the
corresponding HTML content for that page.
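
The base-class pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual `ZPageHandler` code: the class names `SketchZPageHandler` and `SketchHelloHandler`, the `getUrlPath` method, the `/helloz` path, and the `emitHtml` parameter list are all assumptions made for the example. Only the existence of an `emitHtml` function is stated in this document.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

// Hypothetical base class: every zPage supplies a URL path and renders its own HTML.
abstract class SketchZPageHandler {
  /** URL path this page is served on (assumed method, for illustration only). */
  abstract String getUrlPath();

  /** Writes the page's HTML to the given stream (parameter list is an assumption). */
  abstract void emitHtml(Map<String, String> queryParams, OutputStream out);
}

// Hypothetical concrete page that extends the base class.
class SketchHelloHandler extends SketchZPageHandler {
  @Override
  String getUrlPath() {
    return "/helloz";
  }

  @Override
  void emitHtml(Map<String, String> queryParams, OutputStream out) {
    // PrintWriter does not throw checked exceptions, which keeps the sketch short.
    PrintWriter html = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    html.print("<html><body><h1>helloz</h1>");
    html.print("<p>query parameters: " + queryParams.size() + "</p>");
    html.print("</body></html>");
    html.flush();
  }
}

class SketchHandlerDemo {
  public static void main(String[] args) {
    SketchZPageHandler handler = new SketchHelloHandler();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    handler.emitHtml(Collections.<String, String>emptyMap(), out);
    System.out.println(handler.getUrlPath() + " -> " + out.size() + " bytes of HTML");
  }
}
```

Keeping the rendering behind one abstract method means the server code only needs to know about the base type, whichever page it is serving.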
|
||||
|
||||

|
||||
|
||||
##### TraceZ Handler
|
||||
|
||||
For the `TracezZPageHandler` class, the span data from `TracezDataAggregator` will be passed in when
|
||||
an instance of the class is created. It will later be used to retrieve span information and display
|
||||
the data in a table.
|
||||
|
||||
##### TraceConfigZ Handler
|
||||
|
||||
For the `TraceConfigzZPageHandler` class, the `TraceConfig` class will be used to change sampling
|
||||
rate and tracing parameters.
|
||||
|
||||
#### HttpServer
|
||||
The `HttpServer` is responsible for listening for incoming requests, obtaining the requested data,
and rendering the corresponding HTML content. The `HttpServer` class from `com.sun.net.httpserver`
will be used to handle HTTP requests and responses on different routes. Once the `HttpServer`
receives a request, it invokes the handler's `handle` function, which in turn invokes the `emitHtml`
function to render the HTML content.

Users need to ensure that they are using a version of the JDK that includes the
`com.sun.net.httpserver` package; this requirement will be added to the README.
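
To make the `handle` to `emitHtml` flow concrete, here is a small sketch built on the JDK's `com.sun.net.httpserver` package. The `SketchPageAdapter` and `SketchServerDemo` classes, the `/sketchz` route, and the `fetch` helper are hypothetical names for this example and are not part of the zPages module. The way the real handlers buffer or stream their output may differ.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical adapter: the server calls handle(), which delegates to emitHtml().
class SketchPageAdapter implements HttpHandler {
  // Stand-in for a zPage's rendering logic.
  void emitHtml(OutputStream out) throws IOException {
    out.write("<html><body><h1>sketch page</h1></body></html>"
        .getBytes(StandardCharsets.UTF_8));
  }

  @Override
  public void handle(HttpExchange exchange) throws IOException {
    // Render into a buffer first so the response length is known before sending headers.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    emitHtml(buffer);
    exchange.getResponseHeaders().set("Content-Type", "text/html; charset=UTF-8");
    exchange.sendResponseHeaders(200, buffer.size());
    try (OutputStream body = exchange.getResponseBody()) {
      buffer.writeTo(body);
    }
  }
}

class SketchServerDemo {
  // Starts a server on an ephemeral loopback port with one route registered.
  static HttpServer start() {
    try {
      HttpServer server =
          HttpServer.create(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), 0);
      server.createContext("/sketchz", new SketchPageAdapter());
      server.start();
      return server;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Fetches a path from the sketch server and returns the response body as text.
  static String fetch(HttpServer server, String path) {
    try {
      InetSocketAddress address = server.getAddress();
      URL url = new URL("http", address.getHostString(), address.getPort(), path);
      HttpURLConnection connection = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
      try (InputStream in = connection.getInputStream()) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int read;
        while ((read = in.read(chunk)) != -1) {
          body.write(chunk, 0, read);
        }
        return new String(body.toByteArray(), StandardCharsets.UTF_8);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    HttpServer server = start();
    try {
      System.out.println(fetch(server, "/sketchz"));
    } finally {
      server.stop(0);
    }
  }
}
```

Each route is registered with `createContext`, so serving another page on the same server only requires one more context with its own handler.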
|
||||
|
||||
##### Handling Requests
|
||||
|
||||

|
||||
|
||||
### Backend
|
||||
|
||||
#### Overview
|
||||
|
||||
The proposed structure is encompassed by the following diagram:
|
||||
|
||||

|
||||
|
||||
Spans, which are the units of tracing, are monitored by a `SpanProcessor`. The `SpanProcessor` is
|
||||
exposed to a `DataAggregator`, which can retrieve information about the spans through API calls.
|
||||
Lastly, the frontend calls functions in the `DataAggregator` to obtain information needed for the
|
||||
web page.
|
||||
|
||||
#### SpanProcessor
|
||||
|
||||
A `SpanProcessor` watches the lifecycle of each span, and its functions are called every time a span
|
||||
starts or ends. For the /tracez zPage, we have implemented a `TracezSpanProcessor`, which will
|
||||
maintain two data structures: a running span cache and a completed span cache.
|
||||
|
||||
##### Visual Diagram
|
||||
|
||||
Below is a visual diagram of the `TracezSpanProcessor` class:
|
||||
|
||||

|
||||
|
||||
When a span starts, it is first added to the `runningSpanCache`. Once that span ends, it is removed
|
||||
from the `runningSpanCache` and added to the `completedSpanCache` as either a latency sample if
|
||||
there are no errors or an error sample if there are.
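
The span lifecycle described above can be sketched as follows. `SketchSpan` and `SketchSpanProcessor` are hypothetical stand-ins, not the SDK's span type or the real `TracezSpanProcessor`. For brevity the error flag is fixed when the span is created, and the completed cache is modeled as two plain lists with no size limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an SDK span; only the fields this sketch needs.
final class SketchSpan {
  final String spanId;
  final String name;
  final boolean hasError;

  SketchSpan(String spanId, String name, boolean hasError) {
    this.spanId = spanId;
    this.name = name;
    this.hasError = hasError;
  }
}

// Hypothetical processor showing how a span moves between the two caches.
final class SketchSpanProcessor {
  private final Map<String, SketchSpan> runningSpanCache = new HashMap<>();
  // Completed spans, split into latency samples and error samples.
  private final List<SketchSpan> latencySamples = new ArrayList<>();
  private final List<SketchSpan> errorSamples = new ArrayList<>();

  // Called when a span starts: track it as running.
  synchronized void onStart(SketchSpan span) {
    runningSpanCache.put(span.spanId, span);
  }

  // Called when a span ends: move it from the running cache to the completed cache.
  synchronized void onEnd(SketchSpan span) {
    runningSpanCache.remove(span.spanId);
    if (span.hasError) {
      errorSamples.add(span);
    } else {
      latencySamples.add(span);
    }
  }

  synchronized int runningCount() {
    return runningSpanCache.size();
  }

  synchronized int latencySampleCount() {
    return latencySamples.size();
  }

  synchronized int errorSampleCount() {
    return errorSamples.size();
  }
}

class SketchProcessorDemo {
  public static void main(String[] args) {
    SketchSpanProcessor processor = new SketchSpanProcessor();
    SketchSpan span = new SketchSpan("1", "ChildSpan", false);
    processor.onStart(span);
    System.out.println("running after start: " + processor.runningCount());
    processor.onEnd(span);
    System.out.println("running after end: " + processor.runningCount());
  }
}
```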
|
||||
|
||||
##### Initial Design
|
||||
|
||||
At first, we planned to have both the running span and completed span caches map span IDs to spans.
|
||||
The problem with this setup was that the number of completed spans would grow without bound. Note
|
||||
that the number of running spans is naturally limited by the SDK, so the size of the running span
|
||||
cache will never grow too large at any given time. However, the same cannot be said of the completed
|
||||
span cache. In order to limit the number of completed spans, we had to consider an alternative data
|
||||
structure that could impose the necessary limits.
|
||||
|
||||
##### Proposed Design
|
||||
|
||||
To constrain the number of completed spans, we built a new class called `TracezSpanBuckets` and
|
||||
reworked the completed span cache to map span names to `TracezSpanBuckets` instances. The
|
||||
`TracezSpanBuckets` class uses FIFO evicting queues to limit the number of latency samples per
|
||||
bucket to 16 and the number of error samples per bucket to 8. With a hard limit, this setup ensures
|
||||
that the completed span cache does not grow too large too quickly. For reference, OpenCensus
|
||||
restricted the number of latency samples per bucket to 10 and the number of error samples per bucket
|
||||
to 5.
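
As an illustration of the FIFO eviction idea, the sketch below bounds one latency bucket at 16 samples and one error bucket at 8. `SketchEvictingQueue` and `SketchSpanBucket` are hypothetical classes written for this example. They are not the actual `TracezSpanBuckets` implementation, which may rely on a library-provided evicting queue and a different bucket layout.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical bounded FIFO queue: once full, adding an item evicts the oldest one.
final class SketchEvictingQueue<T> {
  private final int capacity;
  private final ArrayDeque<T> items = new ArrayDeque<>();

  SketchEvictingQueue(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  synchronized void add(T item) {
    if (items.size() == capacity) {
      items.removeFirst(); // evict the oldest sample
    }
    items.addLast(item);
  }

  // Returns a copy of the current contents, oldest first.
  synchronized List<T> snapshot() {
    return new ArrayList<>(items);
  }
}

// Hypothetical bucket using the sample limits stated above (16 latency, 8 error).
final class SketchSpanBucket {
  static final int MAX_LATENCY_SAMPLES = 16;
  static final int MAX_ERROR_SAMPLES = 8;

  private final SketchEvictingQueue<String> latencyQueue =
      new SketchEvictingQueue<>(MAX_LATENCY_SAMPLES);
  private final SketchEvictingQueue<String> errorQueue =
      new SketchEvictingQueue<>(MAX_ERROR_SAMPLES);

  void addLatencySample(String spanId) {
    latencyQueue.add(spanId);
  }

  void addErrorSample(String spanId) {
    errorQueue.add(spanId);
  }

  List<String> latencySamples() {
    return latencyQueue.snapshot();
  }

  List<String> errorSamples() {
    return errorQueue.snapshot();
  }
}

class SketchBucketDemo {
  public static void main(String[] args) {
    SketchSpanBucket bucket = new SketchSpanBucket();
    for (int i = 0; i < 20; i++) {
      bucket.addLatencySample("s" + i);
    }
    // Only the most recent samples are retained once the limit is reached.
    System.out.println("latency samples kept: " + bucket.latencySamples().size());
  }
}
```

Because each queue has a fixed capacity, memory use per bucket stays constant no matter how many spans complete.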
|
||||
|
||||
#### DataAggregator
|
||||
|
||||
The `DataAggregator` restructures the data from the `SpanProcessor` into an accessible format for
|
||||
the frontend to display. For this, we have implemented a class called `TracezDataAggregator`. This
|
||||
class is constructed with a `TracezSpanProcessor` instance, so that the `TracezDataAggregator` class
|
||||
can access the spans collected by a specific `TracezSpanProcessor`.
|
||||
|
||||
##### Proposed Design
|
||||
|
||||
The `TracezDataAggregator`'s purpose is to simplify the frontend's job of displaying information.
|
||||
Consequently, the class supports functions for retrieving span names and span counts, along with the
|
||||
spans themselves. The frontend can then directly use the data collections that are returned.
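
The kind of interface this implies might look like the sketch below. All names here (`AggSketchSpan`, `SketchDataAggregator`, and its three methods) are hypothetical and chosen for illustration. The sketch also works from a fixed list of spans, whereas the class described in this document reads from a `TracezSpanProcessor`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for a span as exposed by the span processor.
final class AggSketchSpan {
  final String name;
  final boolean running;

  AggSketchSpan(String name, boolean running) {
    this.name = name;
    this.running = running;
  }
}

// Hypothetical aggregator: reshapes raw spans into collections a frontend can render directly.
final class SketchDataAggregator {
  private final List<AggSketchSpan> spans;

  SketchDataAggregator(List<AggSketchSpan> spans) {
    this.spans = new ArrayList<>(spans);
  }

  // All distinct span names, sorted for stable display.
  Set<String> getSpanNames() {
    Set<String> names = new TreeSet<>();
    for (AggSketchSpan span : spans) {
      names.add(span.name);
    }
    return names;
  }

  // Number of running spans per span name; names with none map to 0.
  Map<String, Integer> getRunningSpanCounts() {
    Map<String, Integer> counts = new TreeMap<>();
    for (AggSketchSpan span : spans) {
      Integer current = counts.get(span.name);
      int base = current == null ? 0 : current;
      counts.put(span.name, span.running ? base + 1 : base);
    }
    return counts;
  }

  // The running spans themselves for one span name.
  List<AggSketchSpan> getRunningSpans(String name) {
    List<AggSketchSpan> result = new ArrayList<>();
    for (AggSketchSpan span : spans) {
      if (span.running && span.name.equals(name)) {
        result.add(span);
      }
    }
    return result;
  }
}

class SketchAggregatorDemo {
  public static void main(String[] args) {
    SketchDataAggregator aggregator = new SketchDataAggregator(Arrays.asList(
        new AggSketchSpan("ParentSpan", true),
        new AggSketchSpan("ChildSpan", false)));
    System.out.println(aggregator.getSpanNames());
    System.out.println(aggregator.getRunningSpanCounts());
  }
}
```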
|
||||
|
||||
##### Accessing the TracezSpanProcessor
|
||||
|
||||
When a user instruments their code, they first create a `SpanProcessor` and then add that instance
|
||||
to a `TracerSdkProvider` with `addSpanProcessor`. An example from the official quickstart docs is
|
||||
shown below:
|
||||
|
||||

|
||||
|
||||
While implementing the `TracezDataAggregator`, we faced the issue of how the backend was supposed to
|
||||
access a `TracezSpanProcessor` created by the user. At the moment, there is no way for developers to
|
||||
access the span processor instances that are added to the `TracerSdkProvider`. Consequently, we
|
||||
propose that the `HttpServer` class add the `TracezSpanProcessor` itself, rather than requiring the
|
||||
user to create and add the instance. This can be done by replicating the code in the above example:
|
||||
use the `getTracerProvider` function in the `OpenTelemetrySdk` class and then call `addSpanProcessor`
on the returned object. Note that `getTracerProvider` in the `OpenTelemetrySdk` class calls
|
||||
`getTracerProvider` in the `OpenTelemetry` class, which returns a Singleton instance. This means
|
||||
that the `OpenTelemetrySdk` class should return a Singleton instance as well, so the backend will
|
||||
get the same `TracerSdkProvider` that the user uses.
|
||||
|
||||
#### TraceConfigZ
|
||||
|
||||
The final component of this project is the /traceconfigz zPage, which allows the user to update the
|
||||
config for the /tracez zPage in real-time. In OpenTelemetry, updates are already handled by the
|
||||
`TracerSdkProvider` class, so we only needed to wire them up and write the corresponding HTML. Since
|
||||
most of the infrastructure was already built, there was no real design aspect to this.
|
||||
|
||||
### Sequence Diagram
|
||||
|
||||
Below is a sequence diagram of how the frontend and backend components will communicate:
|
||||
|
||||

|
|
@ -2,8 +2,100 @@
|
|||
|
||||
[![Javadocs][javadoc-image]][javadoc-url]
|
||||
|
||||
This module contains code for OpenTelemetry's Java zPages, which are a collection of dynamic HTML
|
||||
web pages that display stats and trace data.
|
||||
|
||||
* Java 7 compatible.
|
||||
|
||||
<!--- TODO: Update javadoc -->
|
||||
[javadoc-image]: https://www.javadoc.io/badge/io.opentelemetry/opentelemetry-sdk-contrib-auto-config.svg
|
||||
[javadoc-url]: https://www.javadoc.io/doc/io.opentelemetry/opentelemetry-sdk-contrib-auto-config
|
||||
|
||||
## Quickstart
|
||||
|
||||
### Add the dependencies to your project
|
||||
|
||||
For Maven, add the following to your `pom.xml`:
|
||||
```xml
<dependencies>
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
    <version>0.7.0</version>
  </dependency>
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-sdk</artifactId>
    <version>0.7.0</version>
  </dependency>
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-sdk-extension-zpages</artifactId>
    <version>0.7.0</version>
  </dependency>
</dependencies>
```
|
||||
|
||||
<!--- TODO: Verify gradle configuration -->
|
||||
For Gradle, add the following to your dependencies:
|
||||
```groovy
implementation 'io.opentelemetry:opentelemetry-api:0.7.0'
implementation 'io.opentelemetry:opentelemetry-sdk:0.7.0'
implementation 'io.opentelemetry:opentelemetry-sdk-extension-zpages:0.7.0'
```
|
||||
|
||||
### Register the zPages
|
||||
|
||||
**Note:** The package `com.sun.net.httpserver` is required to use the default zPages setup. Please make sure your
|
||||
version of the JDK includes this package.
|
||||
|
||||
To set up the zPages, call `ZPageServer.startHttpServerAndRegisterAllPages(int port)` in your
main function:
|
||||
|
||||
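```java
// ZPageServer is provided by the opentelemetry-sdk-extension-zpages dependency added above;
// import it from that module's package.
public class MyMainClass {
  public static void main(String[] args) throws Exception {
    // Starts an HTTP server on port 8080 and registers all zPages on it.
    ZPageServer.startHttpServerAndRegisterAllPages(8080);
    // ... do work
  }
}
```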
|
||||
|
||||
Alternatively, you can call `ZPageServer.registerAllPagesToHttpServer(HttpServer server)` to
|
||||
register the zPages to a shared server:
|
||||
|
||||
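```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

// ZPageServer is provided by the opentelemetry-sdk-extension-zpages dependency added above.
public class MyMainClass {
  public static void main(String[] args) throws Exception {
    // Create a shared server on port 8000 with a connection backlog of 10.
    HttpServer server = HttpServer.create(new InetSocketAddress(8000), 10);
    // Register the zPages on the shared server before starting it.
    ZPageServer.registerAllPagesToHttpServer(server);
    server.start();
    // ... do work
  }
}
```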
|
||||
|
||||
### Access the zPages
|
||||
|
||||
#### View all available zPages on the `/` index page
|
||||
|
||||
The index page `/` lists all available zPages with a link and description.
|
||||
|
||||
#### View trace spans on the `/tracez` zPage
|
||||
|
||||
The /tracez zPage displays information on running spans, sample span latencies, and sample error
|
||||
spans. The data is aggregated into a summary-level table:
|
||||
|
||||

|
||||
|
||||
You can click on each of the counts in the table cells to access the corresponding span
|
||||
details. For example, here are the details of the `ChildSpan` latency sample (row 1, col 4):
|
||||
|
||||

|
||||
|
||||
#### View and update the tracing configuration on the `/traceconfigz` zPage
|
||||
|
||||
The /traceconfigz zPage displays information about the currently active tracing configuration and
|
||||
provides an interface for users to modify relevant parameters. Here is what the web page looks like:
|
||||
|
||||

|