Replace the current three-piece setup (enum of feature variables, map of
feature vars to default values, and autogenerated bidirectional maps of
feature variables to and from strings) with a much simpler one-piece
setup: a single struct with one boolean-typed field per feature. This
preserves the overall structure of the package -- a single global
feature set protected by a mutex, and Set, Reset, and Enabled methods --
although the exact function signatures have all changed somewhat.
The executable config format remains the same, so no deployment changes
are necessary. This change does deprecate the AllowUnrecognizedFeatures
feature, as we cannot tell the JSON config parser to ignore unknown
field names, but that flag is already set to false in all of our
deployment environments.
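For illustration, the new one-piece shape is roughly the following sketch (the field names are placeholders, not the real flag list, and the real package's Set/Reset/Enabled accessors are approximated here with a Get accessor):
```go
package features

import "sync"

// Config has one boolean field per feature flag; a component's JSON config
// decodes its features stanza directly into this struct.
type Config struct {
	ExampleFlagOne bool
	ExampleFlagTwo bool
}

var (
	mu     sync.RWMutex
	global Config
)

// Set overwrites the global feature set with the given values.
func Set(fs Config) {
	mu.Lock()
	defer mu.Unlock()
	global = fs
}

// Reset returns all features to their zero (disabled) values.
func Reset() {
	mu.Lock()
	defer mu.Unlock()
	global = Config{}
}

// Get returns a copy of the current feature set; callers check individual
// fields, e.g. features.Get().ExampleFlagOne.
func Get() Config {
	mu.RLock()
	defer mu.RUnlock()
	return global
}
```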
Fixes https://github.com/letsencrypt/boulder/issues/6802
Fixes https://github.com/letsencrypt/boulder/issues/5229
For "ordinary" errors like "file not found" for some part of the config,
we would prefer to log an error and exit without logging about a panic
and printing a stack trace.
To achieve that, we want to call `defer AuditPanic()` once, at the top
of `cmd/boulder`'s main. That's so early that we haven't yet parsed the
config, which means we haven't yet initialized a logger. We compromise:
`AuditPanic` now calls `log.Get()`, which will retrieve the configured
logger if one has been set up, or will create a default one (which logs
to stderr/stdout).
AuditPanic and Fail/FailOnError now cooperate: Fail/FailOnError panic
with a special type, and AuditPanic checks for that type and prints a
simple message before exiting when it's present.
This PR also coincidentally fixes a bug: panicking didn't previously
cause the program to exit with nonzero status, because it recovered the
panic but then did not explicitly exit nonzero.
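A rough sketch of that cooperation (the real cmd package differs in details such as logger method names and the exact marker type):
```go
package cmd

import (
	"fmt"
	"os"
	"runtime/debug"

	blog "github.com/letsencrypt/boulder/log"
)

// failure is the marker type that Fail/FailOnError panic with.
type failure struct {
	msg string
}

// Fail raises a panic carrying a failure, to be caught by a deferred AuditPanic.
func Fail(msg string) {
	panic(failure{msg})
}

// FailOnError calls Fail if err is non-nil.
func FailOnError(err error, msg string) {
	if err != nil {
		Fail(fmt.Sprintf("%s: %s", msg, err))
	}
}

// AuditPanic is deferred at the top of main. For a failure it logs just the
// message; for any other panic it also logs a stack trace. Either way it
// exits nonzero.
func AuditPanic() {
	err := recover()
	if err == nil {
		return
	}
	// log.Get returns the configured logger if one has been set up, or a
	// default stderr/stdout logger otherwise.
	logger := blog.Get()
	if f, ok := err.(failure); ok {
		logger.AuditErr(f.msg)
	} else {
		logger.AuditErrf("Panic caused by err: %s", err)
		logger.AuditErrf("Stack Trace (Current frame) %s", debug.Stack())
	}
	os.Exit(1)
}
```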
Fixes #6933
Add the necessary scaffolding for deep health checking of our various
gRPC components. Each component implementation that also implements the
grpc.checker interface will be checked periodically, and the health
status of the component will be updated accordingly.
Add the necessary methods to SA to implement the grpc.checker interface
and register these new health checks with Consul.
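The shape of the check is roughly as follows (the interface name and signature here are illustrative, not necessarily the real grpc.checker interface):
```go
package checks

import (
	"context"
	"time"

	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checker is implemented by components that can report on their own health.
type checker interface {
	Health(ctx context.Context) error
}

// pollHealth periodically runs the component's health check and updates the
// status reported by the gRPC health service accordingly.
func pollHealth(ctx context.Context, c checker, hs *health.Server, service string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			status := healthpb.HealthCheckResponse_SERVING
			if err := c.Health(ctx); err != nil {
				status = healthpb.HealthCheckResponse_NOT_SERVING
			}
			hs.SetServingStatus(service, status)
		}
	}
}
```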
Additionally:
- Update entry point script to check for ProxySQL readiness.
- Increase the poll rate for gRPC Consul checks from 5s to 2s to help
with DNS failures on startup caused by check failures.
- Change log level for Consul from INFO to ERROR to deal with noisy logs
full of transport failures due to Consul gRPC checks firing before the
SAs are up.
Fixes #6878
Part of #6795
Export new Prometheus metrics for the `notBefore` and `notAfter` fields
to track internal certificate validity periods when calling the `Load()`
method for a `*tls.Config`. Each metric is labeled with the `serial`
field.
```
tlsconfig_notafter_seconds{serial="2152072875247971686"} 1.664821961e+09
tlsconfig_notbefore_seconds{serial="2152072875247971686"} 1.664821960e+09
```
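Roughly how those values can be produced when the certificate is loaded (the metric and label names match the sample above; the surrounding function is a sketch, not the actual `Load()` implementation):
```go
package tlsmetrics

import (
	"crypto/tls"
	"crypto/x509"

	"github.com/prometheus/client_golang/prometheus"
)

func exportValidity(stats prometheus.Registerer, cert tls.Certificate) error {
	notBefore := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "tlsconfig_notbefore_seconds", Help: "Certificate notBefore as a Unix timestamp."},
		[]string{"serial"})
	notAfter := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "tlsconfig_notafter_seconds", Help: "Certificate notAfter as a Unix timestamp."},
		[]string{"serial"})
	stats.MustRegister(notBefore, notAfter)

	leaf, err := x509.ParseCertificate(cert.Certificate[0])
	if err != nil {
		return err
	}
	serial := leaf.SerialNumber.String()
	notBefore.WithLabelValues(serial).Set(float64(leaf.NotBefore.Unix()))
	notAfter.WithLabelValues(serial).Set(float64(leaf.NotAfter.Unix()))
	return nil
}
```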
Fixes https://github.com/letsencrypt/boulder/issues/6829
Add a new shared config stanza which all boulder components can use to
configure their Open Telemetry tracing. This allows components to
specify where their traces should be sent, what their sampling ratio
should be, and whether or not they should respect their parent's
sampling decisions (so that web front-ends can ignore sampling info
coming from outside our infrastructure). It's likely we'll need to
evolve this configuration over time, but this is a good starting point.
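A sketch of what such a stanza can decode into (field names are illustrative, not the exact config schema):
```go
package config

// OpenTelemetryConfig is the shared tracing configuration stanza.
type OpenTelemetryConfig struct {
	// Endpoint is the collector address that traces are exported to.
	Endpoint string
	// SampleRatio is the fraction of new traces to sample, from 0 to 1.
	SampleRatio float64
	// Parentbased, when true, respects the sampling decision carried by an
	// incoming request's parent span; public-facing front-ends disable it so
	// sampling info from outside our infrastructure is ignored.
	Parentbased bool
}
```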
Add basic Open Telemetry setup to our existing cmd.StatsAndLogging
helper, so that it gets initialized at the same time as our other
observability helpers. This sets certain default fields on all
traces/spans generated by the service. Currently these include the
service name, the service version, and information about the telemetry
SDK itself. In the future we'll likely augment this with information
about the host and process.
Finally, add instrumentation for the HTTP servers and gRPC
clients/servers. This gives us a starting point of being able to monitor
Boulder, but is fairly minimal as this PR is already somewhat unwieldy:
It's really only enough to understand that everything is wired up
properly in the configuration. In subsequent work we'll enhance those
spans with more data, and add more spans for things not automatically
traced here.
Fixes https://github.com/letsencrypt/boulder/issues/6361
---------
Co-authored-by: Aaron Gable <aaron@aarongable.com>
Deprecate the ROCSPStage7 feature flag, which caused the RA and CA to
stop generating OCSP responses when issuing new certs and when revoking
certs. (That functionality is now handled just-in-time by the
ocsp-responder.) Delete the old OCSP-generating codepaths from the RA
and CA. Remove the CA's internal reference to an OCSP implementation,
because it no longer needs it.
Additionally, remove the SA's "Issuers" config field, which was never
used.
Fixes #6285
The CA, RA, and VA have multiple goroutines running alongside the
primary gRPC handling goroutine. These ancillary goroutines should be
gracefully shut down when the process is about to exit. Historically, we
have handled this by putting a call to each of these goroutines'
shutdown functions inside cmd.CatchSignals, so that when a SIGINT is
received, all of the various cleanup routines happen in sequence.
But there's a cleaner way to do it: just use defer! All of these
cleanups need to happen after the primary gRPC server has fully shut
down, so that we know they stick around at least as long as the service
is handling gRPC requests. And when the service receives a SIGINT,
cmd.CatchSignals will call the gRPC server's GracefulStop, which will
cause the server's .Serve() to finally exit, which will cause start() to
exit, which will cause main() to exit, which will cause all deferred
functions to be run.
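A self-contained illustration of that ordering (simplified; the real components use the bgrpc and cmd helpers):
```go
package main

import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
)

func main() {
	stopAncillary := startAncillaryGoroutine()
	defer stopAncillary() // runs only after srv.Serve returns below

	lis, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()

	go func() {
		sigs := make(chan os.Signal, 1)
		signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
		<-sigs
		srv.GracefulStop() // makes srv.Serve return cleanly
	}()

	// When GracefulStop is called, Serve returns nil, main returns, and the
	// deferred cleanups run in reverse order.
	if err := srv.Serve(lis); err != nil {
		log.Printf("gRPC server exited: %s", err)
	}
}

func startAncillaryGoroutine() (stop func()) {
	done := make(chan struct{})
	go func() { <-done /* ... background work ... */ }()
	return func() { close(done) }
}
```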
In addition, remove filterShutdownErrors, as the bug which made it
necessary (.Serve() returning an error even when GracefulStop() is
called) was fixed back in 2017. This allows us to call the start()
function in a much more natural way, simply logging any error it returns
instead of calling os.Exit(1).
This allows us to simplify the exit-handling code in these three
services' main() functions, and lets us be a bit more idiomatic with our
deferred cleanup functions.
Part of #6794
- Require `letsencrypt/validator` package.
- Add a framework for registering configuration structs and any custom
validators for each Boulder component at `init()` time.
- Add a `validate` subcommand which allows you to pass a `-component`
name and `-config` file path (a sketch of the overall flow follows this
list).
- Expose validation via exported utility functions
`cmd.LookupConfigValidator()`, `cmd.ValidateJSONConfig()` and
`cmd.ValidateYAMLConfig()`.
- Add unit test which validates all registered component configuration
structs against test configuration files.
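A sketch of the register-and-validate flow, assuming go-playground/validator-style struct tags (the real framework registers each component's struct at `init()` time and dispatches on the `-component` name):
```go
package main

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/go-playground/validator/v10"
)

// exampleConfig stands in for a component's real config struct; validation
// rules live in struct tags.
type exampleConfig struct {
	DebugAddr string `json:"debugAddr" validate:"required,hostname_port"`
	Timeout   string `json:"timeout"   validate:"required"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: validate <config.json>")
		os.Exit(1)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var cfg exampleConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := validator.New().Struct(cfg); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("config is valid")
}
```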
Part of #6052
Remove tracing using Beeline from Boulder. The only remnant left behind
is the deprecated configuration, to ensure deployability.
We had previously planned to swap in OpenTelemetry in a single PR, but
that adds significant churn in a single change, so we're doing this as
multiple steps that will each be significantly easier to reason about
and review.
Part of #6361
We rely on the ratelimit/ package in CI to validate our ratelimit
configurations. However, because that package relies on cmd/ just for
cmd.ConfigDuration, many additional dependencies get pulled in.
This refactors just that struct into a separate config package. This was
done using GoLand's automatic refactoring tooling, which also organized
a few imports while it was touching them, keeping standard library,
internal, and external dependencies grouped.
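The extracted type is, in broad strokes, a small wrapper like this (a sketch; the real config duration type carries additional methods):
```go
package config

import (
	"encoding/json"
	"time"
)

// Duration wraps time.Duration so configs can express durations as strings
// like "30s" or "1h".
type Duration struct {
	time.Duration
}

// UnmarshalJSON parses a JSON string using time.ParseDuration.
func (d *Duration) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	parsed, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	d.Duration = parsed
	return nil
}
```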
Add a new time.Duration field, LagFactor, to both the SA's config struct
and the read-only SA's implementation struct. In the GetRegistration,
GetOrder, and GetAuthorization2 methods, if the database select returned
a NoRows error and a lagFactor duration is configured, then sleep for
the configured lagFactor duration and retry the select.
This allows us to compensate for the replication lag between our primary
write database and our read-only replica databases. Sometimes clients
will fire requests in rapid succession (such as creating a new order,
then immediately querying the authorizations associated with that
order), and the subsequent requests will fail because they are directed
to read replicas which are lagging behind the primary. Adding this
simple sleep-and-retry will let us mitigate many of these failures,
without adding too much complexity.
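The retry itself is tiny; a sketch of the shape (the real SA methods wrap their specific selects rather than a generic callback):
```go
package sa

import (
	"database/sql"
	"errors"
	"time"
)

// selectWithLagRetry runs a select, and if it finds no rows and a lagFactor
// is configured, waits out the assumed replication lag and tries once more.
func selectWithLagRetry(selectOne func() error, lagFactor time.Duration) error {
	err := selectOne()
	if errors.Is(err, sql.ErrNoRows) && lagFactor != 0 {
		time.Sleep(lagFactor)
		err = selectOne()
	}
	return err
}
```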
Fixes #6593
Neither our testing, staging, nor production configs use the
DBConfig.DBConnect config value. Remove it.
To connect to a database, you have to provide a connection URL. These
URLs often contain sensitive information such as DB usernames and
passwords, so we don't store them directly in our configs -- instead, we
store paths to files which contain these strings, and provision those
files via a separate mechanism. We maintained the ability to provide a
URL directly in the config for the sake of easy testing, but have not
used it for that purpose for some time now.
Create a new gRPC service named StorageAuthorityReadOnly which only
exposes a read-only subset of the existing StorageAuthority service's
methods.
Implement this by splitting the existing SA in half, and having the
read-write half embed and wrap an instance of the read-only half.
Unfortunately, many of our tests use exported read-write methods as part
of their test setup, so the tests are all performed against the
read-write struct, but they exercise the same code that the read-only
implementation exposes.
Expose this new service at the SA on the same port as the existing
service, but with (in config-next) different sets of allowed clients. In
the future, read-only clients will be removed from the read-write
service's set of allowed clients.
Part of #6454
Turn bgrpc.NewServer into a builder-pattern, with a config-based
initialization, multiple calls to Add to add new gRPC services, and a
final call to Build to produce the start() and stop() functions which
control server behavior. All calls are chainable to produce compact code
in each component's main() function.
This improves the process of creating a new gRPC server in three ways:
1) It avoids the need for generics/templating, which was slightly
verbose.
2) It allows the set of services to be registered on this server to be
known ahead of time.
3) It greatly streamlines adding multiple services to the same server,
which we use today in the VA and will be using soon in the SA and CA.
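A self-contained illustration of the builder shape (not the actual bgrpc types; config handling, TLS, metrics, and health wiring are omitted):
```go
package grpcbuilder

import (
	"fmt"
	"net"

	"google.golang.org/grpc"
)

type serverBuilder struct {
	services map[*grpc.ServiceDesc]any
	err      error
}

func newServer() *serverBuilder {
	return &serverBuilder{services: map[*grpc.ServiceDesc]any{}}
}

// Add registers another service to be served; calls are chainable.
func (sb *serverBuilder) Add(desc *grpc.ServiceDesc, impl any) *serverBuilder {
	if _, ok := sb.services[desc]; ok {
		sb.err = fmt.Errorf("service %q registered twice", desc.ServiceName)
		return sb
	}
	sb.services[desc] = impl
	return sb
}

// Build assembles the server and returns start and stop functions.
func (sb *serverBuilder) Build(addr string) (func() error, func(), error) {
	if sb.err != nil {
		return nil, nil, sb.err
	}
	srv := grpc.NewServer()
	for desc, impl := range sb.services {
		srv.RegisterService(desc, impl)
	}
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, nil, err
	}
	return func() error { return srv.Serve(lis) }, srv.GracefulStop, nil
}
```
In a component's main(), usage then reads like newServer().Add(desc1, impl1).Add(desc2, impl2).Build(addr), with the returned start function's error handed to cmd.FailOnError.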
While we're here, add a new per-service config stanza to the
GRPCServerConfig, so that individual services on the same server can
have their own configuration. For now, only provide a "ClientNames" key,
which will be used in a follow-up PR.
Part of #6454
Collapse most of our boilerplate gRPC creation steps (in particular,
creating default metrics, making the server and listener, registering
the server, creating and registering the health service, filtering
shutdown errors from the output, and gracefully stopping) into a single
function in the existing bgrpc package. This allows all but one of our
server main functions to drop their calls to NewServer and
NewServerMetrics.
To enable this, create a new helper type and method in the bgrpc
package. Conceptually, this could be just a new function, but it must be
attached to a new type so that it can be generic over the type of gRPC
server being created. (Unfortunately, the grpc.RegisterFooServer methods
do not accept an interface type for their second argument).
The only main function which is not updated is the boulder-va, which is
a special case because it creates multiple gRPC servers but (unlike the
CA) serves them all on the same port with the same server and listener.
Part of #6452
- Move incidents tables from `boulder_sa` to `incidents_sa` (added in #6344)
- Grant read perms for all tables in `incidents_sa`
- Modify unit tests to account for new schema and grants
- Add database cleaning func for `boulder_sa`
- Adjust cleanup funcs to omit `sql-migrate` tables instead of `goose`
Resolves #6328
Adds a ROCSP Redis client to the SA if cluster information is provided in the
SA config. If a Redis cluster is configured, all new certificate OCSP
responses added with sa.AddPrecertificate will also be written to the
Redis cluster on a best-effort basis, without blocking or failing on errors.
Fixes: #5871
Boulder components initialize their gorp and gorp-less (non-wrapped) database
clients via two new SA helpers. These helpers handle client construction,
database metric initialization, and (for gorp only) debug logging setup.
Removes transaction isolation parameter `'READ-UNCOMMITTED'` from all database
connections.
Fixes #5715
Fixes #5889
The resulting `boulder` binary can be invoked by different names to
trigger the behavior of the relevant subcommand. For instance, symlinking
and invoking as `boulder-ca` acts as the CA. Symlinking and invoking as
`boulder-va` acts as the VA.
This reduces the .deb file size from about 200MB to about 20MB.
This works by creating a registry that maps subcommand names to `main`
functions. Each subcommand registers itself in an `init()` function. The
monolithic `boulder` binary then checks what name it was invoked with
(`os.Args[0]`), looks it up in the registry, and invokes the appropriate
`main`. To avoid conflicts, all of the old `package main` are replaced
with `package notmain`.
To get the list of registered subcommands, run `boulder --list`. This
is used when symlinking all the variants into place, to ensure the set
of symlinked names matches the entries in the registry.
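A self-contained sketch of the dispatch-by-name mechanism (the real cmd registry and helper names differ):
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

var registry = map[string]func(){}

// register is called from each subcommand's init() in the real code.
func register(name string, main func()) { registry[name] = main }

func init() {
	register("boulder-ca", func() { fmt.Println("acting as the CA") })
	register("boulder-va", func() { fmt.Println("acting as the VA") })
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "--list" {
		var names []string
		for name := range registry {
			names = append(names, name)
		}
		sort.Strings(names)
		for _, name := range names {
			fmt.Println(name)
		}
		return
	}
	name := filepath.Base(os.Args[0])
	sub, ok := registry[name]
	if !ok {
		fmt.Fprintf(os.Stderr, "unknown subcommand %q\n", name)
		os.Exit(1)
	}
	sub()
}
```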
Fixes #5692
- Replace `gorp.DbMap` with calls that use `sql.DB` directly
- Use `rows.Scan()` and `rows.Next()` to get query results (which opens the door to streaming the results)
- Export function `CertStatusMetadataFields` from `SA`
- Add new function `ScanCertStatusRow` to `SA`
- Add new function `NewDbSettingsFromDBConfig` to `SA`
Fixes #5642
Part of #5715
Remove the last of the gRPC wrapper files. In order to do so:
- Remove the `core.StorageGetter` interface. Replace it with a new
interface (whose methods include the `...grpc.CallOption` arg)
inside the `sa/proto/` package.
- Remove the `core.StorageAdder` interface. There's no real use-case
for having a write-only interface.
- Remove the `core.StorageAuthority` interface, as it is now redundant
with the autogenerated `sapb.StorageAuthorityClient` interface.
- Replace the `certificateStorage` interface (which appears in two
different places) with a single unified interface also in `sa/proto/`.
- Update all test mocks to include the `_ ...grpc.CallOption` arg in
their method signatures so they match the gRPC client interface.
- Delete many methods from mocks which are no longer necessary (mostly
because they're mocking old authz1 methods that no longer exist).
- Move the two `test/inmem/` wrappers into their own sub-packages to
avoid an import cycle.
- Simplify the `satest` package to satisfy one of its TODOs and to
avoid an import cycle.
- Add many methods to the `test/inmem/sa/` wrapper, to accommodate all
of the methods which are called in unittests.
Fixes #5600
This changeset adds a second DB connect string for the SA for use in
read-only queries that are not themselves dependencies for read-write
queries. In other words, this is attempting to only catch things like
rate-limit `SELECT`s and other coarse-counting, so we can potentially
move those read queries off the read-write primary database.
It also adds a second DB connect string to the OCSP Updater. This is a
little trickier, as the subsequent `UPDATE`s _are_ dependent on the
output of the `SELECT`, but in this case it's operating on data batches,
and a few seconds' replication latency are several orders of magnitude
below the threshold for update frequency, so any certificates that
aren't caught on run `n` can be caught on run `n+1`.
Since we export DB metrics to Prometheus, this also refactors
`InitDBMetrics` to take a DB Address (host:port tuple) and User out of
the DB connection DSN and include those as labels in the metrics.
Fixes #5550
Fixes #4985
A recent mysql driver upgrade caused a performance regression. We
believe this may be due to cancellations getting passed through to the
database driver, which as of the upgrade will more aggressively tear
down connections that experienced a cancellation.
Also, we only recently started propagating cancellations all the way
from the frontend, in #5404.
This makes it so the driver doesn't see the cancellation.
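A minimal sketch of how cancellation can be hidden from the driver, assuming a wrapper like this one (the actual fix may differ in detail):
```go
package db

import (
	"context"
	"time"
)

// noCancelCtx passes values through to the wrapped context but reports that
// it can never be cancelled, so the database driver never sees a
// cancellation and never tears down the connection because of one.
type noCancelCtx struct{ context.Context }

func (noCancelCtx) Deadline() (time.Time, bool) { return time.Time{}, false }
func (noCancelCtx) Done() <-chan struct{}       { return nil }
func (noCancelCtx) Err() error                  { return nil }

func withoutCancel(ctx context.Context) context.Context {
	return noCancelCtx{ctx}
}
```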
Second attempt at #5447
Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).
Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.
Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
A named field `DB`, in each component configuration struct, acts as the
receiver for the value of `db` when component JSON files are
unmarshalled.
When `cmd.DBConfig` fields are received at the root of a component
configuration struct instead of under `DB`, copy them to the `DB` field
of the component configuration struct.
Move existing `cmd.DBConfig` values from the root of each component's
JSON configuration in `test/config-next` to `db`.
Part of #5275
In #5235 we replaced MaxDBConns with MaxOpenConns.
One week ago MaxDBConns was removed from all dev, staging, and
production configurations. This change completes the removal of
MaxDBConns from all components and test/config.
Fixes #5249
Historically the only database/sql driver setting exposed via JSON
config was maxDBConns. This change adds support for maxIdleConns,
connMaxLifetime, connMaxIdleTime, and renames maxDBConns to
maxOpenConns. The addition of these settings will give our SRE team a
convenient method for tuning the reuse/closure of database connections.
A new struct, DBSettings, has been added to the SA. The struct and each
of its fields have been commented.
All new fields have been plumbed through to the relevant Boulder
components and exported as Prometheus metrics. Tests have been
added/modified to ensure that the fields are being set. There should be
no loss in coverage.
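How the four settings map onto database/sql (a sketch; the real plumbing goes through the SA's DBSettings struct):
```go
package sa

import (
	"database/sql"
	"time"
)

// applyDBSettings applies the JSON-configured connection settings to an
// opened *sql.DB. maxOpenConns is the renamed maxDBConns.
func applyDBSettings(db *sql.DB, maxOpenConns, maxIdleConns int, connMaxLifetime, connMaxIdleTime time.Duration) {
	db.SetMaxOpenConns(maxOpenConns)
	db.SetMaxIdleConns(maxIdleConns)
	db.SetConnMaxLifetime(connMaxLifetime)
	db.SetConnMaxIdleTime(connMaxIdleTime)
}
```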
Deployability concerns for the migration from maxDBConns to maxOpenConns
have been addressed with the temporary addition of the helper method
cmd.DBConfig.GetMaxOpenConns(). This method can be removed once
test/config is defaulted to using maxOpenConns. Relevant sections of the
code have TODOs added that link back to a newly opened issue.
Fixes #5199
This health service implements the gRPC Health Checking
Protocol, as defined in
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
and as implemented by the gRPC authors in
https://pkg.go.dev/google.golang.org/grpc/health@v1.29.0
It simply instantiates a health service, and attaches it to the same
gRPC server that is handling requests to the primary (e.g. CA) service.
When the main service would be shut down (e.g. because it caught a
signal), it also sets the status of the service to NOT_SERVING.
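Wiring the health service up looks roughly like this (a sketch using the stock grpc/health package):
```go
package healthsetup

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// setupHealth attaches a health service to the same gRPC server that serves
// the primary service. The empty service name refers to the server overall.
func setupHealth(srv *grpc.Server) *health.Server {
	hs := health.NewServer()
	healthpb.RegisterHealthServer(srv, hs)
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	return hs
}

// On shutdown (e.g. after catching a signal), the main service flips it to:
//   hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
```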
This change also imports the health client into our grpc client,
ensuring that all of our grpc clients use the health service to inform
their load-balancing behavior.
This will be used to replace our current usage of polling the debug
port to determine whether a given service is up and running. It may
also be useful for more comprehensive checks and blackbox probing
in the future.
Part of #5074
Go 1.11+ updated the `sql.DBStats` struct with new fields that are of
interest to us. This PR routes these stats to Prometheus by replacing
the existing autoprom stats code with new first-class Prometheus
metrics. Resolves https://github.com/letsencrypt/boulder/issues/4095
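One way to export those fields as first-class Prometheus metrics (the metric names here are illustrative, not necessarily the ones used in Boulder):
```go
package metrics

import (
	"database/sql"

	"github.com/prometheus/client_golang/prometheus"
)

// registerDBStats exposes a few sql.DBStats fields; the same pattern extends
// to WaitDuration, Idle, InUse, and the other Go 1.11+ fields.
func registerDBStats(db *sql.DB, reg prometheus.Registerer) {
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{Name: "db_max_open_connections", Help: "Maximum number of open connections."},
		func() float64 { return float64(db.Stats().MaxOpenConnections) },
	))
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{Name: "db_open_connections", Help: "Currently open connections."},
		func() float64 { return float64(db.Stats().OpenConnections) },
	))
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{Name: "db_wait_count", Help: "Total number of connection waits."},
		func() float64 { return float64(db.Stats().WaitCount) },
	))
}
```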
The `max_db_connections` stat from the SA is removed because the Go 1.11+
`sql.DBStats.MaxOpenConnections` field will give us a better view of
the same information.
The autoprom "reused_authz" stat that was being incremented in
`SA.GetPendingAuthorization` was also removed. It wasn't doing what it
said it was (counting reused authorizations) and was instead counting
the number of times `GetPendingAuthorization` returned an authz.
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.
This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.
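A sketch of both interceptors (the metadata key and histogram wiring are illustrative):
```go
package interceptors

import (
	"context"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

const clientRequestTimeKey = "client-request-time"

// clientInterceptor stamps the outgoing request metadata with the send time.
func clientInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	nanos := strconv.FormatInt(time.Now().UnixNano(), 10)
	ctx = metadata.AppendToOutgoingContext(ctx, clientRequestTimeKey, nanos)
	return invoker(ctx, method, req, reply, cc, opts...)
}

// serverInterceptor reads the send time back out and observes the delta.
func serverInterceptor(hist prometheus.Histogram) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if md, ok := metadata.FromIncomingContext(ctx); ok {
			if vals := md.Get(clientRequestTimeKey); len(vals) > 0 {
				if nanos, err := strconv.ParseInt(vals[0], 10, 64); err == nil {
					hist.Observe(time.Since(time.Unix(0, nanos)).Seconds())
				}
			}
		}
		return handler(ctx, req)
	}
}
```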
Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands.
A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.
Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
Our various main.go functions gated some key code on whether the TLS
and/or GRPC config fields were present. Now that those fields are fully
deployed in production, we can simplify the code and require them.
Also, rename tls to tlsConfig everywhere to avoid confusion with the tls
package.
Avoid assigning to the same err from two different goroutines in
boulder-ca (fix a race).
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.
Specifically, it adds two convenience functions, one for clients and one for servers, that makes the necessary metrics object and registers it. We run these in the main function of each server.
I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.
Also, update go-grpc-prometheus to get the necessary methods.
```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
Go's default for MaxIdleConns is 2: https://golang.org/src/database/sql/sql.go#L686.
Graphs show we are opening 100-200 fresh connections per second on the SA.
Changing this default should reduce that a lot, which should reduce load on both
the SA and MariaDB. This should also improve latency, since every new TCP
connection adds a little bit of latency.
Previously, we used prometheus.DefaultRegisterer to register our stats, which uses global state to export its HTTP stats. We also used net/http/pprof's behavior of registering to the default global HTTP ServeMux, via DebugServer, which starts an HTTP server that uses that global ServeMux.
In this change, I merge DebugServer's functions into StatsAndLogging. StatsAndLogging now takes an address parameter and fires off an HTTP server in a goroutine. That HTTP server is newly defined, and doesn't use DefaultServeMux. On it is registered the Prometheus stats handler, and handlers for the various pprof traces. In the process I split StatsAndLogging internally into two functions: makeStats and MakeLogger. I didn't port across the expvar variable exporting, which serves a similar function to Prometheus stats but which we never use.
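The new debug server is essentially a non-default ServeMux carrying the Prometheus handler and the pprof handlers (a sketch; the actual wiring lives inside StatsAndLogging):
```go
package debugserver

import (
	"net/http"
	"net/http/pprof"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// start serves metrics and pprof on its own mux, not http.DefaultServeMux.
func start(addr string, reg *prometheus.Registry) {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	go func() {
		_ = http.ListenAndServe(addr, mux)
	}()
}
```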
One nice immediate effect of this change: since StatsAndLogging now requires an address, I noticed a bunch of commands that called StatsAndLogging and passed around the resulting Scope, but never made use of it because they didn't run a DebugServer. Under the old StatsD world, these commands could still have exported their stats by pushing, but since we moved to Prometheus their stats stopped being collected. We haven't used any of these stats, so instead of adding debug ports to all short-lived commands, or setting up a push gateway, I simply removed them and switched those commands to initialize only a Logger, no stats.
Since we can make up to 100 SQL queries from this method (based on the 100-SAN
limit), sometimes it is too slow and we get a timeout for large certificates. By
running some of those queries in parallel, we can speed things up and stop
getting timeouts.
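The general shape of the parallelization, using errgroup with a concurrency cap (the real method and queries differ):
```go
package sa

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// checkNamesInParallel runs one check per name concurrently, capping the
// number in flight so we don't exhaust the connection pool, and returns the
// first error encountered (if any).
func checkNamesInParallel(ctx context.Context, names []string, check func(context.Context, string) error) error {
	g, gctx := errgroup.WithContext(ctx)
	sem := make(chan struct{}, 10)
	for _, name := range names {
		name := name // capture loop variable for the goroutine
		g.Go(func() error {
			sem <- struct{}{}
			defer func() { <-sem }()
			return check(gctx, name)
		})
	}
	return g.Wait()
}
```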
Fixes #3020.
In order to write integration tests for some features, especially related to rate limiting, rechecking of CAA, and expiration of authzs, orders, and certs, we need to be able to fake the passage of time in integration tests.
To do so, this change switches out all clock.Default() instances for cmd.Clock(), which can be set manually with the FAKECLOCK environment variable. integration-test.py now starts up all servers once before the main body of tests, with FAKECLOCK set to a date 70 days ago, and does some initial setup for a new integration test case. That test case tries to fetch a 70-day-old authz URL, and expects it to 404.
In order to make this work, I also had to change a number of our test binaries to shut down cleanly in response to SIGTERM. Without that change, stopping the servers between the setup phase and the main tests caused startservers.check() to fail, because some processes exited with nonzero status.
Note: This is an initial stab at things, to prove out the technique. Long-term, I think we will want to use an idiom where test cases are classes that have a number of optional setup phases that may be run at e.g. 70 days prior and 5 days prior. This could help us avoid a proliferation of global state as we add more time-dependent test cases.