[es] Creating first disruptoins.md file

This commit is contained in:
gamba47 2021-12-29 15:24:13 -03:00
parent ff4e472dcf
commit eab7b6cbdc
1 changed files with 276 additions and 0 deletions


@@ -0,0 +1,276 @@
---
reviewers:
- electrocucaracha
- raelga
title: Disruptions
content_type: concept
weight: 60
---
<!-- overview -->
This guide is for application owners who want to build highly available
applications, and who therefore need to understand what types of disruptions
can happen to Pods. It is also for cluster administrators who want to perform
automated cluster actions, like upgrading and autoscaling clusters.
<!-- body -->
## Voluntary and involuntary disruptions

Pods do not disappear until someone (a person or a controller) destroys them,
or there is an unavoidable hardware or system software error.

We call these unavoidable cases *involuntary disruptions* to an application.
Examples are:

- a hardware failure of the physical machine backing the node
- a cluster administrator deleting a VM (instance) by mistake
- a cloud provider or hypervisor failure making the VM disappear
- a kernel panic
- the node disappearing from the cluster due to a network partition
- eviction of a pod because the node is [out-of-resources](/docs/concepts/scheduling-eviction/node-pressure-eviction/).

Except for the out-of-resources condition, all these conditions should be
familiar to most users; they are not specific to Kubernetes.
We call the other cases *voluntary disruptions*. These include both actions
initiated by the application owner and those initiated by a cluster
administrator. Typical application owner actions include:

- deleting the deployment or other controller that manages the pod
- updating a deployment's pod template, causing a restart
- directly deleting a pod (e.g. by accident)

Cluster administrator actions include:

- [Draining a node](/docs/tasks/administer-cluster/safely-drain-node/) for repair or upgrade.
- Draining a node from a cluster to scale the cluster down (learn about [Cluster Autoscaling](https://github.com/kubernetes/autoscaler/#readme)).
- Removing a pod from a node to permit something else to fit on that node.

These actions might be taken directly by the cluster administrator, by
automation run by the cluster administrator, or by your cluster hosting provider.

Ask your cluster administrator, or consult your cloud provider or distribution
documentation, to determine whether any sources of voluntary disruptions are
enabled for your cluster. If none are enabled, you can skip creating Pod
Disruption Budgets.
{{< caution >}}
Not all voluntary disruptions are constrained by Pod Disruption Budgets. For example,
deleting deployments or pods bypasses Pod Disruption Budgets.
{{< /caution >}}
## Dealing with disruptions

Here are some ways to mitigate involuntary disruptions:

- Ensure that your pod [requests the resources](/docs/tasks/configure-pod-container/assign-memory-resource) it needs.
- Replicate your application if you need higher availability. (Learn about running replicated
[stateless](/docs/tasks/run-application/run-stateless-application-deployment/)
and [stateful](/docs/tasks/run-application/run-replicated-stateful-application/) applications.)
- For even higher availability when running replicated applications, spread them
across racks (using
[anti-affinity](/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity))
or across zones (if using a [multi-zone cluster](/docs/setup/multiple-zones)).

The frequency of voluntary disruptions varies. On a basic Kubernetes cluster, there are no
automated voluntary disruptions (only user-triggered ones). However, your cluster administrator or hosting provider
may run some additional services that cause voluntary disruptions. For example,
rolling out node software updates can cause voluntary disruptions. Also, some implementations
of cluster (node) autoscaling may cause voluntary disruptions in order to defragment and compact nodes.
Your cluster administrator or hosting provider should have documented what level of voluntary
disruptions, if any, to expect. Certain configuration options, such as
[using PriorityClasses](/docs/concepts/scheduling-eviction/pod-priority-preemption/)
in your pod spec, can also cause voluntary (and involuntary) disruptions.
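For instance, a pod opts into a priority class through its spec; a minimal sketch, where the class name `high-priority` is a hypothetical placeholder:

```yaml
spec:
  # Hypothetical PriorityClass name; pods with higher priority may
  # preempt (voluntarily disrupt) pods with lower priority.
  priorityClassName: high-priority
```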
## Pod disruption budgets
{{< feature-state for_k8s_version="v1.21" state="stable" >}}
Kubernetes offers features to help you run highly available applications even when you
introduce frequent voluntary disruptions.
As an application owner, you can create a PodDisruptionBudget (PDB) for each application.
A PDB limits the number of Pods of a replicated application that are down simultaneously from
voluntary disruptions. For example, a quorum-based application would
like to ensure that the number of replicas running is never brought below the
number needed for a quorum. A web front end might want to
ensure that the number of replicas serving load never falls below a certain
percentage of the total.
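As a sketch, a minimal PDB for such an application might look like the following; the name and the `app: zookeeper` label are illustrative placeholders, not values from this page:

```yaml
apiVersion: policy/v1          # stable since v1.21
kind: PodDisruptionBudget
metadata:
  name: zk-pdb                 # hypothetical name
spec:
  minAvailable: 2              # keep at least 2 matching pods available
  selector:
    matchLabels:
      app: zookeeper           # must match the labels the workload's selector uses
```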
Cluster managers and hosting providers should use tools which
respect PodDisruptionBudgets by calling the [Eviction API](/docs/tasks/administer-cluster/safely-drain-node/#eviction-api)
instead of directly deleting pods or deployments.
For example, the `kubectl drain` subcommand lets you mark a node as going out of
service. When you run `kubectl drain`, the tool tries to evict all of the Pods on
the Node you're taking out of service. The eviction request that `kubectl` submits on
your behalf may be temporarily rejected, so the tool periodically retries all failed
requests until all Pods on the target node are terminated, or until a configurable timeout
is reached.
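For example, a drain with an explicit timeout might look like this; the node name and timeout value are illustrative:

```shell
# Cordon the node and evict its pods, retrying failed evictions
# until they all succeed or the 5-minute timeout is reached.
kubectl drain node-1 --ignore-daemonsets --timeout=5m
```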
A PDB specifies the number of replicas that an application can tolerate having, relative to how
many it is intended to have. For example, a Deployment which has a `.spec.replicas: 5` is
supposed to have 5 pods at any given time. If its PDB allows for there to be 4 at a time,
then the Eviction API will allow voluntary disruption of one (but not two) pods at a time.
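Concretely, the PDB's `status` reflects that arithmetic; a sketch of what it might report for the 5-replica example (field values illustrative):

```yaml
status:
  currentHealthy: 5       # pods currently available
  desiredHealthy: 4       # minimum the budget requires
  disruptionsAllowed: 1   # currentHealthy - desiredHealthy
  expectedPods: 5         # derived from the owning workload's replica count
```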
The group of pods that comprise the application is specified using a label selector, the same
as the one used by the application's controller (deployment, stateful-set, etc).
The "intended" number of pods is computed from the `.spec.replicas` of the workload resource
that is managing those pods. The control plane discovers the owning workload resource by
examining the `.metadata.ownerReferences` of the Pod.
[Involuntary disruptions](#voluntary-and-involuntary-disruptions) cannot be prevented by PDBs; however they
do count against the budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count
against the disruption budget, but workload resources (such as Deployment and StatefulSet)
are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures
during application updates is configured in the spec for the specific workload resource.
When a pod is evicted using the eviction API, it is gracefully
[terminated](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination), honoring the
`terminationGracePeriodSeconds` setting in its [PodSpec](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#podspec-v1-core).
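For reference, that grace period is set on the pod itself; a hedged fragment (value and names illustrative) giving the pod 30 seconds to shut down cleanly before it is killed:

```yaml
spec:
  terminationGracePeriodSeconds: 30   # seconds allowed for graceful shutdown
  containers:
  - name: app                         # hypothetical container
    image: registry.example.com/app:1.0
```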
## PodDisruptionBudget example {#pdb-example}
Consider a cluster with 3 nodes, `node-1` through `node-3`.
The cluster is running several applications. One of them has 3 replicas initially called
`pod-a`, `pod-b`, and `pod-c`. Another, unrelated pod without a PDB, called `pod-x`, is also shown.
Initially, the pods are laid out as follows:

| node-1 | node-2 | node-3 |
|:--------------------:|:-------------------:|:------------------:|
| pod-a *available* | pod-b *available* | pod-c *available* |
| pod-x *available* | | |
All 3 pods are part of a deployment, and they collectively have a PDB which requires
there be at least 2 of the 3 pods to be available at all times.
For example, assume the cluster administrator wants to reboot into a new kernel version to fix a bug in the kernel.
The cluster administrator first tries to drain `node-1` using the `kubectl drain` command.
That tool tries to evict `pod-a` and `pod-x`. This succeeds immediately.
Both pods go into the `terminating` state at the same time.
This puts the cluster in this state:

| node-1 *draining* | node-2 | node-3 |
|:--------------------:|:-------------------:|:------------------:|
| pod-a *terminating* | pod-b *available* | pod-c *available* |
| pod-x *terminating* | | |
The deployment notices that one of the pods is terminating, so it creates a replacement
called `pod-d`. Since `node-1` is cordoned, it lands on another node. Something has
also created `pod-y` as a replacement for `pod-x`.
(Note: for a StatefulSet, `pod-a`, which would be called something like `pod-0`, would need
to terminate completely before its replacement, which is also called `pod-0` but has a
different UID, could be created. Otherwise, the example applies to a StatefulSet as well.)
Now the cluster is in this state:

| node-1 *draining* | node-2 | node-3 |
|:--------------------:|:-------------------:|:------------------:|
| pod-a *terminating* | pod-b *available* | pod-c *available* |
| pod-x *terminating* | pod-d *starting* | pod-y |
At some point, the pods terminate, and the cluster looks like this:

| node-1 *drained* | node-2 | node-3 |
|:--------------------:|:-------------------:|:------------------:|
| | pod-b *available* | pod-c *available* |
| | pod-d *starting* | pod-y |
At this point, if an impatient cluster administrator tries to drain `node-2` or
`node-3`, the drain command will block, because there are only 2 available
pods for the deployment, and its PDB requires at least 2. After some time passes, `pod-d` becomes available.
The cluster state now looks like this:

| node-1 *drained* | node-2 | node-3 |
|:--------------------:|:-------------------:|:------------------:|
| | pod-b *available* | pod-c *available* |
| | pod-d *available* | pod-y |
Now, the cluster administrator tries to drain `node-2`.
The drain command will try to evict the two pods in some order, say
`pod-b` first and then `pod-d`. It will succeed at evicting `pod-b`.
But, when it tries to evict `pod-d`, it will be refused because that would leave only
one pod available for the deployment.
The deployment creates a replacement for `pod-b` called `pod-e`.
Because there are not enough resources in the cluster to schedule
`pod-e`, the drain will again block. The cluster may end up in this
state:

| node-1 *drained* | node-2 | node-3 | *no node* |
|:--------------------:|:-------------------:|:------------------:|:------------------:|
| | pod-b *terminating* | pod-c *available* | pod-e *pending* |
| | pod-d *available* | pod-y | |
At this point, the cluster administrator needs to
add a node back to the cluster to proceed with the upgrade.
You can see how Kubernetes varies the rate at which disruptions
can happen, according to:
- how many replicas an application needs
- how long it takes to gracefully shutdown an instance
- how long it takes a new instance to start up
- the type of controller
- the cluster's resource capacity
## Separating Cluster Owner and Application Owner Roles
Often, it is useful to think of the Cluster Manager
and Application Owner as separate roles with limited knowledge
of each other. This separation of responsibilities
may make sense in these scenarios:
- when there are many application teams sharing a Kubernetes cluster, and
there is natural specialization of roles
- when third-party tools or services are used to automate cluster management
Pod Disruption Budgets support this separation of roles by providing an
interface between the roles.
If you do not have such a separation of responsibilities in your organization,
you may not need to use Pod Disruption Budgets.
## How to perform Disruptive Actions on your Cluster
If you are a Cluster Administrator, and you need to perform a disruptive action on all
the nodes in your cluster, such as a node or system software upgrade, here are some options:
- Accept downtime during the upgrade.
- Failover to another complete replica cluster.
- No downtime, but may be costly both for the duplicated nodes
and for human effort to orchestrate the switchover.
- Write disruption tolerant applications and use PDBs.
- No downtime.
- Minimal resource duplication.
- Allows more automation of cluster administration.
- Writing disruption-tolerant applications is tricky, but the work to tolerate voluntary
disruptions largely overlaps with work to support autoscaling and tolerating
involuntary disruptions.
## {{% heading "whatsnext" %}}
* Follow steps to protect your application by [configuring a Pod Disruption Budget](/docs/tasks/run-application/configure-pdb/).
* Learn more about [draining nodes](/docs/tasks/administer-cluster/safely-drain-node/)
* Learn about [updating a deployment](/docs/concepts/workloads/controllers/deployment/#updating-a-deployment)
including steps to maintain its availability during the rollout.