docs: Documentation Deployment Best Practices (#126)

Signed-off-by: zhaoxinxin <1186037180@qq.com>
This commit is contained in:
Zhaoxinxin 2024-07-12 11:27:29 +08:00 committed by GitHub
parent 08ee4b95df
commit cefead2d17
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
11 changed files with 395 additions and 77 deletions

View File

@@ -26,39 +26,6 @@ change log level to info
> The change-log-level event is printed to stdout and the `core.log` file, but if the level is set higher than `info`, it is printed to stdout only.
## Download slower than without Dragonfly {#download-slower-than-without-dragonfly}
**1.** Confirm the rate limit.
Set `download.rateLimit` and `upload.rateLimit` in the [dfdaemon.yaml](./reference/configuration/client/dfdaemon.md)
configuration file.
**2.** Increase the number of concurrent pieces.
Set `download.concurrentPieceCount` in the [dfdaemon.yaml](./reference/configuration/client/dfdaemon.md)
configuration file.
```yaml
download:
  server:
    # -- socketPath is the unix socket path for dfdaemon GRPC service.
    socketPath: /var/run/dragonfly/dfdaemon.sock
  # -- rateLimit is the default rate limit of the download speed in bps(bytes per second), default is 20Gbps.
  rateLimit: 20000000000
  # -- pieceTimeout is the timeout for downloading a piece from source.
  pieceTimeout: 30s
  # -- concurrentPieceCount is the number of concurrent pieces to download.
  concurrentPieceCount: 10

upload:
  server:
    # -- port is the port to the grpc server.
    port: 4000
    ## ip is the listen ip of the grpc server.
    # ip: ""
  # -- rateLimit is the default rate limit of the upload speed in bps(bytes per second), default is 20Gbps.
  rateLimit: 20000000000
```
## 500 Internal Server Error {#500-internal-server-error}
**1.** Check the error logs in `/var/log/dragonfly/dfdaemon/`.

View File

@@ -4,4 +4,208 @@ title: Deployment Best Practices
slug: /operations/best-practices/deployment-best-practices/
---
## TODO
Documentation for capacity planning and performance tuning of Dragonfly.
## Capacity Planning
A big factor in planning capacity is the highest expected storage capacity.
You should also know the memory size, CPU core count, and disk capacity of each machine.
If you don't have a capacity plan, you can use the estimates below to predict your capacity.
### Manager
The resources required to deploy the Manager depend on the total number of Peers.
> Run a minimum of 3 replicas.
<!-- markdownlint-disable -->
| Total Number of Peers | CPU | Memory | Disk |
| --------------------- | --- | ------ | ----- |
| 1K | 8C | 16G | 200Gi |
| 5K | 16C | 32G | 200Gi |
| 10K | 16C | 64G | 200Gi |
<!-- markdownlint-restore -->
### Scheduler
The resources required to deploy the Scheduler depend on the number of requests per second.
> Run a minimum of 3 replicas.
<!-- markdownlint-disable -->
| Request Per Second | CPU | Memory | Disk |
| ------------------ | --- | ------ | ----- |
| 1K | 8C | 16G | 200Gi |
| 3K | 16C | 32G | 200Gi |
| 5K | 32C | 64G | 200Gi |
<!-- markdownlint-restore -->
### Client
<!-- markdownlint-disable -->
The resources required to deploy the Client depend on the number of requests per second.
> If it is a Seed Peer, run a minimum of 3 replicas. Disk size is calculated based on file storage capacity.
| Request Per Second | CPU | Memory | Disk |
| ------------------ | --- | ------ | ----- |
| 500 | 8C | 16G | 500Gi |
| 1K | 8C | 16G | 3Ti |
| 3K | 16C | 32G | 5Ti |
| 5K | 32C | 64G | 10Ti |
<!-- markdownlint-restore -->
### Cluster
The resources required to deploy each service in a P2P cluster depend on the total number of Peers.
<!-- markdownlint-disable -->
| Total Number of Peers | Manager | Scheduler | Seed Peer | Peer |
| --------------------- | ------------------ | ------------------ | ----------------- | ----------- |
| 500 | 4C/8G/200Gi \* 3 | 8C/16G/200Gi \* 3 | 8C/16G/1Ti \* 3 | 4C/8G/500Gi |
| 1K | 8C/16G/200Gi \* 3 | 8C/16G/200Gi \* 3 | 8C/16G/3Ti \* 3 | 4C/8G/500Gi |
| 3K | 16C/32G/200Gi \* 3 | 16C/32G/200Gi \* 3 | 16C/32G/5Ti \* 3 | 4C/8G/500Gi |
| 5K | 16C/64G/200Gi \* 3 | 32C/64G/200Gi \* 3 | 32C/64G/10Ti \* 3 | 4C/8G/500Gi |
<!-- markdownlint-restore -->
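For Kubernetes deployments, a row of the sizing tables above translates into Helm chart resource requests. Below is a minimal sketch for the 1K-peers row; the `manager`/`scheduler`/`seedPeer` value keys are assumptions based on the dragonflyoss Helm chart and may differ in your chart version, so treat them as placeholders.
```yaml
# values.yaml sketch for roughly 1K peers (see the Cluster table above).
# Key names are assumptions; verify them against your Helm chart version.
manager:
  replicas: 3
  resources:
    requests:
      cpu: "8"
      memory: 16Gi
scheduler:
  replicas: 3
  resources:
    requests:
      cpu: "8"
      memory: 16Gi
seedPeer:
  replicas: 3
  resources:
    requests:
      cpu: "8"
      memory: 16Gi
```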
## Performance tuning
The following documentation may help you achieve better performance, especially for large-scale runs.
### Rate limits
#### Outbound Bandwidth
This is the bandwidth a node uses to share pieces with other peers over P2P.
If the peak bandwidth is greater than the default outbound bandwidth,
you can set `rateLimit` higher to increase the upload speed.
It is recommended to set it equal to the inbound bandwidth of the machine.
Please refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
upload:
  # -- rateLimit is the default rate limit of the upload speed in bps(bytes per second), default is 20Gbps.
  rateLimit: 20000000000
```
#### Inbound Bandwidth
This is the bandwidth a node uses for back-to-source downloads and for downloading from remote peers.
If the peak bandwidth is greater than the default inbound bandwidth,
`rateLimit` can be set higher to increase the download speed.
It is recommended to set it equal to the outbound bandwidth of the machine.
Please refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
download:
  # -- rateLimit is the default rate limit of the download speed in bps(bytes per second), default is 20Gbps.
  rateLimit: 20000000000
```
### Concurrency control
This controls, for a single task on a node, how many pieces are downloaded
concurrently, both back-to-source and from remote peers.
The higher the piece concurrency, the faster the task downloads, but the more CPU and memory are consumed.
Adjust the piece concurrency to your actual workload,
and adjust the client's CPU and memory configuration accordingly.
Please refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
download:
  # -- concurrentPieceCount is the number of concurrent pieces to download.
  concurrentPieceCount: 10
```
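For example, a client with spare CPU and memory might raise the concurrency. The value below is illustrative, not a recommendation; watch the client's resource usage as you increase it.
```yaml
download:
  # Doubling the default piece concurrency speeds up large tasks, at the
  # cost of roughly proportionally more CPU and memory per task.
  concurrentPieceCount: 20
```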
### GC
This controls GC of the task cache on a node's disk; estimate `taskTTL` from how long tasks need to stay cached.
To avoid cases where GC would be problematic or potentially catastrophic,
it is recommended to keep the default values of `distHighThresholdPercent` and `distLowThresholdPercent`.
Please refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
gc:
  # interval is the interval to do gc.
  interval: 900s
  policy:
    # taskTTL is the ttl of the task.
    taskTTL: 21600s
    # distHighThresholdPercent is the high threshold percent of the disk usage.
    # If the disk usage is greater than the threshold, dfdaemon will do gc.
    distHighThresholdPercent: 80
    # distLowThresholdPercent is the low threshold percent of the disk usage.
    # If the disk usage is less than the threshold, dfdaemon will stop gc.
    distLowThresholdPercent: 60
```
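If your workload re-reads the same tasks over a longer window, `taskTTL` is the knob to tune; a sketch that keeps tasks for 12 hours while leaving the disk thresholds at their defaults (the value is illustrative, and the disk sizing above must allow it):
```yaml
gc:
  interval: 900s
  policy:
    # Keep tasks cached for 12 hours instead of the default 6 hours (21600s).
    taskTTL: 43200s
```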
### Nydus
When Nydus downloads a file, it splits the file into chunks of about 1MB and loads them on demand.
Using the Seed Peer HTTP proxy as a cache service for Nydus leverages P2P transmission
to reduce back-to-source requests and back-to-source traffic and to improve download speed.
When Dragonfly is used as a cache service for Nydus, only the Manager, Scheduler, and Seed Peer
need to be deployed, and the configuration needs to be optimized.
**1.** `proxy.rules.regex` matches the Nydus repository URL,
intercepts download traffic, and forwards it to the P2P network.
Please refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
proxy:
  # rules is the list of rules for the proxy server.
  # regex is the regex of the request url.
  # useTLS indicates whether use tls for the proxy backend.
  # redirect is the redirect url.
  # filteredQueryParams is the filtered query params to generate the task id.
  # When filter is ["Signature", "Expires", "ns"], for example:
  # http://example.com/xyz?Expires=e1&Signature=s1&ns=docker.io and http://example.com/xyz?Expires=e2&Signature=s2&ns=docker.io
  # will generate the same task id.
  # Default value includes the filtered query params of s3, gcs, oss, obs, cos.
  rules:
    - regex: 'blobs/sha256.*'
      # useTLS: false
      # redirect: ""
      # filteredQueryParams: []
```
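If you only want Dragonfly to cache a specific registry, the rule can be narrowed to that host. A sketch using the hypothetical registry host `registry.example.com`:
```yaml
proxy:
  rules:
    # Only intercept blob downloads from the hypothetical host below;
    # all other traffic bypasses the P2P network.
    - regex: 'registry\.example\.com.*blobs/sha256.*'
```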
**2.** Change the `Seed Peer Load Limit` to 10000 or higher to improve the P2P cache hit rate between Seed Peers.
Click the `UPDATE CLUSTER` button and change the `Seed Peer Load Limit` to 10000.
Please refer to [update-cluster](https://d7y.io/docs/next/advanced-guides/web-console/cluster/#update-cluster).
![update-cluster](../../resource/operations/best-practices/deployment-best-practices/update-cluster.png)
The `Seed Peer Load Limit` has been changed successfully.
![cluster](../../resource/operations/best-practices/deployment-best-practices/cluster.png)
**3.** Nydus will initiate an HTTP range request of about 1MB to achieve on-demand loading.
When prefetch is enabled, the Seed Peer can prefetch the complete resource after receiving an HTTP range request,
improving the cache hit rate.
Please refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
proxy:
  # prefetch pre-downloads the full task when a download uses a range request.
  prefetch: true
```
**4.** When the download speed is slow,
you can increase the proxy's `readBufferSize` to 64KB to reduce the proxy request time.
Please refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
proxy:
  # -- readBufferSize is the buffer size for reading piece from disk, default is 32KB.
  readBufferSize: 65536
```

View File

@@ -100,7 +100,7 @@ The content storage address of Git LFS is `github-cloud.githubusercontent.com`.
#### Dragonfly Kubernetes Cluster Setup {#dragonfly-kubernetes-cluster-setup}
For detailed installation documentation based on kubernetes cluster, please refer to [quick-start-kubernetes](https://d7y.io/docs/next/getting-started/quick-start/kubernetes/).
For detailed installation documentation based on kubernetes cluster, please refer to [quick-start-kubernetes](../../getting-started/quick-start/kubernetes.md).
##### Setup kubernetes cluster

Binary file not shown.

After

Width:  |  Height:  |  Size: 124 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 110 KiB

View File

@@ -36,10 +36,10 @@ Document:
AI Infrastructure:
- Supports Triton Inference Server to accelerate model distribution, refer to [dragonfly-repository-agent](https://github.com/dragonflyoss/dragonfly-repository-agent).
- Supports TorchServe to accelerate model distribution, refer to [document](https://d7y.io/docs/next/setup/integration/torchserve).
- Supports HuggingFace to accelerate model distribution and dataset distribution, refer to [document](https://d7y.io/docs/next/setup/integration/hugging-face).
- Supports Git LFS to accelerate file distribution, refer to [document](https://d7y.io/docs/next/setup/integration/git-lfs).
- Supports Triton Inference Server to accelerate model distribution, refer to [dragonfly-repository-agent](../operations/integrations/triton-server.md).
- Supports TorchServe to accelerate model distribution, refer to [document](../operations/integrations/torchserve.md).
- Supports HuggingFace to accelerate model distribution and dataset distribution, refer to [document](../operations/integrations/hugging-face.md).
- Supports Git LFS to accelerate file distribution, refer to [document](../operations/integrations/git-lfs.md).
- Supports JuiceFS to accelerate file downloads from object storage; JuiceFS routes read requests
  through the peer proxy and write requests through the default client of object storage.
- Supports Fluid to accelerate model distribution.

View File

@@ -26,39 +26,6 @@ change log level to info
> The change-log-level event is recorded in stdout and the `core.log` file, but if the level is set higher than `info`, it appears in stdout only.
## Download slower than without Dragonfly
**1.** Confirm the rate limit is appropriate.
Set `download.rateLimit` and `upload.rateLimit` in the [dfdaemon.yaml](./reference/configuration/client/dfdaemon.md)
configuration file.
**2.** Increase the piece concurrency.
Set `download.concurrentPieceCount` in the [dfdaemon.yaml](./reference/configuration/client/dfdaemon.md)
configuration file.
```yaml
download:
  server:
    # socketPath is the unix socket path for the dfdaemon GRPC service.
    socketPath: /var/run/dragonfly/dfdaemon.sock
  # rateLimit is the default rate limit of the download speed in bps (bytes per second), default is 20Gbps.
  rateLimit: 20000000000
  # pieceTimeout is the timeout for downloading a piece from source.
  pieceTimeout: 30s
  # concurrentPieceCount is the number of concurrent pieces to download.
  concurrentPieceCount: 10

upload:
  server:
    # port is the port of the grpc server.
    port: 4000
    ## ip is the listen ip of the grpc server.
    # ip: ""
  # rateLimit is the default rate limit of the upload speed in bps (bytes per second), default is 20Gbps.
  rateLimit: 20000000000
```
## 500 Internal Server Error
**1.** Check the error logs in `/var/log/dragonfly/dfdaemon/`.

View File

@@ -4,4 +4,184 @@ title: Deployment
slug: /operations/best-practices/deployment-best-practices/
---
## TODO
This document outlines how we plan to set up capacity planning and performance tuning for Dragonfly.
## Capacity Planning
An important factor in planning capacity is the highest expected storage capacity, and you need a clear picture of the memory size, CPU core count, and disk capacity of each of your machines.
If you do not have a concrete capacity plan, you can use the estimates below to predict your capacity.
### Manager
The resources required to deploy the Manager are estimated from the total number of Peers.
> Running at least 3 replicas is recommended.
<!-- markdownlint-disable -->
| Total Number of Peers | CPU | Memory | Disk |
| --------------------- | --- | ------ | ----- |
| 1K | 8C | 16G | 200Gi |
| 5K | 16C | 32G | 200Gi |
| 10K | 16C | 64G | 200Gi |
<!-- markdownlint-restore -->
### Scheduler
The resources required to deploy the Scheduler are estimated from the number of download requests per second.
> Running at least 3 replicas is recommended.
<!-- markdownlint-disable -->
| Request Per Second | CPU | Memory | Disk |
| ------------------ | --- | ------ | ----- |
| 1K | 8C | 16G | 200Gi |
| 3K | 16C | 32G | 200Gi |
| 5K | 32C | 64G | 200Gi |
<!-- markdownlint-restore -->
### Client
The resources required to deploy the Client are estimated from the number of download requests per second.
> If it is a Seed Peer, running at least 3 replicas is recommended. Disk size is calculated based on file storage capacity.
<!-- markdownlint-disable -->
| Request Per Second | CPU | Memory | Disk |
| ------------------ | --- | ------ | ----- |
| 500 | 8C | 16G | 500Gi |
| 1K | 8C | 16G | 3Ti |
| 3K | 16C | 32G | 5Ti |
| 5K | 32C | 64G | 10Ti |
<!-- markdownlint-restore -->
### Cluster
The resources required by each service in a P2P cluster are estimated from the total number of Peers.
<!-- markdownlint-disable -->
| Total Number of Peers | Manager | Scheduler | Seed Peer | Peer |
| --------------------- | ------------------ | ------------------ | ----------------- | ----------- |
| 500 | 4C/8G/200Gi \* 3 | 8C/16G/200Gi \* 3 | 8C/16G/1Ti \* 3 | 4C/8G/500Gi |
| 1K | 8C/16G/200Gi \* 3 | 8C/16G/200Gi \* 3 | 8C/16G/3Ti \* 3 | 4C/8G/500Gi |
| 3K | 16C/32G/200Gi \* 3 | 16C/32G/200Gi \* 3 | 16C/32G/5Ti \* 3 | 4C/8G/500Gi |
| 5K | 16C/64G/200Gi \* 3 | 32C/64G/200Gi \* 3 | 32C/64G/10Ti \* 3 | 4C/8G/500Gi |
<!-- markdownlint-restore -->
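For Kubernetes deployments, the 1K-peers row above can be sketched as Helm chart resource requests; the `manager`/`scheduler`/`seedPeer` value keys are assumptions based on the dragonflyoss Helm chart, so verify them against your chart version.
```yaml
# values.yaml sketch for roughly 1K peers (key names are assumptions).
manager:
  replicas: 3
  resources:
    requests:
      cpu: "8"
      memory: 16Gi
scheduler:
  replicas: 3
  resources:
    requests:
      cpu: "8"
      memory: 16Gi
seedPeer:
  replicas: 3
  resources:
    requests:
      cpu: "8"
      memory: 16Gi
```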
## Performance tuning
The following may help you achieve better performance, especially for large-scale runs.
### Rate limits
#### Outbound Bandwidth
This is mainly the bandwidth a node uses to share pieces over P2P. If the peak bandwidth is greater than the default outbound bandwidth, you can set `rateLimit` higher to increase the upload speed. Provided other services are not affected, it is recommended to set it equal to the machine's inbound bandwidth; for details, refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
upload:
  # rateLimit is the default rate limit of the upload speed in bps (bytes per second), default is 20Gbps.
  rateLimit: 20000000000
```
#### Inbound Bandwidth
This is mainly the bandwidth a node uses for back-to-source downloads and for downloading from remote Peers. If the peak bandwidth is greater than the default inbound bandwidth, `rateLimit` can be set higher to increase the download speed. Provided other services are not affected, it is recommended to set it equal to the machine's outbound bandwidth; for details, refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
download:
  # rateLimit is the default rate limit of the download speed in bps (bytes per second), default is 20Gbps.
  rateLimit: 20000000000
```
### Concurrency control
This controls, for a single task on a node, how many pieces are downloaded concurrently back-to-source and from remote Peers. The default concurrency is 10 and can be adjusted to the machine's configuration.
The higher the piece concurrency, the faster the task downloads, but the more CPU and memory are consumed. Adjust the piece concurrency to your actual workload,
and adjust the Client's CPU and memory configuration accordingly; for details, refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
download:
  # concurrentPieceCount is the number of concurrent pieces to download.
  concurrentPieceCount: 10
```
### GC
This controls GC of the Task cache on a node's disk; estimate `taskTTL` from how long you actually need tasks cached.
To avoid cases where GC would be problematic or potentially catastrophic, it is recommended to keep the default values of `distHighThresholdPercent` and `distLowThresholdPercent`; for details, refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
gc:
  # interval is the interval to do gc.
  interval: 900s
  policy:
    # taskTTL is the ttl of inactive tasks.
    taskTTL: 21600s
    # distHighThresholdPercent is the high threshold percent of the disk usage.
    # If the disk usage is greater than the threshold, dfdaemon will do gc.
    distHighThresholdPercent: 80
    # distLowThresholdPercent is the low threshold percent of the disk usage.
    # If the disk usage is less than the threshold, dfdaemon will stop gc.
    distLowThresholdPercent: 60
```
### Nydus
When Nydus downloads an image or file, it splits the Task into chunks of about 1MB and loads them on demand.
Using the Seed Peer HTTP Proxy as a cache service for Nydus leverages P2P transmission to reduce back-to-source requests and back-to-source bandwidth, further improving download speed.
When Dragonfly serves as the cache service for Nydus, only the Manager, Scheduler, and Seed Peer need to be deployed, and the configuration needs some optimization.
**1.** `proxy.rules.regex` matches the Nydus repository URL, intercepts download traffic, and forwards it to the P2P network; for details, refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
proxy:
  # rules is the list of rules for the proxy server.
  # regex is the regex of the request url.
  # useTLS indicates whether the proxy backend uses TLS.
  # redirect is the redirect url.
  # filteredQueryParams is the filtered query params used to generate the task id.
  # When the filter is ["Signature", "Expires", "ns"], for example:
  # http://example.com/xyz?Expires=e1&Signature=s1&ns=docker.io and http://example.com/xyz?Expires=e2&Signature=s2&ns=docker.io
  # will generate the same task id.
  # The default value includes the filtered query params of s3, gcs, oss, obs, cos.
  rules:
    - regex: 'blobs/sha256.*'
      # useTLS: false
      # redirect: ""
      # filteredQueryParams: []
```
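To cache only a specific registry, the rule can be narrowed to that host; a sketch using the hypothetical host `registry.example.com`:
```yaml
proxy:
  rules:
    # Only intercept blob downloads from the hypothetical host below;
    # all other traffic bypasses the P2P network.
    - regex: 'registry\.example\.com.*blobs/sha256.*'
```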
**2.** It is recommended to change the `Seed Peer Load Limit` to 10000 or higher to improve the P2P cache hit rate between Seed Peers.
Click the `UPDATE CLUSTER` button and change the `Seed Peer Load Limit` to 10000; for details, refer to [update-cluster](https://d7y.io/docs/next/advanced-guides/web-console/cluster/#update-cluster).
![update-cluster](../../resource/operations/best-practices/deployment-best-practices/update-cluster.png)
The `Seed Peer Load Limit` has been changed successfully.
![cluster](../../resource/operations/best-practices/deployment-best-practices/cluster.png)
**3.** Nydus initiates HTTP Range requests of about 1MB to load data on demand. With prefetch enabled, the Seed Peer can prefetch the complete resource after receiving an HTTP Range request,
improving the cache hit rate; for details, refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
proxy:
  # When a request uses a Range header to download partial data, the data outside the range can be prefetched in advance.
  prefetch: true
```
**4.** When the download speed is slow, you can increase the Proxy's `readBufferSize` to 64KB to reduce the Proxy request time; refer to [dfdaemon.yaml](../../reference/configuration/client/dfdaemon.md).
```yaml
proxy:
  # readBufferSize is the buffer size for reading pieces from disk, default is 32KB.
  readBufferSize: 65536
```

View File

@@ -35,10 +35,10 @@ Document:
AI Infrastructure:
- Supports Triton Inference Server to accelerate model distribution, refer to [Triton Inference Server](https://github.com/dragonflyoss/dragonfly-repository-agent).
- Supports TorchServe to accelerate model distribution, refer to [TorchServe](https://d7y.io/docs/next/setup/integration/torchserve).
- Supports HuggingFace to accelerate model and dataset distribution, refer to [Hugging Face](https://d7y.io/docs/next/setup/integration/hugging-face).
- Supports Git LFS to accelerate file distribution, refer to [Git LFS](https://d7y.io/docs/next/setup/integration/git-lfs).
- Supports Triton Inference Server to accelerate model distribution, refer to [Triton Inference Server](../operations/integrations/triton-server.md).
- Supports TorchServe to accelerate model distribution, refer to [TorchServe](../operations/integrations/torchserve.md).
- Supports HuggingFace to accelerate model and dataset distribution, refer to [Hugging Face](../operations/integrations/hugging-face.md).
- Supports Git LFS to accelerate file distribution, refer to [Git LFS](../operations/integrations/git-lfs.md).
- Supports JuiceFS to accelerate file distribution; Dragonfly acts as the cache layer between JuiceFS and object storage.
- Supports Fluid to accelerate model distribution.
- Supports more AI infrastructure for efficient distribution of models and datasets, integrating with the AI ecosystem.