| Name | Last commit message | Last commit date |
| --- | --- | --- |
| attention | [Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel (#16693) | 2025-04-16 03:31:39 -07:00 |
| core | [Attention] MLA with chunked prefill (#12639) | 2025-02-21 15:30:12 -08:00 |
| cpu | [Bugfix] fix gettid method is not define (#16084) | 2025-04-08 19:12:44 -07:00 |
| cutlass_extensions | [Kernel] CUTLASS grouped gemm fp8 MoE kernel (#13972) | 2025-03-27 00:54:44 +00:00 |
| mamba | [BugFix] fix some typos found by typos. (#16314) | 2025-04-09 03:43:59 -07:00 |
| moe | [BugFix] Accuracy fix for llama4 int4 - improperly casted scales (#16801) | 2025-04-17 22:13:29 -07:00 |
| prepare_inputs | [Misc][Easy] Annotate unused vars in the csrc files (#14798) | 2025-03-15 12:40:09 +08:00 |
| quantization | [Kernel] Add expert_map support to Cutlass FP8 MOE (#16861) | 2025-04-21 20:44:32 -07:00 |
| rocm | [Performance][ROCm] Add skinny gemms for unquantized linear on ROCm (#15830) | 2025-04-21 20:46:22 -07:00 |
| sparse/cutlass | [BugFix/Build] Fix sparse kernels not getting built on hopper (#14572) | 2025-03-11 17:09:03 +00:00 |
| activation_kernels.cu | [Kernel] Support MulAndSilu (#11624) | 2025-01-15 02:29:53 +00:00 |
| cache.h | [Attention] MLA with chunked prefill (#12639) | 2025-02-21 15:30:12 -08:00 |
| cache_kernels.cu | [Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros (#14347) | 2025-03-18 05:50:19 -07:00 |
| cuda_compat.h | [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) | 2024-06-02 14:13:26 -07:00 |
| cuda_utils.h | [Attention] MLA with chunked prefill (#12639) | 2025-02-21 15:30:12 -08:00 |
| cuda_utils_kernels.cu | [NVIDIA] Support nvfp4 quantization (#12784) | 2025-02-12 19:51:51 -08:00 |
| cuda_view.cu | [V1] Fully Transparent Implementation of CPU Offloading (#15354) | 2025-03-31 20:22:34 +08:00 |
| cumem_allocator.cpp | [core] improve error handling when wake up from sleep mode (#12981) | 2025-02-10 09:38:57 +08:00 |
| custom_all_reduce.cu | [Distributed] Add custom allreduce support for ROCM (#14125) | 2025-03-31 22:49:12 -07:00 |
| custom_all_reduce.cuh | fix: spelling (#16466) | 2025-04-11 23:24:22 -07:00 |
| custom_all_reduce_test.cu | [Distributed] Add custom allreduce support for ROCM (#14125) | 2025-03-31 22:49:12 -07:00 |
| dispatch_utils.h | dynamic distpatch of fp8 kernels (#14245) | 2025-03-11 10:54:56 -04:00 |
| layernorm_kernels.cu | [torch.compile] Fuse RMSNorm with quant (#9138) | 2024-11-08 21:20:08 +00:00 |
| layernorm_quant_kernels.cu | dynamic distpatch of fp8 kernels (#14245) | 2025-03-11 10:54:56 -04:00 |
| ops.h | [Kernel] support merge_attn_states CUDA kernel, 3x speedup (#16173) | 2025-04-11 06:50:50 -06:00 |
| permute_cols.cu | [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701) | 2024-09-23 13:46:26 -04:00 |
| pos_encoding_kernels.cu | [Kernel] Make rotary_embedding ops more flexible with input shape (#12777) | 2025-02-06 08:46:13 -08:00 |
| torch_bindings.cpp | [Kernel] support merge_attn_states CUDA kernel, 3x speedup (#16173) | 2025-04-11 06:50:50 -06:00 |
| type_convert.cuh | [torch.compile] Fuse RMSNorm with quant (#9138) | 2024-11-08 21:20:08 +00:00 |