drbh
committed on
feat: add bench image and update readme
README.md
CHANGED
|
@@ -25,6 +25,16 @@ oooo ooo .oooo. ooo. .oo. .oo. .ooooo. .ooooo.
|
|
| 25 |
- low memory usage: optimized to handle large batch sizes
|
| 26 |
- reproducibility: easy to reproduce results, no special new `sm` requirements
|
| 27 |
|
| 28 |
|
| 29 |
### How to use
|
| 30 |
|
|
@@ -117,4 +127,4 @@ peak_mem_mb = torch.cuda.max_memory_allocated() / (1024 * 1024)
|
|
| 117 |
print(f"Output: sum={output.sum().item():.1f}, min={output.min().item():.1f}, max={output.max().item():.1f}")
|
| 118 |
print(f"First 3: {output.view(-1)[:3].tolist()}")
|
| 119 |
print(f"Time: {elapsed_ms:.1f}ms, Memory: {peak_mem_mb:.0f}MB")
|
| 120 |
-
```
|
|
|
|
| 25 |
- low memory usage: optimized to handle large batch sizes
|
| 26 |
- reproducibility: easy to reproduce results, no special new `sm` requirements
|
| 27 |
|
| 28 |
+
### Performance
|
| 29 |
+
|
| 30 |
+
`yamoe` scales well as batch size increases compared to the naive method of repeating the data and computation for every item in the batch, as shown in the reference implementation in [torch-ext/yamoe/reference.py](torch-ext/yamoe/reference.py). The benchmark can be reproduced by running `uv run perf_plot.py`; a smaller benchmark and correctness comparison can be run with `uv run compare_example.py`.
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
TL;DR: lower is better in the first two rows of charts.
|
| 34 |
+
|
| 35 |
+
<img width="3583" height="2358" alt="moe_performance_comparison" src="https://github.com/user-attachments/assets/72938f64-ec05-4eaa-82c4-507a43891543" />
|
| 36 |
+
|
| 37 |
+
|
| 38 |
|
| 39 |
### How to use
|
| 40 |
|
|
|
|
| 127 |
print(f"Output: sum={output.sum().item():.1f}, min={output.min().item():.1f}, max={output.max().item():.1f}")
|
| 128 |
print(f"First 3: {output.view(-1)[:3].tolist()}")
|
| 129 |
print(f"Time: {elapsed_ms:.1f}ms, Memory: {peak_mem_mb:.0f}MB")
|
| 130 |
+
```
|