drbh committed on
Commit
0daa7ef
·
unverified ·
1 Parent(s): a9b8fe6

feat: add bench image and update readme

  1. README.md +11 -1
README.md CHANGED
@@ -25,6 +25,16 @@ oooo ooo .oooo. ooo. .oo. .oo. .ooooo. .ooooo.
25
  - low memory usage: optimized to handle large batch sizes
26
  - reproducibility: easy to reproduce results, no special new `sm` requirements
27
 
28
 
29
  ### How to use
30
 
@@ -117,4 +127,4 @@ peak_mem_mb = torch.cuda.max_memory_allocated() / (1024 * 1024)
117
  print(f"Output: sum={output.sum().item():.1f}, min={output.min().item():.1f}, max={output.max().item():.1f}")
118
  print(f"First 3: {output.view(-1)[:3].tolist()}")
119
  print(f"Time: {elapsed_ms:.1f}ms, Memory: {peak_mem_mb:.0f}MB")
120
- ```
 
25
  - low memory usage: optimized to handle large batch sizes
26
  - reproducibility: easy to reproduce results, no special new `sm` requirements
27
 
28
+ ### Performance
29
+
30
+ `yamoe` scales well as batch size increases compared to the naive method, which repeats the data and computation for every item in the batch (see the reference in [torch-ext/yamoe/reference.py](torch-ext/yamoe/reference.py)). The benchmark can be reproduced by running `uv run perf_plot.py`; a smaller benchmark and correctness comparison can be run with `uv run compare_example.py`.
31
+
32
+
33
+ TL;DR: lower is better in the first two rows of charts.
34
+
35
+ <img width="3583" height="2358" alt="moe_performance_comparison" src="https://github.com/user-attachments/assets/72938f64-ec05-4eaa-82c4-507a43891543" />
36
+
37
+
38
 
39
  ### How to use
40
 
 
127
  print(f"Output: sum={output.sum().item():.1f}, min={output.min().item():.1f}, max={output.max().item():.1f}")
128
  print(f"First 3: {output.view(-1)[:3].tolist()}")
129
  print(f"Time: {elapsed_ms:.1f}ms, Memory: {peak_mem_mb:.0f}MB")
130
+ ```
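
The README snippet above reports a single time and peak-memory figure, while the new Performance section describes scaling across batch sizes. As a rough illustration only, the sketch below shows one way such a sweep could be structured. It is not the contents of `perf_plot.py`: `bench`, `make_input`, and `toy_workload` are hypothetical names, and it uses only the Python standard library (`time.perf_counter` and `tracemalloc`) so that it runs without a GPU. On CUDA one would presumably call `torch.cuda.synchronize()` before reading the clock and use `torch.cuda.max_memory_allocated()`, as the README does, for peak memory.

```python
import time
import tracemalloc


def bench(fn, batch_sizes, make_input, repeats=3):
    """Run fn on inputs of increasing batch size; return one result row per size.

    Hypothetical helper for illustration; not part of yamoe.
    """
    rows = []
    for batch_size in batch_sizes:
        x = make_input(batch_size)
        fn(x)  # warm-up call, excluded from the timing below

        # Timing pass: keep the best of `repeats` runs, in milliseconds.
        best_ms = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(x)
            best_ms = min(best_ms, (time.perf_counter() - start) * 1000.0)

        # Separate memory pass, so tracing overhead does not distort the timing.
        tracemalloc.start()
        fn(x)
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()

        rows.append(
            {
                "batch_size": batch_size,
                "time_ms": best_ms,
                "peak_mem_mb": peak_bytes / (1024 * 1024),
            }
        )
    return rows


def make_input(batch_size, width=64):
    """Build a toy batch: batch_size items, each a list of `width` floats."""
    return [[float(i + j) for j in range(width)] for i in range(batch_size)]


def toy_workload(batch):
    """Stand-in for the function under test (assumption, not a MoE layer)."""
    return [sum(v * v for v in item) for item in batch]


if __name__ == "__main__":
    for row in bench(toy_workload, [1, 8, 64], make_input):
        print(
            f"batch={row['batch_size']}: "
            f"Time: {row['time_ms']:.3f}ms, Memory: {row['peak_mem_mb']:.3f}MB"
        )
```

Running the same sweep over two implementations and plotting `time_ms` and `peak_mem_mb` against `batch_size` would give curves where lower is better, which is how the first two rows of the chart are meant to be read.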