xjdr on X: "# Why Training MoEs is So Hard recently, i have found myself wanting a small, research focused training r...
xjdr on why training MoEs under 20B parameters is hard: FLOP efficiency, load balancing and router stability, and data quality and quantity.
