Microsoft Improves Transformer Stability to Successfully Scale Extremely Deep Models to 1000 Layers



Source: syncedreview.com

A Microsoft research team proposes DeepNorm, a novel normalization function that improves the training stability of transformers, enabling them to be scaled an order of magnitude deeper (more than 1,000 layers) than previous deep transformers.
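The core idea can be sketched as a modified residual connection: each sub-layer's input is up-weighted by a constant before layer normalization. The sketch below uses NumPy and the encoder-only constant alpha = (2N)^(1/4) reported in the DeepNet paper; the function names and the simplified LayerNorm (no learned scale/shift) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified LayerNorm over the last axis (no learned gain/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def deepnorm_residual(x, sublayer_out, num_layers):
    # DeepNorm update: x_{l+1} = LN(alpha * x_l + G(x_l)), where G is the
    # attention or feed-forward sub-layer. alpha = (2N)^(1/4) is the
    # encoder-only setting from the DeepNet paper; other architectures
    # (e.g. encoder-decoder) use different alpha/beta constants.
    alpha = (2 * num_layers) ** 0.25
    return layer_norm(alpha * x + sublayer_out)

# Toy usage: random activations standing in for a sub-layer's input/output.
x = np.random.randn(2, 4, 8)
g_x = np.random.randn(2, 4, 8)
out = deepnorm_residual(x, g_x, num_layers=1000)
```

The paper additionally scales the initialization of the sub-layer weights by a companion constant beta, so that the residual branch's contribution stays bounded even at 1,000 layers.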