Each year, larger and larger models get better at extracting signal from the noise in machine learning. Language models in particular seem to get larger every day. These models are computationally expensive in both runtime and memory, which makes them costly to serve to customers and often too slow or too large to run in edge environments like a phone.
Researchers and practitioners have come up with many methods for optimizing neural networks to run faster or use less memory. In this post I’m going to cover some of the state-of-the-art methods. If you know of another method you think should be included, I’m happy to add it. This post has a slight PyTorch bias (haha) because that’s the framework I’m most familiar with.
Source: Deep learning model compression, an article by Rachit Singh.