Most data scientists soon realize that deep learning models can be unwieldy and often impractical to run on smaller devices without major modification. Our friends over at deeplearning.ai recently reported that a group of researchers at Facebook AI Research has developed a new technique to compress neural networks with minimal loss in accuracy.
Building on earlier work, the team coaxed networks to learn smaller layer representations. Rather than storing weights directly, the method stores approximate values that stand in for groups of weights. Essentially, the researchers adapted an existing data-compression method, product quantization, to learn viable weight approximations. The results are detailed in the paper "And the Bit Goes Down: Revisiting the Quantization of Neural Networks" by P. Stock et al.
The method works by representing groups of similar weights with a single value, so the network stores only that value and pointers to it. This reduces the storage needed for the weights in a given layer. The network learns an optimal set of values for groups of weights, or subvectors, in a layer by minimizing the difference between the layer outputs of the original and compressed networks.
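As a rough illustration of the codebook-and-pointers idea (not the paper's implementation), the sketch below product-quantizes a weight matrix with plain k-means: each column is sliced into subvectors of length `d`, and every subvector is replaced by an index into a small shared codebook. Function names and parameter values here are illustrative assumptions.

```python
import numpy as np

def product_quantize(W, d=4, k=8, iters=20, seed=0):
    """Illustrative product quantization of a weight matrix.

    Each column of W is split into subvectors of length d; plain k-means
    (Lloyd's algorithm) learns a shared codebook of k centroids, and every
    subvector is then stored as a single index into that codebook.
    """
    rng = np.random.default_rng(seed)
    rows, cols = W.shape
    assert rows % d == 0, "rows must be divisible by the subvector length"
    # Slice each column into length-d subvectors, one per row of `sub`.
    sub = W.reshape(rows // d, d, cols).transpose(0, 2, 1).reshape(-1, d)
    # Initialise centroids from a random subset of subvectors.
    C = sub[rng.choice(len(sub), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((sub[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)                 # nearest codeword per subvector
        for j in range(k):                       # recompute centroids
            members = sub[assign == j]
            if len(members):
                C[j] = members.mean(0)
    # Final assignment against the updated codebook.
    assign = ((sub[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    return C, assign

def reconstruct(C, assign, shape, d=4):
    """Rebuild the approximate weight matrix from codebook + indices."""
    rows, cols = shape
    sub = C[assign].reshape(rows // d, cols, d).transpose(0, 2, 1)
    return sub.reshape(rows, cols)
```

Storing the codebook `C` (k × d floats) plus one small integer index per subvector takes far less space than the original rows × cols floats, which is where the compression comes from.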
For fully connected layers, the authors group the weights into subvectors. (They propose a similar but more involved process for convolutional layers.) They pick a random subset of subvectors as starting values, then iteratively improve the values, layer by layer, to minimize the difference between the compressed and original networks. Finally, they optimize the compressed network representation against multiple layers at a time, starting with the first two and ultimately encompassing the entire network.
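The key twist in the layer-by-layer step is that the objective is the error in the layer's *output*, not in the weights themselves. The sketch below illustrates that idea under simplifying assumptions of my own (whole columns as subvectors, a batch of sample inputs standing in for real activations; names and settings are hypothetical): the assignment step picks, for each weight column, the codeword that least perturbs `X @ W`, rather than the nearest codeword in weight space.

```python
import numpy as np

def output_aware_quantize(W, X, k=8, iters=15, seed=0):
    """Quantize the columns of W while preserving the layer output X @ W.

    Simplified sketch: subvectors are whole columns, and the assignment
    metric is the output-space error ||X (w - c)||^2 rather than the
    plain weight-space distance ||w - c||^2.
    """
    rng = np.random.default_rng(seed)
    cols = W.shape[1]
    C = W[:, rng.choice(cols, size=k, replace=False)].copy()  # init codebook
    G = X.T @ X   # Gram matrix: ||X(w - c)||^2 == (w - c)^T G (w - c)
    for _ in range(iters):
        # E-step: assign each column to the codeword with least output error.
        diff = W[:, :, None] - C[:, None, :]              # (rows, cols, k)
        errs = np.einsum('rck,rs,sck->ck', diff, G, diff)
        assign = errs.argmin(axis=1)
        # M-step: each codeword becomes the mean of its assigned columns.
        for j in range(k):
            members = W[:, assign == j]
            if members.shape[1]:
                C[:, j] = members.mean(axis=1)
    # Final assignment against the updated codebook.
    diff = W[:, :, None] - C[:, None, :]
    assign = np.einsum('rck,rs,sck->ck', diff, G, diff).argmin(axis=1)
    return C, assign  # the layer is then approximated by C[:, assign]
```

Measuring error through the Gram matrix `X.T @ X` means codewords that matter more to the layer's output are matched more carefully, which is the intuition behind optimizing against layer outputs rather than raw weights.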
The team achieves the best top-1 accuracy on ImageNet for model sizes of 5MB and 10MB, and competitive accuracy for 1MB models. They also show that their quantization method outperforms previous methods on ResNet-18.
Typically, deep learning researchers establish the best model for a given task, and follow-up studies find new architectures that deliver similar performance using less memory. This work offers a way to compress an existing architecture, potentially taking any model from groundbreaking results in the lab to widespread distribution in the field with minimal degradation in performance.
The authors demonstrate their method only on architectures with fully connected and convolutional layers. Further research will be required to find its limits, and also to optimize the compressed models for compute speed.
The ability to compress top-performing models could put state-of-the-art AI in the palm of your hand and eventually in your pacemaker.