The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Oct 11, 2022
- Lottery Ticket Hypothesis
A randomly-initialized, dense neural network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
⇒ winning ticket (good initialization) → CAN WIN LOTTERY! (learn faster, reach higher test accuracy)
Introduction
Different pruning methods
- What connectivity structures to prune?
- How to rank weights to prune?
- How often to prune?
- When to perform the pruning step?
Proposed Method
Previous pruning methods
- one-shot (train → prune p% of the weights)
- iterative (each round prunes a fixed percentage of the surviving parameters, retrains, and re-prunes, repeating this n times)
Novelty
- iterative, but after each pruning round the remaining parameters are reset to their initial values, then retrained and re-pruned
- idea: the responsibilities carried by the pruned weights are transferred onto the remaining parameters, so the surviving weights need to be recalibrated (imagine immediately cutting the next-smallest weight: after the earlier cuts it may now carry the load of the removed connections, so it no longer makes sense to cut it without retraining first)
- this pruning procedure finds winning tickets that learn faster than the original network while reaching higher test accuracy and generalizing better
This paper’s pruning method
- fully-connected architecture for MNIST
- convolutional architectures for CIFAR10
- unstructured, magnitude-based, iterative, prune → train (i.e., pruning happens after a round of training and the surviving weights are then retrained)
- structured vs. unstructured pruning (e.g., "prune whole slices along axis 0 of a conv layer" vs. "prune individual weights anywhere in the conv layer")
- local vs. global pruning (see the code sketch after this list)
- prune 20% locally
- e.g., 1000 params in the conv layer and 2000 params in the fc layer → 800 params in the conv layer and 1600 params in the fc layer
- prune 20% globally
- e.g., 1000 + 2000 params → prune the 600 smallest-magnitude params from the whole model, wherever they are (useful when one layer, such as a conv layer, is a parameter bottleneck)
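The difference between local and global pruning is easy to see with PyTorch's built-in pruning utilities. Below is a minimal sketch (the toy layer sizes are made up for illustration, not the paper's architectures): it prunes 20% of the weights per layer (local) and 20% of the weights pooled across layers (global), then counts how many weights each layer actually lost.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def make_model():
    # Toy layer sizes, chosen only for illustration.
    return nn.Sequential(nn.Conv2d(1, 8, 3), nn.Flatten(), nn.Linear(8 * 26 * 26, 10))

# Local pruning: remove the smallest 20% of weights *within each layer separately*.
local_model = make_model()
for m in local_model:
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        prune.l1_unstructured(m, name="weight", amount=0.2)

# Global pruning: remove the smallest 20% of weights *pooled across all layers*,
# so a layer with many small weights can lose far more (or less) than 20%.
global_model = make_model()
params = [(m, "weight") for m in global_model if isinstance(m, (nn.Conv2d, nn.Linear))]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.2)

for tag, model in [("local", local_model), ("global", global_model)]:
    for m in model:
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            zeros = int((m.weight == 0).sum())
            print(f"{tag} {type(m).__name__}: {zeros}/{m.weight.numel()} weights pruned")
```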
Identifying winning tickets (two ways: one-shot & iterative pruning)
1. One-shot pruning
- Randomly initialize a neural network f(x; θ₀) (where θ₀ ∼ D_θ).
- Train the network for j iterations, arriving at parameters θ_j.
- Prune p% of the parameters in θ_j, creating a mask m.
- Reset the remaining parameters to their values in θ₀, creating the winning ticket f(x; m ⊙ θ₀).
2. Iterative Pruning Strategies (to find winning ticket!)
- Strategy 1: Iterative pruning with resetting (this paper's novelty; sketched in code after this list)
  1. initialize the network randomly: θ = θ₀, mask m = 1^|θ| (all ones)
  2. train the network m ⊙ θ for j iterations → parameters θ_j
  3. prune s% of the remaining parameters → update the mask → m′
  4. reset the weights of the remaining portion of the network to their values in θ₀
  5. let m = m′ and repeat steps 2 ~ 4
- Strategy 2: Iterative pruning with continued training
  1. initialize the network randomly: θ = θ₀, mask m = 1^|θ|
  2. train the network for j iterations (without resetting; each round continues from the previous round's trained weights)
  3. prune s% of the parameters → update the mask → m′
  4. let m = m′ and repeat steps 2 ~ 3
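A minimal sketch of Strategy 1 (iterative magnitude pruning with resetting), assuming two helpers that are not shown: make_model builds the network, and train runs j training iterations while keeping pruned weights at zero (e.g. by re-applying the mask after every optimizer step). For simplicity every parameter tensor is pruned at the same rate here, which is not exactly the paper's per-layer setup.

```python
import copy
import torch

def magnitude_mask(weight, old_mask, s):
    """Prune the smallest s% of the weights that still survive under old_mask."""
    surviving = weight[old_mask.bool()].abs()
    k = int(s * surviving.numel())
    if k == 0:
        return old_mask
    threshold = surviving.kthvalue(k).values
    return old_mask * (weight.abs() > threshold).float()

def iterative_prune_with_reset(make_model, train, rounds=5, s=0.2):
    model = make_model()
    theta0 = copy.deepcopy(model.state_dict())                       # remember θ0
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train(model, masks)                                          # train m ⊙ θ for j iterations
        with torch.no_grad():
            for name, param in model.named_parameters():
                masks[name] = magnitude_mask(param, masks[name], s)  # m → m′
            model.load_state_dict(theta0)                            # reset survivors to θ0
            for name, param in model.named_parameters():
                param.mul_(masks[name])                              # keep only m ⊙ θ0
    return model, masks                                              # the winning ticket f(x; m ⊙ θ0)
```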
Experiment
Winning Tickets in Fully-connected Networks (MNIST)
- Figure 4a: unlike winning tickets, the randomly reinitialized networks learn more slowly than the original network and lose test accuracy after only a little pruning
- Figure 3: in fully-connected networks, pruning helps the winning tickets generalize better
Winning Tickets in Convolutional Networks (CIFAR10)
- early-stop iteration: the lower the better → networks reset to the winning-ticket initialization stop earlier than their randomly reinitialized counterparts
- accuracy at early stop: the higher the better → networks reset to the winning-ticket initialization reach higher accuracy than their counterparts
Importance of initialization
(NOTE: it is important to initialize the pruned network to the winning ticket's original initialization θ₀, not to the winning ticket's final trained values)
- Figure 4a: compare Random Reinit (iterative) vs. Winning Ticket (iterative)
Conclusions
- winning tickets learn faster than the original network
- winning tickets reach higher test accuracy than the original network (even with only ~20% of the weights remaining)
Limitations
- these sparse networks cannot be trained from scratch with a fresh random initialization; identifying a winning ticket still requires training the full dense network first
Pruning demo in PyTorch
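A small self-contained sketch using torch.nn.utils.prune; the three-layer fully-connected model, the omitted training loop, and the 20% pruning rate are illustrative assumptions rather than the paper's exact setup. It prunes globally, saves the masks, makes the pruning permanent, and resets the surviving weights to their initial values θ₀ in the lottery-ticket style.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

theta0 = copy.deepcopy(model.state_dict())        # keep θ0 so survivors can be reset later

# ... train the dense model here ...

# Globally prune 20% of the smallest-magnitude weights across all Linear layers.
to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.2)

# Each pruned module now has `weight_orig` and `weight_mask`; `weight` is their product.
masks = {i: m.weight_mask.clone() for i, m in enumerate(model) if isinstance(m, nn.Linear)}

# Make the pruning permanent, reload θ0, and re-apply the masks:
# this is the "reset remaining weights to their initial values" step.
for m, _ in to_prune:
    prune.remove(m, "weight")
model.load_state_dict(theta0)
with torch.no_grad():
    for i, m in enumerate(model):
        if isinstance(m, nn.Linear):
            m.weight.mul_(masks[i])

zeros = sum(int((m.weight == 0).sum()) for m in model if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model if isinstance(m, nn.Linear))
print(f"global sparsity: {zeros}/{total} = {zeros / total:.0%}")
```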