The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
Inc Lomin
Oct 11, 2022
  • Lottery Ticket Hypothesis
    • A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective
      ⇒ a winning ticket (a lucky initialization) has won the lottery: it learns faster and reaches higher test accuracy than the dense network (formalized just below)
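Stated a bit more formally (a paraphrase of the paper's notation, where θ_0 is the dense network's initialization, m a binary mask, and ⊙ element-wise multiplication):

```latex
% Dense network f(x; \theta_0), \theta_0 \sim \mathcal{D}_\theta, reaches test accuracy a after j training iterations.
% The hypothesis claims a sparse subnetwork exists that does at least as well, at least as fast:
\exists\, m \in \{0,1\}^{|\theta|} : \quad j' \le j, \quad a' \ge a, \quad \|m\|_0 \ll |\theta|
% where j' and a' are the training iterations and test accuracy of the subnetwork f(x; m \odot \theta_0).
```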
 

Introduction

Different pruning methods

  • What connectivity structures to prune?
  • How to rank weights to prune?
  • How often to prune?
  • When to perform the pruning step?
 

Proposed Method

Previous iterative pruning methods

  • one-shot (train → prune p% of the weights in a single pass)
  • iterative (each round prunes a fraction of the remaining parameters, retrains, and prunes again, repeated n times)

Novelty

  • iterative, but the remaining parameters are reset to their initial values before each retraining round, then re-pruned (see the one-shot and iterative sketches below)
    • idea: the work that the pruned weights used to do is now carried by the remaining parameters, so the remaining weights need to be recalibrated (e.g. if pruning the next-smallest weight would cut a needed connection entirely, it no longer makes sense to remove it without retraining first)
  • this pruning procedure finds winning tickets that learn faster than the original network while reaching higher test accuracy and generalizing better

This paper’s pruning method

notion image
  • fully-connected architecture for MNIST
  • convolutional architectures for CIFAR10
  • unstructured, magnitude-based, iterative, prune → train (when to prune)
    • structured vs unstructured pruning (e.g. "remove whole filters along axis 0 of a conv layer" vs "remove individual weights anywhere in the conv layer")
    • local vs global pruning (see the sketch after this list)
      • prune 20% locally: every layer loses 20% of its own weights
        • 1000 params in a conv layer, 2000 params in an fc layer → 800 and 1600 params respectively
      • prune 20% globally: 20% of all weights are removed wherever they are smallest
        • 1000 + 2000 params → 600 params pruned from the whole model (useful when one layer, e.g. the conv layer, is a parameter bottleneck)
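A minimal sketch of the local vs. global distinction using torch.nn.utils.prune (the two-layer model and the 20% sparsity are illustrative, not the paper's setup):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Conv2d(3, 8, 3),              # "conv layer"
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 10),      # "fc layer"
)
conv, fc = model[0], model[2]

# Local pruning: each layer loses 20% of ITS OWN smallest-magnitude weights.
prune.l1_unstructured(conv, name="weight", amount=0.2)
prune.l1_unstructured(fc, name="weight", amount=0.2)

# Global pruning: 20% of ALL weights are removed wherever they are smallest,
# so a parameter-heavy layer can absorb most of the pruning.
# (In practice you would pick local OR global, not apply both as done here.)
prune.global_unstructured(
    [(conv, "weight"), (fc, "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)
```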
 

Identify Winning Tickets (two ways: one-shot & iterative pruning)

1. One-shot Pruning

  1. Randomly initialize a neural network f(x; θ_0) (where θ_0 ∼ D_θ).
  2. Train the network for j iterations, arriving at parameters θ_j.
  3. Prune p% of the parameters in θ_j, creating a mask m.
  4. Reset the remaining parameters to their values in θ_0, creating the winning ticket f(x; m ⊙ θ_0).
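A minimal sketch of the one-shot procedure with manual magnitude masks (train_fn is a hypothetical caller-supplied training loop; the 20% pruning fraction is illustrative):

```python
import copy
import torch

def one_shot_winning_ticket(model, train_fn, prune_frac=0.2):
    """Train, prune the smallest-magnitude weights, reset survivors to their initial values."""
    init_state = copy.deepcopy(model.state_dict())   # θ_0

    train_fn(model)                                   # train for j iterations → θ_j

    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                               # skip biases / norm parameters
            continue
        k = max(1, int(prune_frac * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()   # mask m keeps the largest-magnitude weights

    # Reset the remaining parameters to their values in θ_0 → winning ticket f(x; m ⊙ θ_0)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.copy_(init_state[name] * masks[name])
    return model, masks
```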

2. Iterative Pruning Strategies (to find winning ticket!)

  • Strategy 1: Iterative pruning with resetting (this paper's novelty; both strategies are sketched in code after this list)
      1. randomly initialize θ_0, set the mask m = 1^|θ| (all weights kept)
      2. train the network for j iterations → parameters θ_j
      3. prune s% of the currently unpruned parameters → updated mask m'
      4. reset the weights of the remaining portion of the network to their values in θ_0 (i.e. m' ⊙ θ_0)
      5. set m = m' and repeat steps 2~4
  • Strategy 2: Iterative pruning with continued training
      1. randomly initialize θ_0, set the mask m = 1^|θ|
      2. train the network for j iterations (continuing from the current weights, no reset)
      3. prune s% of the parameters → updated mask m'
      4. set m = m' and repeat steps 2~3
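A minimal sketch of both iterative strategies in one function (it reuses the hypothetical train_fn from the one-shot sketch; s and rounds are illustrative, and enforcing the mask during training itself, e.g. via hooks or torch.nn.utils.prune, is omitted for brevity):

```python
import copy
import torch

def iterative_prune(model, train_fn, s=0.2, rounds=5, reset_to_init=True):
    """reset_to_init=True  → Strategy 1: reset survivors to θ_0 after each round.
    reset_to_init=False → Strategy 2: keep training from the current weights."""
    init_state = copy.deepcopy(model.state_dict())                        # θ_0
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() >= 2}

    for _ in range(rounds):
        train_fn(model)                                                   # train for j iterations

        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p[masks[name].bool()].abs()                           # currently unpruned weights
            k = max(1, int(s * alive.numel()))
            threshold = alive.kthvalue(k).values
            masks[name] *= (p.abs() > threshold).float()                  # prune s% of survivors → m'

        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    source = init_state[name] if reset_to_init else p
                    p.copy_(source * masks[name])                         # reset (1) or just re-mask (2)
    return model, masks
```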
 

Experiment

 

Winning Tickets in Fully-connected Networks (MNIST)

  • Figure 4a: unlike winning tickets, the randomly reinitialized networks learn more slowly than the original network and lose test accuracy after only a little pruning
  • Figure 3: in fully-connected networks, moderate pruning helps the winning tickets generalize better (test accuracy improves as the network is pruned, up to a point)

Winning Tickets in Convolutional Networks (CIFAR10)

  • early-stop iteration: lower is better → at every sparsity level, the reset (winning-ticket) networks stop earlier than their randomly reinitialized counterparts
  • accuracy at early stop: higher is better → at every sparsity level, the reset networks reach higher accuracy than their randomly reinitialized counterparts

Importance of Initialization

(NOTE: it is important to reset the network to the winning ticket's original initialization θ_0, not to the winning ticket's final trained values)
  • Figure 4a: compare Random Reinit (iterative) vs. Winning Ticket (iterative); a sketch of the reinit control follows below
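A minimal sketch of the random-reinit control: keep the winning ticket's mask but resample fresh weights instead of restoring θ_0 (using each module's reset_parameters() is an assumption about how one might resample):

```python
import torch

def random_reinit_control(model, masks):
    """Keep the winning ticket's sparsity pattern, discard its initialization."""
    for module in model.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()          # fresh random weights (NOT the original θ_0)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])            # same mask, different starting values
    return model
```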

Conclusions

  • winning tickets learn faster than the original dense network
  • winning tickets reach higher test accuracy than the original network (even with only ~20% of the weights remaining)

Limitations

  • we cannot (yet) train winning-ticket networks from scratch: finding them still requires training, pruning, and resetting the full dense network

Pruning Demo in PyTorch
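A small self-contained sketch with torch.nn.utils.prune (the Lenet-300-100-style layer sizes match the paper's MNIST experiments, but the 20% global sparsity is just an example):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Fully-connected MNIST network in the spirit of Lenet-300-100.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# Globally prune 20% of the smallest-magnitude weights across all Linear layers.
to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.2)

# Each pruned layer now carries weight_orig and weight_mask; `weight` is their product.
for m, _ in to_prune:
    sparsity = (m.weight == 0).float().mean().item()
    print(f"{m}: {sparsity:.1%} of weights pruned")

# Fold the masks back into `weight` and drop the pruning re-parametrization.
for m, _ in to_prune:
    prune.remove(m, "weight")
```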

References
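
  • Jonathan Frankle, Michael Carbin. "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." ICLR 2019. arXiv:1803.03635.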
