The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
Inc Lomin
Oct 11, 2022
  • Lottery Ticket Hypothesis
    • A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective
      ⇒ a winning ticket (a lucky initialization) has won the lottery: it learns faster and reaches higher test accuracy than the dense network (formalized just below)
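Stated a bit more formally (a paraphrase of the paper's notation, where θ_0 is the dense network's initialization, m a binary mask, and ⊙ element-wise multiplication):

```latex
% Dense network f(x; \theta_0), \theta_0 \sim \mathcal{D}_\theta, reaches test accuracy a after j training iterations.
% The hypothesis claims a sparse subnetwork exists that does at least as well, at least as fast:
\exists\, m \in \{0,1\}^{|\theta|} : \quad j' \le j, \quad a' \ge a, \quad \|m\|_0 \ll |\theta|
% where j' and a' are the training iterations and test accuracy of the subnetwork f(x; m \odot \theta_0).
```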
 

Introduction

Different pruning methods

  • What connectivity structures to prune?
  • How to rank weights to prune?
  • How often to prune?
  • When to perform the pruning step?
 

Proposed Method

Previous iterative pruning methods

  • one-shot (train → prune p% of the weights in a single pass)
  • iterative (each round prunes a fraction of the remaining parameters, retrains, and prunes again, repeated n times)

Novelty

  • iterative, but the remaining parameters are reset to their initial values before each retraining round, then re-pruned (see the one-shot and iterative sketches below)
    • idea: the work that the pruned weights used to do is now carried by the remaining parameters, so the remaining weights need to be recalibrated (e.g. if pruning the next-smallest weight would cut a needed connection entirely, it no longer makes sense to remove it without retraining first)
  • this pruning procedure finds winning tickets that learn faster than the original network while reaching higher test accuracy and generalizing better

This paper’s pruning method

notion image
  • fully-connected architecture for MNIST
  • convolutional architectures for CIFAR10
  • unstructured, magnitude-based, iterative, prune → train (when to prune)
    • structured vs unstructured pruning (e.g. "remove whole filters along axis 0 of a conv layer" vs "remove individual weights anywhere in the conv layer")
    • local vs global pruning (see the sketch after this list)
      • prune 20% locally: every layer loses 20% of its own weights
        • 1000 params in a conv layer, 2000 params in an fc layer → 800 and 1600 params respectively
      • prune 20% globally: 20% of all weights are removed wherever they are smallest
        • 1000 + 2000 params → 600 params pruned from the whole model (useful when one layer, e.g. the conv layer, is a parameter bottleneck)
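A minimal sketch of the local vs. global distinction using torch.nn.utils.prune (the two-layer model and the 20% sparsity are illustrative, not the paper's setup):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Conv2d(3, 8, 3),              # "conv layer"
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 10),      # "fc layer"
)
conv, fc = model[0], model[2]

# Local pruning: each layer loses 20% of ITS OWN smallest-magnitude weights.
prune.l1_unstructured(conv, name="weight", amount=0.2)
prune.l1_unstructured(fc, name="weight", amount=0.2)

# Global pruning: 20% of ALL weights are removed wherever they are smallest,
# so a parameter-heavy layer can absorb most of the pruning.
# (In practice you would pick local OR global, not apply both as done here.)
prune.global_unstructured(
    [(conv, "weight"), (fc, "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)
```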
 

Identify Winning Tickets (two ways: one-shot & iterative pruning)

1. One-shot Pruning

  1. Randomly initialize a neural network f(x; θ_0) (where θ_0 ∼ D_θ).
  2. Train the network for j iterations, arriving at parameters θ_j.
  3. Prune p% of the parameters in θ_j, creating a mask m.
  4. Reset the remaining parameters to their values in θ_0, creating the winning ticket f(x; m ⊙ θ_0).
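A minimal sketch of the one-shot procedure with manual magnitude masks (train_fn is a hypothetical caller-supplied training loop; the 20% pruning fraction is illustrative):

```python
import copy
import torch

def one_shot_winning_ticket(model, train_fn, prune_frac=0.2):
    """Train, prune the smallest-magnitude weights, reset survivors to their initial values."""
    init_state = copy.deepcopy(model.state_dict())   # θ_0

    train_fn(model)                                   # train for j iterations → θ_j

    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                               # skip biases / norm parameters
            continue
        k = max(1, int(prune_frac * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()   # mask m keeps the largest-magnitude weights

    # Reset the remaining parameters to their values in θ_0 → winning ticket f(x; m ⊙ θ_0)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.copy_(init_state[name] * masks[name])
    return model, masks
```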

2. Iterative Pruning Strategies (to find winning ticket!)

  • Strategy 1: Iterative pruning with resetting (this paper's novelty; both strategies are sketched in code after this list)
      1. randomly initialize θ_0, set the mask m = 1^|θ| (all weights kept)
      2. train the network for j iterations → parameters θ_j
      3. prune s% of the currently unpruned parameters → updated mask m'
      4. reset the weights of the remaining portion of the network to their values in θ_0 (i.e. m' ⊙ θ_0)
      5. set m = m' and repeat steps 2~4
  • Strategy 2: Iterative pruning with continued training
      1. randomly initialize θ_0, set the mask m = 1^|θ|
      2. train the network for j iterations (continuing from the current weights, no reset)
      3. prune s% of the parameters → updated mask m'
      4. set m = m' and repeat steps 2~3
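A minimal sketch of both iterative strategies in one function (it reuses the hypothetical train_fn from the one-shot sketch; s and rounds are illustrative, and enforcing the mask during training itself, e.g. via hooks or torch.nn.utils.prune, is omitted for brevity):

```python
import copy
import torch

def iterative_prune(model, train_fn, s=0.2, rounds=5, reset_to_init=True):
    """reset_to_init=True  → Strategy 1: reset survivors to θ_0 after each round.
    reset_to_init=False → Strategy 2: keep training from the current weights."""
    init_state = copy.deepcopy(model.state_dict())                        # θ_0
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() >= 2}

    for _ in range(rounds):
        train_fn(model)                                                   # train for j iterations

        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p[masks[name].bool()].abs()                           # currently unpruned weights
            k = max(1, int(s * alive.numel()))
            threshold = alive.kthvalue(k).values
            masks[name] *= (p.abs() > threshold).float()                  # prune s% of survivors → m'

        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    source = init_state[name] if reset_to_init else p
                    p.copy_(source * masks[name])                         # reset (1) or just re-mask (2)
    return model, masks
```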
 

Experiment

 

Winning Tickets in Fully-connected Networks (MNIST)

  • Figure 4a: unlike winning tickets, the randomly reinitialized networks learn more slowly than the original network and lose test accuracy after only a little pruning
  • Figure 3: in fully-connected networks, moderate pruning helps the winning tickets generalize better (test accuracy improves as the network is pruned, up to a point)

Winning Tickets in Convolutional Networks (CIFAR10)

  • early-stop iteration: lower is better → at every sparsity level, the reset (winning-ticket) networks stop earlier than their randomly reinitialized counterparts
  • accuracy at early stop: higher is better → at every sparsity level, the reset networks reach higher accuracy than their randomly reinitialized counterparts

Importance of Initialization

(NOTE: it is important to reset the network to the winning ticket's original initialization θ_0, not to the winning ticket's final trained values)
  • Figure 4a: compare Random Reinit (iterative) vs. Winning Ticket (iterative); a sketch of the reinit control follows below
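A minimal sketch of the random-reinit control: keep the winning ticket's mask but resample fresh weights instead of restoring θ_0 (using each module's reset_parameters() is an assumption about how one might resample):

```python
import torch

def random_reinit_control(model, masks):
    """Keep the winning ticket's sparsity pattern, discard its initialization."""
    for module in model.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()          # fresh random weights (NOT the original θ_0)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])            # same mask, different starting values
    return model
```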

Conclusions

  • winning tickets learn faster than the original dense network
  • winning tickets reach higher test accuracy than the original network (even with only ~20% of the weights remaining)

Limitations

  • we cannot (yet) train winning-ticket networks from scratch: finding them still requires training, pruning, and resetting the full dense network

Pruning Demo in PyTorch
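A small self-contained sketch with torch.nn.utils.prune (the Lenet-300-100-style layer sizes match the paper's MNIST experiments, but the 20% global sparsity is just an example):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Fully-connected MNIST network in the spirit of Lenet-300-100.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# Globally prune 20% of the smallest-magnitude weights across all Linear layers.
to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.2)

# Each pruned layer now carries weight_orig and weight_mask; `weight` is their product.
for m, _ in to_prune:
    sparsity = (m.weight == 0).float().mean().item()
    print(f"{m}: {sparsity:.1%} of weights pruned")

# Fold the masks back into `weight` and drop the pruning re-parametrization.
for m, _ in to_prune:
    prune.remove(m, "weight")
```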

References
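
  • Jonathan Frankle, Michael Carbin. "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." ICLR 2019. arXiv:1803.03635.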
