Dynamic Tensor Rematerialization (DTR)

paper | slides | video
Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock

DTR:

Adoption:

MegEngine

Technical Dive

  1. Gradient checkpointing is a technique for saving memory during deep neural network training (a minimal example appears after this list).
  2. Or more generally, for reverse-mode automatic differentiation.
  3. However, memory planning is NP-complete.
  4. Checkpointing also has to deal with programs with arbitrary control flow.
  5. To combat this, previous work imposed different restrictions that sacrifice performance or usability.
  6. Some works model the program as a stack machine with no heap...
  7. And suffer performance degradation when that assumption is broken!
  8. (For example, NNs with highway connections/branching.)
  9. Other works use an ILP solver, which takes a long time to find an optimal memory plan.
  10. And can only be used for programs/frameworks without control flow, posing problems for real-world adoption.
  11. Additionally, gradient checkpointing couples derivative calculation with memory saving via recomputation.
  12. This adds complexity and limits the range of applications.
  13. DTR tackles the problems above by planning memory greedily at runtime, instead of as a compiler pass.
  14. This solves the control-flow and stack-machine issues, as we do not model the program in any way!
  15. However, a novel cache eviction policy still lets us achieve great performance (see the eviction heuristic sketch after this list).
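
To make item 1 concrete, here is a minimal sketch of classic (static) gradient checkpointing using PyTorch's `torch.utils.checkpoint`. The toy `Block` model and its sizes are made up for illustration; this shows the general recompute-instead-of-store idea, not DTR's dynamic runtime approach.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """A toy layer standing in for an expensive stage of a real network."""
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

blocks = nn.ModuleList(Block() for _ in range(8))
x = torch.randn(32, 1024, requires_grad=True)

# checkpoint() discards each block's intermediate activations during the
# forward pass and recomputes them during backward, trading extra compute
# for lower peak memory.
h = x
for block in blocks:
    h = checkpoint(block, h, use_reentrant=False)  # drop use_reentrant on older PyTorch
h.sum().backward()
```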
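
To give a flavor of the runtime policy in items 13-15, below is a simplified, self-contained sketch of DTR-style greedy eviction. The `TensorMeta` bookkeeping, function names, and clock handling are hypothetical; only the shape of the heuristic, recompute cost divided by (memory × staleness), follows the paper.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TensorMeta:
    """Hypothetical per-tensor bookkeeping a rematerializing runtime might keep."""
    name: str
    memory: int          # bytes currently occupied by the tensor
    compute_cost: float  # estimated cost to recompute (rematerialize) it
    last_access: float = field(default_factory=time.monotonic)

def score(t: TensorMeta, now: float) -> float:
    """Eviction score: lower is a better victim (cheap to recompute,
    frees a lot of memory, and has not been used recently)."""
    staleness = max(now - t.last_access, 1e-9)
    return t.compute_cost / (t.memory * staleness)

def evict_until_under(pool: list[TensorMeta], budget: int) -> list[TensorMeta]:
    """Greedily evict the lowest-scoring resident tensors until the pool fits the budget."""
    evicted = []
    while pool and sum(t.memory for t in pool) > budget:
        now = time.monotonic()
        victim = min(pool, key=lambda t: score(t, now))
        pool.remove(victim)
        evicted.append(victim)  # a real runtime would recompute these on demand
    return evicted
```

When an evicted tensor is touched again, the runtime reruns the operator that produced it (rematerializing any evicted inputs first); the sketch above only covers the eviction decision, not that recomputation machinery.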
