Transformers are Multi-State RNNs

Transformers can be conceptualized as unbounded multi-state RNNs and compressed efficiently using TOVA, a training-free policy that retains only the tokens with the highest attention scores.

This paper conceptually bridges Transformers and Recurrent Neural Networks (RNNs), showing that decoder-only Transformers function as “unbounded multi-state RNNs,” with the Key-Value (KV) cache playing the role of the hidden state. By capping the size of this state, the authors obtain a bounded variant and introduce a training-free compression policy called TOVA (Token Omission Via Attention). Rather than evicting tokens by position, as fixed-window policies do, TOVA keeps only the tokens with the highest attention scores, discarding the lowest-scoring entry whenever the cache is full. The results show that this policy can reduce the cache to 1/8 of its original size and increase throughput by up to 4.8x without significant performance loss, indicating that long contexts can be managed efficiently by discarding tokens the model itself deems unimportant.
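To make the eviction rule concrete, here is a minimal single-head sketch of a TOVA-style cache step, not the authors' implementation: the function name `tova_evict`, the tensor shapes, and the single-head scoring are illustrative assumptions (the paper aggregates attention across heads per layer).

```python
import torch

def tova_evict(keys, values, query, cache_size):
    """Illustrative TOVA-style eviction (sketch, not the reference code).

    keys, values: (seq_len, d) tensors holding the current KV cache.
    query:        (d,) tensor for the token currently being generated.
    cache_size:   maximum number of cache entries to retain.
    """
    if keys.shape[0] <= cache_size:
        return keys, values

    # Attention of the current query over all cached keys.
    scores = (keys @ query) / keys.shape[-1] ** 0.5   # (seq_len,)
    weights = torch.softmax(scores, dim=-1)

    # Drop the single lowest-scoring token; since the cache grows by one
    # entry per decoding step, one eviction keeps it at the size limit.
    drop = torch.argmin(weights)
    keep = torch.arange(keys.shape[0]) != drop
    return keys[keep], values[keep]
```

In this sketch the decision is purely score-based, so early tokens survive as long as the model keeps attending to them, which is the key difference from sliding-window eviction.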

Link