Researchers at Google have proposed a new gated linear recurrent layer, the RG-LRU layer, and designed a new recurrent block around it as an alternative to multi-query attention (MQA). They used this recurrent block to build two new models: Hawk, which combines MLPs with the recurrent block, and Griffin, which combines MLPs, the recurrent block, and local attention. By training Hawk and Griffin on 300B tokens across a range of model sizes, they found that Hawk-3B outperforms Mamba-3B on downstream tasks despite being trained on half as many tokens, while Griffin-7B and Griffin-14B achieve performance comparable to Llama-2 despite being trained on roughly 1/7 as many tokens.
Link to paper:
https://arxiv.org/pdf/2402.19427.pdf
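For intuition, below is a minimal NumPy sketch of the RG-LRU recurrence following the equations in the paper (recurrence gate, input gate, and a gated decay a_t = a^(c*r_t) with c = 8). The parameter names, initialization, and single-sequence loop here are illustrative assumptions for readability, not the reference implementation, which is written as a fused, parallelized kernel.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rg_lru(x, W_a, b_a, W_x, b_x, Lambda, c=8.0):
    """Sketch of the RG-LRU recurrence for one unbatched sequence.

    x:          (seq_len, d) input sequence.
    W_a, b_a:   recurrence-gate parameters, r_t = sigmoid(W_a @ x_t + b_a).
    W_x, b_x:   input-gate parameters,      i_t = sigmoid(W_x @ x_t + b_x).
    Lambda:     per-channel parameter; base decay a = sigmoid(Lambda).
    c:          scalar constant (8 in the paper).
    """
    seq_len, d = x.shape
    a = sigmoid(Lambda)                     # per-channel decay in (0, 1)
    h = np.zeros(d)                         # recurrent state
    out = np.empty_like(x)
    for t in range(seq_len):
        r_t = sigmoid(W_a @ x[t] + b_a)     # recurrence gate
        i_t = sigmoid(W_x @ x[t] + b_x)     # input gate
        a_t = a ** (c * r_t)                # gated effective decay
        # Gated linear recurrence with a normalizing scale on the input.
        h = a_t * h + np.sqrt(1.0 - a_t**2) * (i_t * x[t])
        out[t] = h
    return out

# Toy usage with random parameters (illustrative only).
rng = np.random.default_rng(0)
d, T = 4, 6
x = rng.normal(size=(T, d))
W_a = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
b_a, b_x = np.zeros(d), np.zeros(d)
Lambda = rng.normal(size=d)
y = rg_lru(x, W_a, b_a, W_x, b_x, Lambda)
print(y.shape)  # (6, 4)
```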