Good idea. Generate 8 tokens with a small model, then give them to a large model as one batch (much faster), and if tokens 1..3 are in agreement but 4..8 are not, take only 1..4 and start again. Most tokens are easy to guess, so you'll get roughly 3.5x gains.
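The accept step described above can be sketched like this (a toy version assuming greedy decoding; the token IDs are made up):

```python
# Sketch of the verification step in speculative decoding (assumption:
# greedy decoding; `draft` comes from the small model, `target` is the
# token the large model would emit at each position when it scores the
# whole draft in one batched forward pass).

def accept_tokens(draft, target):
    """Keep the longest agreeing prefix of the draft, plus the large
    model's own token at the first disagreement."""
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # the large model's correction comes for free
            break
    return accepted

# Draft tokens 1..3 match, token 4 differs: keep 1..3 plus the correction.
print(accept_tokens([5, 9, 2, 7, 1], [5, 9, 2, 4, 8]))  # [5, 9, 2, 4]
```

Even on a full rejection you still advance by one token (the correction), so the loop never gets slower than plain decoding in terms of accepted tokens per large-model pass.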
I do have a feeling of déjà vu, like I've seen this before on HN.
They mention previous work on speculative decoding using similar techniques, but "ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods."
GPUs are great at doing the same math on every item of a large multidimensional array.
Therefore, unsurprisingly, the cost per item of inference on a batch is significantly lower when the batch size is, say, 8 rather than 1. In the case of Transformers there are further gains to be made, because roughly half of the attention calculations for token k+1 are identical to the calculations for token k and can easily be reused by writing the formulas a certain way; the keyword to look for is "causal attention mask".
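The reuse mentioned above can be illustrated with a toy masked-attention computation (sizes and values are made up; this shows only the masking pattern, not a full Transformer layer):

```python
import numpy as np

# Toy causal self-attention over a sequence of n tokens (hypothetical sizes).
# The upper-triangular mask is what lets one batched pass score every
# position at once while each token still only "sees" the tokens before it.
n, d = 4, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))  # queries
K = rng.standard_normal((n, d))  # keys

scores = Q @ K.T / np.sqrt(d)                    # (n, n) attention logits
mask = np.triu(np.ones((n, n), dtype=bool), 1)   # True above the diagonal
scores[mask] = -np.inf                           # token i ignores tokens j > i

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Row i of `weights` is exactly what sequential decoding of token i would
# compute, so the batched pass gets the shared prefix work for free.
```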
In any reasonable GPU inference setup the weights would be preloaded.
Indeed, GPUs are great at doing the same calculation in parallel. But if it were just that, there should be enough opportunity to parallelise even without doing the exact same calculation multiple times.
The main reason I can come up with for why doing the same calculation 8 times in parallel beats doing it 8 times sequentially is that you benefit from better locality of reference.
As I said, the attention step is O(n) per token when generating sequentially (so O(n^2) over the whole sequence), and also O(n^2) total when calculating the entire sequence in parallel, where n is the length of the sequence.
They use a separate n-gram model to generate the proposed sequence instead of extra heads on top of the main model. The process of verifying the proposed sequence appears to be the same.
The speedup here would be very dependent on the context, i.e. the kinds of texts the models are working with, since the paper proposes a rather naive n-gram generator (or rather, it does not provide any details on this critical component and simply refers to the Jurafsky textbook). It might not be robust. By contrast, Apple's work on using the same model to produce the n-gram lookahead is robust, since the n-gram generator works as well as the model itself: https://arxiv.org/abs/2402.11131
The speedup would not be that high in practice for folks already using speculative decoding[1]. ANPD is similar but uses a simpler and faster drafting approach. These two enhancements can't be meaningfully stacked. Here's how the paper describes it:
> ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods.
ANPD does provide a more general-purpose solution to drafting that does not require training, loading, and running draft LLMs.
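Since the paper doesn't spell out the drafting module, here is my own toy version of the idea, an n-gram table updated from the text generated so far and used to propose draft tokens for the LLM to verify (the class name and parameters are made up; the paper's adaptive module is more involved):

```python
from collections import defaultdict

class NGramDrafter:
    """Toy n-gram drafter: counts (n-1)-token contexts seen so far and
    proposes the most frequent continuation, greedily, up to k tokens."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        # Accumulate real-time statistics from the generated text.
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.counts[ctx][tokens[i + self.n - 1]] += 1

    def draft(self, tokens, k=5):
        # Propose up to k draft tokens; stop when the context is unseen.
        out = list(tokens)
        for _ in range(k):
            ctx = tuple(out[-(self.n - 1):])
            nxt = self.counts.get(ctx)
            if not nxt:
                break
            out.append(max(nxt, key=nxt.get))
        return out[len(tokens):]

drafter = NGramDrafter(n=3)
drafter.update([1, 2, 3, 1, 2, 3, 1, 2])
print(drafter.draft([1, 2], k=4))  # [3, 1, 2, 3] -- repeats the learned pattern
```

No training, no second LLM in memory: the drafter is just a hash table, which is the appeal over draft-model speculative decoding.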
I might be naive when it comes to classic ML/NLP: how do you keep the prefix table for N = 5? Naively, that look-up table would have 100k^5 entries (assuming a 100k vocabulary). Is it very sparse? How large is it usually?
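To partly answer my own question with a sketch: in practice the table is stored sparsely, keyed only by the n-grams that actually occur, so its size is bounded by the length of the text seen, not by vocab_size^N (the example tokens below are made up):

```python
from collections import Counter

def prefix_table(tokens, n=5):
    """Sparse n-gram table: a dict keyed by the n-grams that occur.
    At most len(tokens) - n + 1 entries, regardless of vocabulary size."""
    table = Counter()
    for i in range(len(tokens) - n + 1):
        table[tuple(tokens[i:i + n])] += 1
    return table

tokens = [7, 7, 7, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
t = prefix_table(tokens, n=5)
print(len(t), len(tokens) - 5 + 1)  # distinct 5-grams vs. the upper bound
```

So for a corpus of T tokens the table holds at most T - N + 1 keys, and repeated n-grams (like the repeated run above) make it smaller still.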