Hacker News
Lossless Acceleration of LLM via Adaptive N-Gram Parallel Decoding (arxiv.org)
136 points by PaulHoule on April 21, 2024 | 23 comments


This is speculative decoding with an n-gram Markov chain instead of a weaker transformer model in the “speculating” position.


Thanks, this is a great, succinct way of summarizing the paper.


Good idea. Generate 8 tokens with a small model, then give them to a large model as one batch (much faster); if tokens 1..3 are in agreement but 4..8 are not, take only 1..4 and start again. Most tokens are easy to guess, so you'll get ~3.5x gains.
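The accept/reject step described above can be sketched like this (a toy illustration with made-up token IDs and my own names; real implementations compare against the large model's sampled outputs, not necessarily greedy ones):

```python
def accept_draft(draft, target):
    """Keep draft tokens while they match the large model's own next-token
    choices; at the first mismatch, take the large model's token instead
    and stop. Every emitted token is therefore one the large model would
    have produced anyway, which is what makes the speedup lossless."""
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # the large model's correction at the mismatch
            return accepted
    return accepted  # all draft tokens matched


# Draft guesses 8 tokens; the large model agrees on the first 3,
# so we accept tokens 1..3 plus the model's own token 4:
print(accept_draft([5, 9, 2, 7, 1, 4, 4, 8],
                   [5, 9, 2, 3, 6, 0, 0, 0]))  # [5, 9, 2, 3]
```

The key point is that verifying all 8 drafted positions costs roughly one batched forward pass, while generating them one at a time would cost 8.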

I do have a feeling of déjà vu, like I've seen this before on HN.


> I do have a feeling of déjà vu, like I've seen this before on HN.

You're either thinking of speculative decoding more generally, or Medusa: https://arxiv.org/abs/2401.10774


They mention previous work on speculative decoding using similar techniques, but "ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods."


Just so I understand, this is essentially faster because reading all the model weights takes more time than calculating the model output?


GPUs are great at doing the same math on every item of a large multidimensional array.

Therefore, unsurprisingly, the cost per item of inference on a batch of items is significantly lower when the batch is e.g. 8 than 1 (in the case of Transformers there are further gains to be made because roughly half of the attention calculations in token k+1 are identical to the calculations of token k and can be easily reused by writing the formulas a certain way, the keyword to look for is causal attention mask).

In any reasonable GPU inference setup the weights would be preloaded.
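The causal-mask reuse mentioned above can be illustrated with a toy NumPy sketch (names are mine; this omits batching, multiple heads, and the KV cache): scoring all n positions takes a single batched matmul, whereas decoding them one at a time repeats most of that work.

```python
import numpy as np

def causal_attention(q, k, v):
    """Score every query against every key in one (n, n) matmul, then mask
    out 'future' positions so token i only attends to tokens <= i. One call
    computes attention for the whole sequence at once."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                       # all pairs in one matmul
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly upper triangle
    scores[future] = -np.inf                            # hide future tokens
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = causal_attention(q, k, v)
# Position 0 can only attend to itself, so its output is exactly v[0]:
assert np.allclose(out[0], v[0])
```

Because of the mask, the rows computed for a prefix of length k are identical to the first k rows computed for length k+1, which is the reuse the comment above refers to.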


Indeed GPUs are great at doing the same calculation in parallel. But if it were just that, there should be enough opportunity to parallelise even without doing the exact same calculation multiple times.

The main reason I can come up with for why doing the same calculation 8 times in parallel is faster than 8 times sequentially is that you benefit from better locality of reference.


As I said, without caching the attention step is O(n^2) per token when decoding sequentially, but only O(n^2) total for the entire sequence when it is computed in parallel, where n is the length of the sequence.


I think a number of us are using the method and these folks decided to do a paper and release their findings.


There are already lots of speedups on Llama 3, e.g. Groq (a hardware architecture) and Modal [1] getting 200 t/s.

From the Modal link below, they use: >continuous batching, so multiple generations can take place at the same time on a single container

>PagedAttention, which applies memory paging to the attention mechanism’s key-value cache, increasing throughput

[1] https://modal.com/docs/examples/text_generation_inference


Very reminiscent of parallel WaveNet, which sped things up by generating multiple audio samples at a time.


How does this differ from the 2018 NeurIPS paper, Blockwise Parallel Decoding for Deep Autoregressive Models?

https://arxiv.org/abs/1811.03115


They use a separate ngram model to generate the proposed sequence instead of extra heads on top of the main model. The process of verifying the proposed sequence appears to be the same.


The speedup here would be very dependent on the context -- the kind of texts the models are working with -- as the paper proposes a rather naive n-gram generator (maybe I should say it does not provide any details on this critical component, instead simply referring to the Jurafsky textbook). It might not be robust. By contrast, Apple's work on using the same model to produce the n-gram lookahead is robust -- the n-gram generator works as well as the model itself: https://arxiv.org/abs/2402.11131


If this is "plug and play", can it be added to, say, llama.cpp to give a ~3.67x speedup to existing models, or is there some complication?


The speedup would not be that high in practice for folks already using speculative decoding[1]. ANPD is similar but uses a simpler and faster drafting approach. These two enhancements can't be meaningfully stacked. Here's how the paper describes it:

> ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods.

ANPD does provide a more general-purpose solution to drafting that does not require training, loading, and running draft LLMs.

[1] https://github.com/ggerganov/llama.cpp/pull/2926


Who is already using speculative decoding? I haven't seen anything about it in the llama.cpp or ollama docs.



The HuggingFace transformers library already has support for a similar method called prompt lookup decoding that uses the existing context to generate an ngram model: https://github.com/huggingface/transformers/issues/27722

I don't think it would be that hard to switch it out for a pretrained ngram model.
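The core of prompt lookup decoding can be sketched in a few lines (a simplified illustration with my own names, not HuggingFace's actual implementation): find where the last few tokens of the context occurred earlier, and propose whatever followed them as the draft.

```python
def prompt_lookup_draft(context, ngram_size=2, num_draft=8):
    """Draft tokens by matching the context's trailing n-gram against an
    earlier occurrence in the same context and copying what followed it.
    Scans right-to-left so the most recent match wins."""
    if len(context) <= ngram_size:
        return []
    suffix = context[-ngram_size:]
    for i in range(len(context) - ngram_size - 1, -1, -1):
        if context[i:i + ngram_size] == suffix:
            return context[i + ngram_size : i + ngram_size + num_draft]
    return []  # no match: fall back to ordinary one-token decoding

# The context ends in [1, 2], which also occurred at the start,
# so the tokens that followed it become the draft:
print(prompt_lookup_draft([1, 2, 3, 4, 1, 2]))  # [3, 4, 1, 2]
```

This is why it works so well on repetitive inputs (code, extraction, summarization): the "n-gram model" is just the prompt itself, with no training or extra weights.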


They should support another layer of pre-generation that is controlled on the client side / through a grammar, instead of an n-gram Markov chain / small network.

Especially when it comes to programming languages or data formats like JSON -- a lot of compute is spent on, e.g., silly whitespace.


I might be naive when it comes to classic ML / NLP: how do you keep the prefix table for N = 5? Naively, that lookup table would have 100k^5 entries (assuming a 100k vocabulary). Is it very sparse? How large is it usually?
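In practice such tables are stored sparsely, keyed only on prefixes that actually occur: a corpus with T tokens can create at most T distinct prefixes, nowhere near the dense 100k^4 prefix space. A rough sketch (names mine, assuming a plain count-based table):

```python
from collections import defaultdict

def build_ngram_table(tokens, n=5):
    """Map each observed (n-1)-token prefix to counts of the token that
    followed it. Only observed prefixes take up space, so the table has
    at most len(tokens) - n + 1 entries regardless of vocabulary size."""
    table = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i : i + n - 1])
        table[prefix][tokens[i + n - 1]] += 1
    return table

table = build_ngram_table([1, 2, 3, 4, 5, 1, 2, 3, 4, 6], n=5)
print(len(table))                # 5 prefixes, not 100k^4
print(dict(table[(1, 2, 3, 4)]))  # {5: 1, 6: 1}
```

Drafting then just means repeatedly taking the most frequent continuation of the current prefix, which is why the lookup is so cheap compared to a draft model's forward pass.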


I don't think this will be better than a weaker transformer.



