Hacker News
Lossless Acceleration of LLM via Adaptive N-Gram Parallel Decoding (arxiv.org)
136 points by PaulHoule on April 21, 2024 | 23 comments


This is speculative decoding with an n-gram Markov chain instead of a weaker transformer model in the “speculating” position.


Thanks, this is a great, succinct way of summarizing the paper.


Good idea. Generate 8 tokens with a small model, then give them to a large model as one batch (much faster); if tokens 1..3 are in agreement but 4..8 are not, take only 1..4 and start again. Most tokens are easy to guess, so you'll get ~3.5x gains.
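The accept/reject step described above can be sketched like this (a toy illustration with made-up token IDs and my own names; real implementations compare against the large model's sampled outputs, not necessarily greedy ones):

```python
def accept_draft(draft, target):
    """Keep draft tokens while they match the large model's own next-token
    choices; at the first mismatch, take the large model's token instead
    and stop. Every emitted token is therefore one the large model would
    have produced anyway, which is what makes the speedup lossless."""
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # the large model's correction at the mismatch
            return accepted
    return accepted  # all draft tokens matched


# Draft guesses 8 tokens; the large model agrees on the first 3,
# so we accept tokens 1..3 plus the model's own token 4:
print(accept_draft([5, 9, 2, 7, 1, 4, 4, 8],
                   [5, 9, 2, 3, 6, 0, 0, 0]))  # [5, 9, 2, 3]
```

The key point is that verifying all 8 drafted positions costs roughly one batched forward pass, while generating them one at a time would cost 8.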

I do have a feeling of déjà vu, like I've seen this before on HN.


> I do have a feeling of déjà vu, like I've seen this before on HN.

You're either thinking of speculative decoding more generally, or Medusa: https://arxiv.org/abs/2401.10774


They mention previous work on speculative decoding using similar techniques, but "ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods."


Just so I understand, this is essentially faster because reading all the model weights takes more time than calculating the model output?


GPUs are great at doing the same math on every item of a large multidimensional array.

Therefore, unsurprisingly, the cost per item of inference on a batch of items is significantly lower when the batch is e.g. 8 than 1 (in the case of Transformers there are further gains to be made because roughly half of the attention calculations in token k+1 are identical to the calculations of token k and can be easily reused by writing the formulas a certain way, the keyword to look for is causal attention mask).

In any reasonable GPU inference setup the weights would be preloaded.
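The causal-mask reuse mentioned above can be illustrated with a toy NumPy sketch (names are mine; this omits batching, multiple heads, and the KV cache): scoring all n positions takes a single batched matmul, whereas decoding them one at a time repeats most of that work.

```python
import numpy as np

def causal_attention(q, k, v):
    """Score every query against every key in one (n, n) matmul, then mask
    out 'future' positions so token i only attends to tokens <= i. One call
    computes attention for the whole sequence at once."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                       # all pairs in one matmul
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly upper triangle
    scores[future] = -np.inf                            # hide future tokens
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = causal_attention(q, k, v)
# Position 0 can only attend to itself, so its output is exactly v[0]:
assert np.allclose(out[0], v[0])
```

Because of the mask, the rows computed for a prefix of length k are identical to the first k rows computed for length k+1, which is the reuse the comment above refers to.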


Indeed GPUs are great at doing the same calculation in parallel. But if it were just that, there should be enough opportunity to parallelise even without doing the exact same calculation multiple times.

The main reason I can come up with for why doing the same calculation 8 times in parallel is faster than 8 times sequentially is that you benefit from better locality of reference.


As I said, without caching the attention step is O(n^2) per token when decoding sequentially, but only O(n^2) total for the entire sequence when it is computed in parallel, where n is the length of the sequence.


I think a number of us are using the method and these folks decided to do a paper and release their findings.


There are already lots of speedups on Llama 3, e.g. Groq (a hardware architecture) and Modal [1] getting 200 t/s.

From the Modal link below, they use: >continuous batching, so multiple generations can take place at the same time on a single container

>PagedAttention, which applies memory paging to the attention mechanism’s key-value cache, increasing throughput

[1] https://modal.com/docs/examples/text_generation_inference


Very reminiscent of parallel WaveNet, which sped things up by generating multiple audio samples at a time.


How does this differ from the 2018 NeurIPS paper, Blockwise Parallel Decoding for Deep Autoregressive Models?

https://arxiv.org/abs/1811.03115


They use a separate ngram model to generate the proposed sequence instead of extra heads on top of the main model. The process of verifying the proposed sequence appears to be the same.


The speedup here would be very dependent on the context -- the kind of texts the models are working with -- as the paper proposes a rather naive n-gram generator (maybe I should say it does not provide any details on this critical component, instead simply referring to the Jurafsky textbook). It might not be robust. By contrast, Apple's work on using the same model to produce the n-gram lookahead is robust -- the n-gram generator works as well as the model itself: https://arxiv.org/abs/2402.11131


If this is "plug and play", can it be added to, say, llama.cpp to give a ~3.67x speedup to existing models, or is there some complication?


The speedup would not be that high in practice for folks already using speculative decoding[1]. ANPD is similar but uses a simpler and faster drafting approach. These two enhancements can't be meaningfully stacked. Here's how the paper describes it:

> ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods.

ANPD does provide a more general-purpose solution to drafting that does not require training, loading, and running draft LLMs.

[1] https://github.com/ggerganov/llama.cpp/pull/2926


Who is already using speculative decoding? I haven't seen anything about it in the llama.cpp or ollama docs.



The HuggingFace transformers library already has support for a similar method called prompt lookup decoding that uses the existing context to generate an ngram model: https://github.com/huggingface/transformers/issues/27722

I don't think it would be that hard to switch it out for a pretrained ngram model.
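The core of prompt lookup decoding can be sketched in a few lines (a simplified illustration with my own names, not HuggingFace's actual implementation): find where the last few tokens of the context occurred earlier, and propose whatever followed them as the draft.

```python
def prompt_lookup_draft(context, ngram_size=2, num_draft=8):
    """Draft tokens by matching the context's trailing n-gram against an
    earlier occurrence in the same context and copying what followed it.
    Scans right-to-left so the most recent match wins."""
    if len(context) <= ngram_size:
        return []
    suffix = context[-ngram_size:]
    for i in range(len(context) - ngram_size - 1, -1, -1):
        if context[i:i + ngram_size] == suffix:
            return context[i + ngram_size : i + ngram_size + num_draft]
    return []  # no match: fall back to ordinary one-token decoding

# The context ends in [1, 2], which also occurred at the start,
# so the tokens that followed it become the draft:
print(prompt_lookup_draft([1, 2, 3, 4, 1, 2]))  # [3, 4, 1, 2]
```

This is why it works so well on repetitive inputs (code, extraction, summarization): the "n-gram model" is just the prompt itself, with no training or extra weights.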


They should support another layer of pre-generation that is controlled on the client side / through a grammar, instead of an n-gram Markov chain / small network.

Especially when it comes to programming languages or data formats like JSON -- a lot of compute is spent on, e.g., silly whitespace.


I might be naive when it comes to classic ML / NLP: how do you keep the prefix table for N = 5? Naively, that lookup table would have 100k^5 entries (assuming a 100k vocabulary). Is it very sparse? How large is it usually?
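In practice such tables are stored sparsely, keyed only on prefixes that actually occur: a corpus with T tokens can create at most T distinct prefixes, nowhere near the dense 100k^4 prefix space. A rough sketch (names mine, assuming a plain count-based table):

```python
from collections import defaultdict

def build_ngram_table(tokens, n=5):
    """Map each observed (n-1)-token prefix to counts of the token that
    followed it. Only observed prefixes take up space, so the table has
    at most len(tokens) - n + 1 entries regardless of vocabulary size."""
    table = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i : i + n - 1])
        table[prefix][tokens[i + n - 1]] += 1
    return table

table = build_ngram_table([1, 2, 3, 4, 5, 1, 2, 3, 4, 6], n=5)
print(len(table))                # 5 prefixes, not 100k^4
print(dict(table[(1, 2, 3, 4)]))  # {5: 1, 6: 1}
```

Drafting then just means repeatedly taking the most frequent continuation of the current prefix, which is why the lookup is so cheap compared to a draft model's forward pass.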


I don't think this will be better than a weaker transformer.



