1. We just released a couple of models that are much smaller (https://huggingface.co/databricks/dolly-v2-6-9b), and these should be much easier to run on commodity hardware in a reasonable amount of time.
2. Regarding this particular issue, I suspect something is wrong with the setup. The example is generating a little over 100 words, which is probably something like 250 tokens. 12 minutes makes no sense for that if you're running on a modern GPU. I'd love to see details on which GPU was selected - I'm not aware of a modern GPU with 30GB of memory (the A10 is 24GB, the T4 is 16GB, and the A100 is 40/80GB). Are you sure you're using a version of PyTorch that installs CUDA correctly?
3. We have seen single-GPU inference work in 8-bit on the A10, so I'd suggest that as a follow-up (see the sketch below).
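As a rough illustration of that 8-bit path, here is a minimal sketch; it assumes bitsandbytes and accelerate are installed alongside transformers, and it uses standard Hugging Face pipeline options (device_map / model_kwargs) rather than anything Dolly-specific:

from transformers import pipeline

# Quantize the weights to int8 with bitsandbytes at load time; device_map="auto"
# lets accelerate place them on the available GPU.
generate_text = pipeline(
    model="databricks/dolly-v2-6-9b",
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={"load_in_8bit": True},
)
print(generate_text("Explain to me the difference between nuclear fission and fusion."))

At roughly one byte per parameter, the 6.9B model's int8 weights come to about 7GB, which should leave plenty of headroom on a 24GB card like the A10.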
I've also been struggling to run anything but the smallest model you have shared on Paperspace:
import torch
from transformers import pipeline

# Load the 6.9B model in bfloat16 on the first GPU and run a single prompt.
generate_text = pipeline(model="databricks/dolly-v2-6-9b", torch_dtype=torch.bfloat16, trust_remote_code=True, device=0)
generate_text("Explain to me the difference between nuclear fission and fusion.")
This causes the kernel to crash; the GPU should have plenty of memory:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P6000 Off | 00000000:00:05.0 Off | Off |
| 26% 45C P8 10W / 250W | 6589MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
I'm extremely excited to try these models, but this has been by far the most difficult experience I've ever had trying to do basic inference.
I've never used Paperspace, so I'll give it a try this weekend. How much RAM do you have attached to the compute instance? We don't think this should be any harder to run via HF pipelines than other similarly sized models, but I'll look into it.
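In the meantime, one thing worth checking (this is an assumption on my part, not something we've verified on a P6000): the Quadro P6000 is a Pascal-generation card without native bfloat16 support, so torch.bfloat16 may be part of the problem. A minimal sketch using float16 instead:

import torch
from transformers import pipeline

# Same call as above, but in float16; fp16 is the half-precision format
# that Pascal-era cards like the P6000 support, unlike bf16.
generate_text = pipeline(
    model="databricks/dolly-v2-6-9b",
    torch_dtype=torch.float16,  # assumption: fp16 avoids the bf16 path on Pascal
    trust_remote_code=True,
    device=0,
)
print(generate_text("Explain to me the difference between nuclear fission and fusion."))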