What's the most cost-effective alternative to Paperspace? I had a nightmarish experience with them last week: my account got locked twice while I was training a model on a 1.5 GB dataset that contained the string "Minecraft Server" somewhere.
When I experimented with Stable Diffusion and ROCm (AMD card), I had to do something similar, but with `pytorch-rocm`; and when I ran CPU-only, I used `pytorch-cpu`. So maybe your attempt didn't use the GPU at all, because 12 minutes is about what I got on a CPU for inference on other models of similar size.
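If it helps, here's a quick sanity check (a minimal sketch; it only assumes a reasonably recent PyTorch) to confirm which backend the installed build will actually use before blaming the model:

```python
# Minimal sketch: report which compute backend this PyTorch build can actually use.
import torch

if torch.cuda.is_available():
    # ROCm builds also report True here; torch.version.hip is set instead of torch.version.cuda.
    backend = "ROCm" if torch.version.hip else "CUDA"
    print(f"{backend} GPU:", torch.cuda.get_device_name(0))
elif torch.backends.mps.is_available():
    print("Apple Silicon GPU (MPS) available")
else:
    print("CPU-only build -- expect slow inference")
```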
The error message implies that the default compiled libraries on the M1 don't support the model format, even though it works fine on Paperspace.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
File "/Users/fragmede/projects/llm/dolly/foo.py", line 5, in <module>
instruct_pipeline = pipeline(
^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 776, in pipeline
framework, model = infer_framework_load_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/base.py", line 271, in infer_framework_load_model
raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).
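For what it's worth, that ValueError from pipeline() often hides the real underlying exception (out-of-memory, unsupported dtype, etc.). Here's a minimal debugging sketch that loads the classes directly so the original error surfaces; the float16 dtype and MPS device are my assumptions, not necessarily what the original script used:

```python
# Debugging sketch (assumes transformers and a recent torch are installed;
# float16 and MPS are my choices, not necessarily what foo.py used).
# Loading the classes directly usually shows the real exception that
# pipeline() collapses into the generic "Could not load model" ValueError.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "databricks/dolly-v2-12b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp32 weights for ~12B params are roughly 48 GB
)

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(device)
print("Loaded on", device)
```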
No worries, it happens. I'll admit my answer wasn't clear that I was referring to the linked page rather than the question in the post. All good.
I attempted it with the Transformers library but failed. Not sure why; it might be a VRAM issue. I'm going to try on my far beefier personal MacBook Pro later tonight.
How much RAM is likely needed on an Apple Silicon machine for models like this? And for general use: 64, 96, or 128 GB? I'm trying to decide how large I should go for a new laptop.
I very recently purchased a MacBook Pro (M1 Max) with 64GB of RAM. I haven't experimented that much, but I was able to run inference with the 65B-parameter LLaMA model using quantized weights at a speed that was reasonably usable (maybe a touch slower than ChatGPT with GPT-4).
I haven't attempted to use the 65B model with non-quantized weights, but the smaller models work that way, if slowly. With 96GB of RAM (the upper limit for a MacBook Pro) you might be able to use even larger models, but I think you'd hit the limits of useful performance before that point.
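For a rough sense of why that works, the back-of-the-envelope weight-memory math (my own rough numbers, ignoring activations and the KV cache) looks like this:

```python
# Back-of-the-envelope weight memory for a 65B-parameter model (rough numbers;
# ignores activations, KV cache, and runtime overhead).
params = 65e9
for label, bytes_per_param in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{label:>5}: ~{params * bytes_per_param / 2**30:.0f} GiB")
# fp16 ~121 GiB (won't fit in 64GB), 8-bit ~61 GiB (tight), 4-bit ~30 GiB (fits)
```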
I should note that it can be a bit tricky getting things to work on the Mac's GPU. I couldn't get Dolly 6B to run on my work MBP, which theoretically should have enough RAM, though I still want to try it on my personal laptop.
I see a refurbished M1 with 2TB/128GB for $4700, and it looks like a similar price for an M2 with the same storage/RAM with my corporate discount (20-core CPU / 48-core GPU). This is a tough decision.
AFAIK current models can run even with 64GB, but I assume we will very likely have bigger models very soon, so I guess the answer is: as much as you can afford.
The next question is M1 or M2, and the impact of the varying number of GPU cores across the Pro, Max, and Ultra SKUs. I'm really tempted to buy a refurbished M1 Studio with 128GB because I think the RAM is the key. I haven't seen any benchmarks comparing different GPU core counts, i.e. different SKUs.
It runs slightly slower on the GPU than under llama.cpp, but uses much less power doing so.
I would guess the slowness is due to the immaturity of the PyTorch MPS backend. The asitop graphs show a bunch of CPU activity alongside the GPU, so it might be inefficiently falling back to the CPU for some ops and swapping layers back and forth (I have no idea, just guessing).
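One way to poke at that guess (a sketch; the env var is PyTorch's documented switch, though what it reveals depends on the model): without the flag, ops the MPS backend doesn't implement raise NotImplementedError, and with it they silently run on the CPU, which would line up with the CPU activity asitop shows.

```python
# Sketch: toggle PyTorch's CPU fallback for ops the MPS backend doesn't implement.
# Without the env var, an unsupported op raises NotImplementedError (telling you
# which op it is); with it set to "1", the op silently runs on the CPU instead.
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # must be set before importing torch

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(4, 4, device=device)
print(x.device, x.sum().item())  # trivial ops; real models exercise far more of the backend
```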
Anyone managed to run it on an M1/M2 Mac yet?