Databricks Releases 15K Record Training Corpus for Instruction Tuning LLMs (github.com/databrickslabs)
347 points by xatalytic on April 12, 2023 | hide | past | favorite | 89 comments


15,000 instruction tuning records generated by Databricks employees in seven of the behavior categories outlined in the InstructGPT paper (predecessor to ChatGPT). Coincides with the release of Dolly 2.0, which is trained exclusively on this dataset and demonstrates high quality (but not state-of-the-art) instruction-following behavior.

The data and models are licensed for commercial use, setting them apart from recent releases trained on data from OpenAI.


>Coincides with the release of Dolly 2.0, which is trained exclusively on this dataset and demonstrates high quality (but not state-of-the-art) instruction-following behavior.

This is not correct. It was fine-tuned with this dataset, but the base model itself is EleutherAI's 12B Pythia model.


There are two: a 6B-parameter model fine-tuned from GPT-J and a 12B-parameter model fine-tuned from Pythia.


the GPT-J-6B one is Dolly 1.0, previously released

Dolly 2.0 is Pythia-12B fine-tuned on this new dataset

on their hugging face page [1] they admit the performance may not be much or any better than the original model (I am guessing this may be a weakness of Pythia-12B, which was intended for model-training research rather than best results)

the main point of Dolly 2.0 is the new dataset is unencumbered legally [2] whereas Alpaca et al were trained on ChatGPT transcripts, so commercialising those models would contradict OpenAI licensing terms

[1] https://huggingface.co/databricks/dolly-v2-12b

[2] https://www.databricks.com/blog/2023/04/12/dolly-first-open-...


I think there's probably nothing wrong with training on others' ChatGPT transcripts posted on the open web. OpenAI trains on source-available projects with non-commercial terms, so their lawyers have already been over a similar case and decided it should be fine.


Not just that: Imagine OpenAI going to court and establishing the legal precedent that makes their own product illegal.

So OpenAI can claim whatever they like; there is no way they will ever pursue legal action, unless their intent is to lose the court case on purpose and establish the precedent that it is okay to train on random data you scraped from the internet.

We would also get into a weird situation anyhow where it is hard/impossible to prove whether all/some/none of the information in a dataset is curated by humans. So in the worst case, we will have companies work with human curators (but secretly supplement with gray-sourced materials) during their training. Just like how it's hard to get 100% slave-free coffee beans or cacao.


I don't think it's about things being illegal per se

But that they can sue you because, by making a competing product with data obtained from their product, you contravened the terms & conditions for using it.


But so did they when they scraped the web for content.

That's not within anyone's terms and conditions except Wikipedia.

That's what I mean by precedent. If OpenAI were to win that case, they would in turn be sued by Bloomberg, for example.


Here's a link to open up and explore that training data in Datasette Lite: https://lite.datasette.io/?json=https://github.com/databrick...


I'm going through the dataset with your datasette tool and it looks like it might be a good idea to clean things up a bit. There are many duplicates[1], creepypastas[2] and other strange things in there.

[1] https://lite.datasette.io/?json=https%3A%2F%2Fraw.githubuser...

[2] https://lite.datasette.io/?json=https://github.com/databrick...

EDIT: Maybe I'm passing the link wrong; the query I'm using is:

    select count(instruction),
           instruction,
           group_concat(context, ' ============= ') as c,
           group_concat(response, ' ============= ') as r,
           group_concat(category, ' ============= ') as cat
    from [databricks-dolly-15k]
    group by instruction
    having count(instruction) > 1
    order by count(instruction) desc
    limit 100

[databricks-dolly-15k] should be the name of the dataset; the first column is the number of duplicates of each instruction.
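
If you'd rather poke at it in Python than SQL, something along these lines should surface the same duplicates. This assumes the corpus is also mirrored on the Hugging Face Hub as databricks/databricks-dolly-15k; otherwise point load_dataset at the JSONL file from the repo:

    from collections import Counter
    from datasets import load_dataset

    # hub name assumed; you can instead pass the local file via
    # load_dataset("json", data_files="databricks-dolly-15k.jsonl")
    ds = load_dataset("databricks/databricks-dolly-15k", split="train")

    counts = Counter(ds["instruction"])
    for instruction, n in counts.most_common(100):
        if n > 1:
            print(n, instruction[:80])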

Creepypastas are responses to instruction:

Imagine you are the last person on Earth. Write a diary entry describing your thoughts and feelings.


Typo on row 7!


row 7 is the name of the dataset, you might need to load it yourself


Can someone help me to understand why categories for these two differ?

row #51 "Think of some family rules to promote a healthy family relationship" - brainstorming [1]

row #68 "What is the future for human?" - general_qa [2]

By nature they both seem like brainstorming to me - is it the question mark that got #68 assigned to general_qa?

[1] https://lite.datasette.io/?json=https://github.com/databrick...

[2] https://lite.datasette.io/?json=https://github.com/databrick...


The labelling doesn't seem to be entirely consistent to me, but I think the idea is that 51 is inviting you to brainstorm, while 68 is asking a question that just happens to be open ended.


Hey! Worked on this here at Databricks: the blog post goes into the dataset collection design a bit (https://www.databricks.com/blog/2023/04/12/dolly-first-open-...). In summary, you're right - brainstorming and GeneralQA will have overlap because the taxonomy naturally has some overlap


This is the blog post with more details and background: https://www.databricks.com/blog/2023/04/12/dolly-first-open-...

Disclosure: I work at Databricks.


We also open sourced the Dolly model itself with a license that allows commercial use.



How hard would it be to get dolly running on llama.cpp?


Hey there! I worked on Dolly, and I work on Model Serving at Databricks. DollyV1 is GPT-J-based, so it'll run easily on llama.cpp. DollyV2 is Pythia-based, which is built with the GPT-NeoX library.

GPT-NeoX is not that different from GPT-J (it also has the rotary embeddings, which llama.cpp supports for GPT-J). I would imagine it's not too heavy of a lift to add NeoX architecture support.


Because the firehose of AI/GPT news is a lot to take in, please unpack this comment ELI5-style and provide more definitions.

-

Thank you.

Just so I am clear, "parameters" refers to the total number of node-relation connections between a single node and its neighbors for that Prompt/Label? Or how would you explain this ELI5 style?


Sure! I'll try to briefly summarize, though I'll almost certainly oversimplify. There are a couple of open source language models trained by EleutherAI - the first one was called GPT-J, and it used some newer model architecture concepts. Subsequently, they released a model architected in the likeness of GPT-3, called GPT-NeoX-20B. Functionally, it was quite similar architecturally to GPT-J, just with more parameters. Pythia is a suite of models with the same architecture and training dataset but at different parameter sizes, built to study scaling laws.

DollyV2 is a Pythia model fine-tuned on the Databricks 15K dataset.


Augmenting the answer to address your followup: parameters are any trainable variable in a model's definition. Model training is a process where you basically tweak the parameters in your model and then re-evaluate the model on a metric judging its quality. A lot of models consist of matrix multiplication, so if you are multiplying matrix A of size 2x2 with matrix B of size 2x2 and both matrices can be tweaked, then you've got 8 parameters, since you've got 8 numbers that can be tweaked.
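
To make the 2x2 example concrete, a tiny PyTorch sketch (just illustrating the counting, nothing Dolly-specific):

    import torch

    # two trainable 2x2 matrices -> 8 tweakable numbers, i.e. 8 parameters
    A = torch.nn.Parameter(torch.randn(2, 2))
    B = torch.nn.Parameter(torch.randn(2, 2))
    print(sum(p.numel() for p in (A, B)))  # 8

    # the same count applied to a real layer: weights plus bias
    layer = torch.nn.Linear(1024, 1024)
    print(sum(p.numel() for p in layer.parameters()))  # 1024*1024 + 1024 = 1,049,600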


it's probably simple for Dolly v1 (?) since it was a fine-tuned version of GPT-J

https://github.com/ggerganov/ggml/tree/master/examples/gpt-j

AFAIK there is no .cpp version of Pythia-12B yet


Would you consider adding Pythia-12B, LLaMA and Alpaca, since that's what you're directly compared against / based on?

GPT-3.5/GPT-4 is what everyone would also love to see, but I understand your performance is in line with GPT-NeoX.

Vicuna/GPT4All would be interesting but IMO less important.

RWKV would be interesting because it's a completely different architecture from transformers.

EDIT: Also thanks for the open source contributions! Highly appreciated!


Thank you and congrats to you and the team. This is fantastic


Thank you, thank you, thank you!

If possible, could you share how Dolly v2 compares to RWKV-4 14B ctx 8019?


I got this model working on a GPU instance, notes here: https://til.simonwillison.net/llms/dolly-2
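
The core of it is just the standard transformers pipeline call, roughly like this (I'm going from memory, so check the model card for the exact arguments):

    import torch
    from transformers import pipeline

    generate_text = pipeline(
        model="databricks/dolly-v2-12b",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # the repo ships its own instruction-following pipeline class
        device_map="auto",
    )
    print(generate_text("Explain to me the difference between nuclear fission and fusion."))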

Anyone managed to run it on an M1/M2 Mac yet?


What's the most cost-effective alternative to Paperspace? I had a nightmarish experience with them last week after my account got locked up twice when I was training a model with a 1.5 GB dataset that somewhere contained the string "Minecraft Server".


I picked them almost at random from the list suggested by this Fast.AI course: https://course.fast.ai/Lessons/lesson9.html#links-from-the-l...


I'm not an expert, and I don't have an Nvidia card, but I assume you need to set up CUDA and install the CUDA PyTorch packages?

Most docs I've read on setting up fine-tuners and inference require some extra steps. Taking some LoRA fine-tuners as an example, they include instructions like this:

  conda create -n llm-finetuner python=3.10
  conda activate llm-finetuner
  conda install -y cuda -c nvidia/label/cuda-11.7.0
  conda install -y pytorch=2 pytorch-cuda=11.7 -c pytorch
When I experimented with Stable Diffusion and ROCm (AMD card), I had to do something similar but with pytorch-rocm, and when I was running CPU-only, I used `pytorch-cpu`. So maybe your attempt didn't use the GPU at all, because 12 minutes is about what I got on a CPU for inference on other models of similar size.
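
A quick way to sanity check whether PyTorch is actually seeing an accelerator at all (on a recent PyTorch build), before blaming the model:

    import torch

    print(torch.cuda.is_available())           # True only with a CUDA build plus an NVIDIA GPU
    print(torch.backends.mps.is_available())   # True on Apple Silicon builds with MPS support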


The error message implies that the compiled default libraries on the M1 don't support the model format, even though it works fine in Paperspace.

    The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
    Traceback (most recent call last):
      File "/Users/fragmede/projects/llm/dolly/foo.py", line 5, in <module>
        instruct_pipeline = pipeline(
                            ^^^^^^^^^
      File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 776, in pipeline
        framework, model = infer_framework_load_model(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/base.py", line 271, in infer_framework_load_model
        raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
    ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).


I was referring to his TIL post about setting it up on paperspace, not about apple hardware.


ah, apologies, i misread your comment and was more excited to share since I was able to try on my system.


No worries, it happens. I will admit the way I answered wasn't clear that I was referring to the linked page and not the question in the post. All good.


I attempted using the Transformers library but failed. Not sure, might be a VRAM issue; I'm going to try on my far beefier personal MacBook Pro later tonight.


How much RAM is likely needed on an Apple ARM machine for models like this? And for general use - 64, 96, 128GB? Trying to decide how large I should go for a new laptop.


I very recently purchased a MacBook Pro (M1 Max) with 64GB of ram. I haven't experimented that much, but I was able to run inference using the 65B parameter Llama model with quantized weights at a speed that was reasonably usable (maybe a touch slower than ChatGPT with GPT-4).

I haven't attempted to use the 65B model with non-quantized weights, but the smaller models work that way, if slowly. With 96GB of ram -- the upper limit of a MacBook Pro -- you might be able to use even larger models, but I think you'd hit the limits of useful performance before that point.

I should note that it can be a bit tricky getting things to work using the Mac's GPU. I couldn't get Dolly 6B to run on my work MBP, which theoretically should have enough ram, though I still want to try it on my personal laptop.


I see refurbished m1 2tb/128gb for $4700, looks like similar price for an m2 with same storage/ram with my corp discount (20cpu/48gpu). This is a tough decision.


AFAIK current models can run even with 64GB, but I would assume that we will very likely have bigger models very soon so I guess the answer is as much as you can afford


The next question is M1 or M2, and the impact of the varying number of GPU cores across the Pro, Max and Ultra SKUs. I'm really tempted to buy a "refurbished M1 Studio" with 128GB because I think the RAM is the key. Have not seen any benchmarks comparing the different GPU counts / SKUs.


I saw this: https://github.com/jankais3r/LLaMA_MPS

it runs slightly slower on the GPU than under llama.cpp but uses much less power doing so

I would guess the slowness is due to the immaturity of the PyTorch MPS backend; the asitop graphs show it doing a bunch of CPU work along with the GPU, so it might be inefficiently falling back to the CPU for some ops and swapping layers back and forth (I have no idea, just guessing).


Hey, thanks so much. That solidifies the case for 128gb mac studio. Apple could be selling a bunch of these things with these high ram capabilities.


The answer is as large as you can afford, really. Future, still-unoptimized models are only going to be hungrier for RAM.


same same


Not on M1/M2 yet, but my response time seems pretty fast on Tesla V100-SXM2-16GB


I'm sure we'll see this within a day or two.


Previous [flagged] discussion: https://news.ycombinator.com/item?id=35539085



> As outlined above, these results demonstrate that dolly-v2-12b is not state of the art, and in fact underperforms dolly-v1-6b in some evaluation benchmarks. We believe this owes to the composition and size of the underlying fine tuning datasets, but a robust statement as to the sources of these variations requires further study.

Taking a moment to appreciate the integrity of the team.


Ditto - this is "release early, release often" without necessarily meaning "move fast and break things". Other teams can do the equivalent of what Alpaca did with LLaMA, and we can all learn for the next round.


One of the creators here - yeah, the thing we have our eyes on is the vector not the point.

It’s astounding how adaptable these open models are, even with just a quarter of the Alpaca data. We’re a team of machine learning engineers and hackers, not an AI science lab, but that’s kind of the point frankly - this whole exercise appears to be far easier than it might at first seem.


Why are they not doing metrics against GPT-3.5 and GPT-4? My understanding is Dolly performs significantly worse.


I haven't played with the model just yet - but just eyeballing its performance, it's significantly worse. I'm surprised they don't have Pythia on there, as that's what they're based on, from my understanding.

At their performance level it's most important to compare to GPT-NeoX, and I do appreciate they aren't making the "95% of GPT-4" claims that some fine-tuned LLaMA models are.

EDIT: For Databricks people: I'd love to see this compared with Pythia, LLaMA, Alpaca, and Vicuna/GPT4All if possible.


Out of curiosity: what's an example of a metric that you would use to evaluate the ability of the model? For example, just looking qualitatively, asking Pythia a prompt like "How do I tie a tie?" produces content that isn't even a reasonable response to the question. And yet many benchmarks have no problem with that.


Happy to see this type of work that is truly open source and commercially usable. Is this the entire corpus or a subset? Do you intend to release any new iterations?

I've been thinking of starting similar efforts at another BigCorp by hosting a UL2 or GPT-J instance.


15k is the entire corpus we have right now. Hopefully others can join up in releasing additional samples that can be merged in over time.

We'll definitely keep iterating on Dolly and releasing everything openly.


I’m not seeing how 15k Q&A training examples can get you much beyond the simplest things. Maybe that’s the point: get the ball rolling for people to add more training data?


What reasons do you have for believing that is true?

It seems plausible to me that a general autoregressive LLM that is capable of completing text wouldn't take that much fine-tuning to shift it from "text completion" to "instruction following".

After all, the raw GPT3 model can be made to follow instructions with just a few examples.

Consider the prompt:

    What is the capital of France?
Raw GPT3, not the newer instruction-tuned variants, does not understand it's being asked a question. It offers the completion:

    What is the capital of France? If a student answers with a word, 
    she is asked to identify the word. She is not asked whether the 
    capital of France is Paris. On the other hand, if the student
    answers by pointing to a map, she is asked to identify the capital
    of France. She is not asked whether it is Paris.
It just starts appending to the text.

But if you give it a few examples, it happily gets into instruction following mode:

    The following is a transcript between a human and a helpful
    AI assistant who answers questions and obeys commands.

    Human: How many eggs are in a dozen?
    AI: 12
    Human: Say "hello" 3 times
    AI: hello hello hello
    Human: What is the capital of France?
    AI: 
GPT3 completes "Paris" here.

If you can get decent instruction/question following behavior out of a 2-shot example prompt, why do you think 15k is small for this?
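
If anyone wants to reproduce that, here's a rough sketch with the transformers library; GPT-J-6B is just an example of a raw, non-instruction-tuned model (smaller models will give flakier answers):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/gpt-j-6B"  # example raw completion model, not instruction tuned
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    prompt = (
        "The following is a transcript between a human and a helpful\n"
        "AI assistant who answers questions and obeys commands.\n\n"
        "Human: How many eggs are in a dozen?\n"
        "AI: 12\n"
        "Human: Say \"hello\" 3 times\n"
        "AI: hello hello hello\n"
        "Human: What is the capital of France?\n"
        "AI:"
    )
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    # print only the newly generated tokens, i.e. the model's "answer"
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))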


N-shot at inference-time is fundamentally different from training/fine-tuning which is inherently pre-inference-time.

Though it would be interesting to know if OpenAI has a few generic multishot inputs before the prompt.

It's all extremely cryptic what the actual context window and system prompt are with them (assuming ChatGPT is even using the same API the proles are given).


The claim is not that they are fundamentally different or similar, the claim is that one doesn't need that much data to get instruction-following behavior from a raw autoregressive LLM. K-shot prompting shows that the capability to follow instructions is present in the model. It's just a matter of using fine-tuning to keep the model in that frame all the time without a K-shot prompt.


Just saying, if you ask for the capital of an obscure country that it hasn’t been trained on, you will not get the answer, so 15k will only get you some general stuff within those confines. Also, for code you will need pretty complete documentation for it to ingest, and then enough examples of how the code is written.


15k is not the full training corpus. The model is trained on huge swaths of internet text. 15k is just the fine-tuning corpus to show it how to follow instructions. Stuff like world capitals and such are already present in the model weights due to being trained on tons of internet text.

With the raw LLM, you can get the capital of Mongolia with the prompt "The capital of Mongolia is", i.e. text completion. The fine-tuning allows you to get at that information by asking questions or giving commands, e.g. "Tell me the capital of Mongolia"


It's used for fine-tuning a pre-trained model. This takes an LLM that is already capable of emulating lots of different kinds of personalities, and narrows it down to act more like the examples. Since the heavy lifting has already been done, 15k examples of a chatbot following instructions the way you want has a significant effect.
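
A toy sketch of what that fine-tuning step looks like mechanically; the tiny base model, the prompt template and the Hub dataset name here are placeholders for illustration, not what Dolly actually uses:

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "EleutherAI/pythia-70m"  # tiny stand-in; Dolly v2 fine-tunes a 12B Pythia model
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # assumes the 15k corpus is mirrored on the Hugging Face Hub under this name
    ds = load_dataset("databricks/databricks-dolly-15k", split="train")

    def to_features(ex):
        # made-up template: instruction followed by the response
        text = f"Instruction: {ex['instruction']}\n\nResponse: {ex['response']}{tok.eos_token}"
        return tok(text, truncation=True, max_length=512)

    tokenized = ds.map(to_features, remove_columns=ds.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="dolly-sketch",
                               per_device_train_batch_size=4,
                               num_train_epochs=1),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM: labels = inputs
    )
    trainer.train()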


Read about RLHF; I think you are misunderstanding what this will be used for.


A specific reference would help readers.


good point! https://huggingface.co/blog/rlhf :)

I think the resources out there so far are not great yet.


Great to see more releases under open licenses!


> instruction: Why mobile is bad for human

> response: We are always engaged one phone which is not good.

Curious about the poor grammar in the response. Is it intentionally mimicking the style of the input instruction?


Anyone wanna convert this to GGML so we can run it with LLaMa.cpp?


"dolly-v2-12b is not a state-of-the-art generative language model and, though quantitative benchmarking is ongoing, is not designed to perform competitively with more modern model architectures or models subject to larger pretraining corpuses." from: https://huggingface.co/databricks/dolly-v2-12b


How does this compare to OpenAI? Curious if anyone has any anecdotes.


We don't expect this to be as good as the latest OpenAI GPT release. This is just to demonstrate that developing a conversation agent using an existing foundation model is not as hard as some may assume. Take a foundation model that is not capable of Q&A, tune it with a fairly small Q&A dataset, and you get your in-house ChatGPT.

Disclaimer: I work at Databricks.


Thanks for the feedback. The potential edge with Dolly is huge. Building a firewalled model with a custom corpus is a big deal. I have been experimenting with OpenAI, and even with public data (but really limited to the domain) it yields great improvements (OpenAI may be stale because of its training cutoff). I am excited to see where Dolly goes.


Dolly appears to fundamentally be a tech demo advertising how you can use Databricks for compute. I honestly wouldn't expect them to take it that much further, particularly in the context of larger models that would be significantly more expensive to fine-tune. But I'm happy to be proven wrong.


I imagine they will sell fine tuning as a service to Databricks customers. If I put all my data into their lake I too can get my own custom ChatGPT. That's compelling.


I also see that as the use case and would find it useful. However I feel this is somewhat low-budget so far coming from such a large company.


We plan to continue working on it and invest more.


Are you referring to the Dolly model? I think the training set could achieve similar performance if we were to fine-tune a similarly sized model.


I don't think these upvotes are organic.


I don’t think you fully appreciate the value of the training corpus.


Why do you say that?


probably based on the situation 3hrs ago - https://news.ycombinator.com/item?id=35539085


There's a big difference between employees who got excited to see their work on Hacker News and upvoted it, and a premeditated shill/astroturf campaign. We should pretty much assume that a San Francisco based company is going to have significant readership/membership here.

One can easily see how a message over a company communicator could result in a surge of upvotes.


Agreed. I was just providing the context that the user asked for.


Mine was.


Amazing.

Love databricks.


Databricks is fine. I wasn't happy using it until they implemented the ability to work in a git repo, with proper file support, but that's gone some way to making it more usable for me. The interface sucks pretty hard, slowing down and using a significant amount of memory with only a modestly high number of cells (where a JupyterLab notebook would remain very snappy). I also wish there were a better story for local development; they've addressed this to some degree recently, but I'm not sold on their solution.

It's certainly better than what we did prior to Databricks, which was roll our own in-house provisioning and notebook solution. I won't/can't go into too many details, but not only was it cumbersome and very buggy, but it was as if they designed it to encourage data scientists to spend as much money on compute as possible (only to panic at the millions they were spending). They dropped it for cost reasons, which is hilarious given how expensive Databricks is.

I do appreciate the work Databricks have done improving Spark. Capabilities like adaptive query execution have made optimization significantly easier.


When you say you wish they had a "better story for local development," what do you mean? What do you wish for?



