Notus

Meet Notus-7B: Data Curation and Open Science go a long way in shaping AI's future

December 1, 2023

Álvaro Bartolomé, Gabriel Martín, Daniel Vila Suero

Notus 7B is a new open-source LLM, fine-tuned using Direct Preference Optimization (DPO) and AI Feedback (AIF) techniques. The model is fine-tuned on a new, curated version of the UltraFeedback dataset.

Following a data-first approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO.

In particular, when we started building distilabel, we invested time in understanding and deep-diving into the UltraFeedback dataset. Using Argilla, we found data issues in the original UltraFeedback dataset that led to high scores for bad responses (more details in the training data section). After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique overall_score, and verified the new dataset with Argilla.

Using preference ratings instead of critique scores led to a new dataset where the chosen response differs in ~50% of the cases. With this new dataset, we used DPO to fine-tune Notus, a 7B model that surpasses both Zephyr-7B-beta and Claude 2 on the AlpacaEval benchmark.

🙌 This model wouldn't have been possible without the amazing Alignment Handbook and OpenBMB's release of the UltraFeedback dataset, and it builds on fruitful discussions with the Hugging Face H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out-of-the-box and enabled us to focus on what we do best: high-quality data.

If you are as excited as we are about the synergy between human and AI feedback and how well DPO works for improving LLMs, check out ⚗️distilabel. We'd love contributions and early feedback!

Model summary

Data

Let's start with the most important part: the data! DPO and other RLHF methods (like reward models for PPO) use a type of dataset called "preference data", "preferences", or "comparison" data. A preference example consists of a prompt and several responses, together with a human (or AI) judgement essentially answering the question: which response is preferred or more appropriate? An amazing open preference dataset is UltraFeedback, used to power Zephyr, Tulu, and now Notus.
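For intuition, a single binarized preference record used for DPO conceptually looks like the sketch below. The field names are illustrative only, not the exact schema of UltraFeedback or our curated dataset:

# Illustrative example of a binarized preference record for DPO.
# Field names are hypothetical; the real datasets use their own schemas.
preference_example = {
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "chosen": "A list is mutable, so you can append or remove items; a tuple is immutable...",  # preferred response
    "rejected": "They are the same thing.",  # dispreferred response
}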

Notus uses a curated version of openbmb/UltraFeedback, named argilla/ultrafeedback-binarized-preferences.

After visually browsing some examples using Argilla's sort and filter features (sorting by highest rating for chosen responses), we noticed a strong mismatch between the overall_score in the original UltraFeedback dataset (and the Zephyr train_prefs dataset) and the quality of the chosen response.

By adding the critique rationale to our Argilla dataset, we confirmed that the critique rationale was highly negative, whereas the overall score was very high (the highest possible, in fact: 10).

See screenshot below for one example of this issue.

[Screenshot: an example record where the critique rationale is highly negative but the overall_score is 10]

After some quick investigation, we identified hundreds of examples having the same issue, reported a bug on the UltraFeedback repo, and informed the H4 team.

While we work on fixing the original dataset (we've already narrowed the issue down to ~2K problematic examples), we decided to leverage the multi-preference ratings, leading to Notus!

The chart below explains the difference between the data used by Zephyr and Notus. Zephyr used the critique scores, while we decided to use the mean of preference ratings for each of the different preference aspects, namely: helpfulness, honesty, instruction-following, and truthfulness.

[Chart: Zephyr's critique-score-based binarization vs. Notus' mean-of-preference-ratings binarization]

Important note: We opted for the average of the preference ratings while we fix the dataset, but there's still a very interesting open question: once the data is fixed, what works better, the critique scores or the preference ratings? We're very excited to run this comparison in the coming weeks, so stay tuned!
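To make the approach concrete, here is a minimal sketch of how binarizing by the mean of the preference ratings could look. The field names (completions, annotations, instruction, response, the four aspect keys) are assumptions for illustration, and the choice of the lowest-rated completion as the rejected response is a simplification, not the exact script we used:

# Minimal sketch: pick chosen/rejected pairs by the mean of per-aspect ratings.
# Field names are assumptions for illustration, not a verified schema.
ASPECTS = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

def mean_rating(completion: dict) -> float:
    """Average the per-aspect ratings of a single model completion."""
    return sum(float(completion["annotations"][a]["Rating"]) for a in ASPECTS) / len(ASPECTS)

def binarize(example: dict) -> dict:
    """Turn one multi-completion record into a (chosen, rejected) pair."""
    ranked = sorted(example["completions"], key=mean_rating, reverse=True)
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],    # highest mean rating
        "rejected": ranked[-1]["response"], # lowest mean rating
    }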

You can find more details about the dataset analysis and curation on the ultrafeedback-binarized-preferences dataset card.

Performance

Chat benchmarks

Table adapted from Zephyr-7b-β's and Starling's original tables for the MT-Bench and AlpacaEval benchmarks. Results are sorted by AlpacaEval win rate and omit some >7B models for brevity.

Notus stays on par with Zephyr on MT-Bench, while surpassing Zephyr and Claude 2 on AlpacaEval, making it the most competitive commercially-usable 7B model on AlpacaEval.

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) | License |
|-------|------|-----------|------------------|-------------------------|---------|
| GPT-4-turbo | - | ? | 9.32 | 97.70 | Proprietary |
| XwinLM 70b V0.1 | 70B | dPPO | - | 95.57 | LLaMA 2 License |
| GPT-4 | - | RLHF | 8.99 | 95.03 | Proprietary |
| Tulu 2+DPO 70B V0.1 | 70B | dDPO | 6.29 | 95.28 | Proprietary |
| LLaMA2 Chat 70B | 70B | RLHF | 6.86 | 92.66 | LLaMA 2 License |
| Starling-7B | 7B | C-RLFT + APA | 8.09 | 91.99 | CC-BY-NC-4.0 |
| Notus-7b-v1 | 7B | dDPO | 7.30 | 91.42 | MIT |
| Claude 2 | - | RLHF | 8.06 | 91.36 | Proprietary |
| Zephyr-7b-β | 7B | dDPO | 7.34 | 90.60 | MIT |
| Cohere Command | - | RLHF | - | 90.62 | Proprietary |
| GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 | Proprietary |

Academic benchmarks

Results from the Open LLM Leaderboard:

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | DROP |
|-------|---------|-----|-----------|------|------------|------------|-------|------|
| Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 12.74 | 9.66 |
| argilla/notus-7b-v1 | 52.89 | 64.59 | 84.78 | 63.03 | 54.37 | 79.4 | 15.16 | 8.91 |

⚠️ As pointed out by AllenAI researchers, UltraFeedback contains prompts from the TruthfulQA dataset, so the results we show on that benchmark are likely not accurate for either Notus or Zephyr. When we trained Notus, we were not aware of this issue, so training used the TruthfulQA prompts and preferences included in UltraFeedback. For future releases, we will remove TruthfulQA prompts.
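As a hedged illustration of what that decontamination could look like (the column names instruction and question are assumptions based on the public dataset cards, not our final script):

# Illustrative sketch: drop UltraFeedback prompts that also appear in TruthfulQA.
# Column names ("instruction", "question") are assumptions, not a verified schema.
from datasets import load_dataset

ultrafeedback = load_dataset("openbmb/UltraFeedback", split="train")
truthfulqa = load_dataset("truthful_qa", "generation", split="validation")

truthfulqa_prompts = {q.strip().lower() for q in truthfulqa["question"]}

decontaminated = ultrafeedback.filter(
    lambda ex: ex["instruction"].strip().lower() not in truthfulqa_prompts
)
print(f"Kept {len(decontaminated)} of {len(ultrafeedback)} examples")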

Training Details

Training Hardware

We used a VM with 8 x A100 40GB GPUs hosted on Lambda Labs, but while experimenting we also explored other cloud providers such as GCP.
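For context, a DPO fine-tuning run in the spirit of the Alignment Handbook recipe looks roughly like the sketch below, using TRL's DPOTrainer. The starting checkpoint, hyperparameters, and assumed dataset columns (prompt, chosen, rejected) are placeholders for illustration, not the exact Notus configuration; refer to the Alignment Handbook for the actual recipe.

# Rough sketch of a DPO fine-tuning run with TRL's DPOTrainer.
# Hyperparameters and dataset handling are placeholders, not the exact Notus setup.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "alignment-handbook/zephyr-7b-sft-full"  # SFT starting point used by the Zephyr recipe
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset assumed to expose "prompt", "chosen" and "rejected" columns
train_dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

trainer = DPOTrainer(
    model,
    ref_model=None,  # TRL creates the frozen reference model internally when None
    args=TrainingArguments(
        output_dir="notus-7b-dpo",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-7,
        num_train_epochs=1,
        bf16=True,
    ),
    beta=0.1,  # strength of the preference regularization toward the reference model
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()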

Prompt template

We use the same prompt template as HuggingFaceH4/zephyr-7b-beta:

<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>

Usage

You will first need to install transformers and accelerate, then you can run any of the following:

Via generate

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in bfloat16 and spread it across the available devices
model = AutoModelForCausalLM.from_pretrained("argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
    },
    {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
]
# Build the chat-formatted prompt, tokenize it, and generate a response
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt", add_special_tokens=False, add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), num_return_sequences=1, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

Via pipeline method

import torch
from transformers import pipeline

# Create a text-generation pipeline with the model in bfloat16 across available devices
pipe = pipeline("text-generation", model="argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
    },
    {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
]
# Render the chat template to a plain string and generate from it
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
generated_text = outputs[0]["generated_text"]

If you like this work and are interested in data quality, AIF, LLMs, and NLP, join our Discord and leave a star on our repos: distilabel and argilla!