RLHF and alternatives: Overview

Introduction

This is a series of blog posts related to alternatives to Reinforcement Learning from Human Feedback, created as a joint effort between the Argilla and MantisNLP teams. Please make sure you have gone through the first, second, third, fourth, fifth, sixth, seventh, and eighth entries in the series to fully understand the context and progression of the discussion before moving on to this segment.

In our previous blog posts, we've delved into various preference alignment algorithms, many of which have demonstrated promising results. However, a critical question arises: how do we select the most appropriate algorithm for our specific needs? Given that most evaluations in their respective studies were benchmarked against RLHF and DPO, this blog post aims to comprehensively analyze all previously discussed algorithms by examining their diverse attributes, advantages, and drawbacks in a side-by-side comparison.

A Brief Recap

Before we delve into the comparison, let's briefly recap the preference alignment algorithms we've discussed so far:

Reinforcement Learning from Human Feedback (RLHF)

In the context of LLMs, RLHF is a paradigm in which an agent learns to make decisions by receiving human feedback from reviewers who assess and rate the model's responses. By leveraging human expertise and judgments, reinforcement learning facilitates the iterative improvement of the model's performance and fine-tunes its responses.

The process starts with Supervised Fine-tuning (SFT), where the initial training takes place. Following this, a Reward Model (RM) is trained to evaluate the responses generated by the SFT model, assessing them for accuracy, relevance, and adherence to guidelines. Finally, Proximal Policy Optimization (PPO) is employed to further refine the SFT model, using a combination of prompts, responses, and the rewards determined by the RM.

Direct Preference Optimization (DPO)

Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Github repository here.

DPO simplifies the framework introduced by RLHF by aligning the LLM with human preferences without requiring the Reinforcement Learning step. It treats the constrained reward maximization problem as a classification problem on human preference data, directly defining the preference loss as a function of the Policy. Thus, DPO incorporates a two-step approach: applying SFT on the dataset(s) of interest and running preference learning on the SFT model using preference data.
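To make the "classification on preference data" idea concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes we already have the summed log-probabilities of each response under the policy and under the frozen reference (SFT) model. The function names and the default `beta=0.1` are illustrative assumptions, not values taken from this post.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    All inputs are summed sequence log-probabilities.
    beta=0.1 is an assumed default, not a recommendation.
    """
    # Implicit rewards: scaled log-ratio between policy and reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


# The loss is lower when the policy favours the chosen response
# more strongly than the reference model does.
print(dpo_loss(-5.0, -9.0, -7.0, -7.0))
print(dpo_loss(-9.0, -5.0, -7.0, -7.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2, the value of a binary classifier that has not yet learned anything.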

Chain of Hindsight (CoH)

Paper: Chain of Hindsight aligns Language Models with Feedback. Github repository here.

Chain of Hindsight also gets rid of the Reinforcement Learning step. The key idea is that humans are capable of learning from rich and detailed feedback in the form of comparisons, and so are LLMs. Thus, it applies SFT on human preference data in the form of contrastive information by converting all types of feedback into sequences of sentences.

Reinforcement Learning from AI Feedback (RLAIF)

Paper: A Critical Evaluation of AI Feedback for Aligning Large Language Models. Github repository here.

RLAIF is a novel approach that eliminates the need for human feedback by leveraging AI feedback. In this schema, the AI assistant incorporates feedback from another LLM rather than from humans, while being guided by the constitution (a set of humanly curated principles to influence the behavior of the AI assistant). Given one prompt and two responses to that prompt (in prompt-response tuples, duplicating the prompt), the RM from AI Feedback generates a score (between 0 and 1) for each pair in concordance with the constitution. The rest of the procedure is similar to RLHF.

Self-Play fIne-tuNing (SPIN)

Paper: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. Github repository here.

SPIN incorporates feedback without depending on human or AI feedback. The idea is that the model competes against its previous version without requiring direct supervision and creates its own training data. Through successive rounds of self-play, the model gradually improves, aiming to bring its responses closer and closer to human-like ones. The process ends when the most sophisticated version of the LLM can no longer discern between responses generated by its predecessor and those generated by humans.

Identity Preference Optimization (IPO)

Paper: A General Theoretical Paradigm to Understand Learning from Human Preferences. Github repository here.

IPO optimizes preferences without relying on a reward model and keeps the KL regularization effective even in scenarios where preferences are deterministic. By replacing the logit function with the identity function, IPO optimizes preferences directly (learning from pairwise preferences instead of the logit preferences).
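As a rough illustration, the IPO objective for one preference pair can be written as a squared error that pulls the log-ratio gap between the chosen and rejected responses towards the fixed target 1/(2τ). The sketch below assumes summed sequence log-probabilities as inputs. The function name and the default `tau=0.1` are assumptions made for the example.

```python
def ipo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, tau=0.1):
    """IPO loss for one (chosen, rejected) pair.

    tau is the regularization strength; 0.1 is an assumed default.
    """
    # Gap between the policy/reference log-ratios of both responses.
    h = ((policy_chosen_logp - ref_chosen_logp)
         - (policy_rejected_logp - ref_rejected_logp))
    # Regress the gap onto 1 / (2 * tau) instead of pushing it to infinity.
    return (h - 1.0 / (2.0 * tau)) ** 2


# Overshooting the target gap is penalized as well.
print(ipo_loss(-5.0, -10.0, -7.0, -7.0))   # gap equals the target
print(ipo_loss(-5.0, -20.0, -7.0, -7.0))   # gap far above the target
```

Because the loss grows again once the gap exceeds the target, the policy is not rewarded for drifting arbitrarily far from the reference, which is how the regularization stays effective with deterministic preferences.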

Kahneman-Tversky Optimization (KTO)

Paper: KTO: Model Alignment as Prospect Theoretic Optimization. Github repository here.

Based on Kahneman & Tversky’s prospect theory, KTO directly maximizes the utility of generations instead of maximizing the log-likelihood of the preferences. It requires a binary signal of whether output is desirable or not, and works by adding a KL penalty that rises if the model increases the reward of a desirable example in a generic way. This forces the model to learn what makes an output desirable so that the reward can be increased while the KL is kept flat.

Odds Ratio Preference Optimization (ORPO)

Paper: ORPO: Monolithic Preference Optimization without Reference Model. Github repository here.

ORPO combines instruction tuning and preference alignment in a single process, making it reference model-free and computationally more efficient. It creates a new objective by using an odds ratio-based loss to penalize undesirable responses along with conventional negative log-likelihood loss (NLL), allowing it to distinguish between favorable and unfavorable responses.
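The following sketch shows how the two terms could be combined for a single pair: the usual NLL on the chosen response plus a weighted odds-ratio penalty. It assumes length-normalized (average per-token) log-probabilities that are strictly negative. The function names and the weight `lam=0.1` are illustrative assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_odds(avg_logp):
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """ORPO-style loss for one pair; lam=0.1 is an assumed weight."""
    # Conventional SFT term on the chosen response.
    nll = -chosen_avg_logp
    # Odds-ratio term: -log(sigmoid(log odds ratio)).
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    odds_ratio_term = softplus(-log_odds_ratio)
    return nll + lam * odds_ratio_term


print(orpo_loss(-0.5, -2.0))  # chosen is more likely: small penalty
print(orpo_loss(-2.0, -0.5))  # rejected is more likely: large penalty
```

Note that no reference model appears anywhere in the computation, only the model being trained.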

Joint Preference Optimization (DOVE)

Paper: Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization. Github repository here.

DOVE aligns language models by learning from preferences over instruction-response pairs. It adjusts the conditional probability of preferred responses over non-preferred ones, along with a correction factor based on the prior probability of the instructions according to the language model.

Simple Preference Optimization (SimPO)

Paper: SimPO: Simple Preference Optimization with a Reference-Free Reward. Github repository here.

SimPO aims to align the reward function with the metric that guides generation, which makes the objective more intuitive. It directly uses the length-normalized likelihood that guides generation as the reward, removing the need for a reference model, and adds a target reward margin to the Bradley-Terry objective.
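A small sketch of this objective for one preference pair is shown below. It takes the summed log-probability and the token count of each response, so the reward is the average log-likelihood scaled by beta. The function names and the defaults `beta=2.0` and `gamma=1.0` are assumptions chosen for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_logp, chosen_len, rejected_logp, rejected_len,
               beta=2.0, gamma=1.0):
    """SimPO-style loss for one pair (beta and gamma are assumed values)."""
    # Reference-free reward: length-normalized log-likelihood.
    chosen_reward = beta * chosen_logp / chosen_len
    rejected_reward = beta * rejected_logp / rejected_len
    # Bradley-Terry objective with a target margin gamma.
    return softplus(-(chosen_reward - rejected_reward - gamma))


# Same average log-likelihood per token: only the margin contributes.
print(simpo_loss(-10.0, 10, -20.0, 20))
# Chosen response clearly more likely per token: lower loss.
print(simpo_loss(-5.0, 10, -40.0, 20))
```

The length normalization means a longer response is not penalized just for having more tokens, and the margin asks the chosen response to beat the rejected one by at least gamma.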

What Does the Input Data Look Like?

Most of the mentioned methods follow two defined stages: instruction tuning and preference alignment. The first one implies SFT and requires instruction data, that is, a dataset containing an instruction and a response. The second one requires preference data, which is a collection of alternatives to a prompt rated by human annotators from the best to the worst. Then, these ranked responses are normally organized in pairs of chosen/rejected, e.g. A > B > C > D will create the following pairs: (A > B), (A > C), (A > D), (B > C), (B > D), (C > D). This is the case for PPO, DPO, IPO, ORPO and SimPO.
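The expansion from a ranking to chosen/rejected pairs can be sketched in a few lines of Python. A ranking of n responses yields n(n-1)/2 pairs. The function name and the dictionary keys are hypothetical and only mirror the column names used in the examples of this post.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Turn a best-to-worst ranking into chosen/rejected pairs."""
    # combinations() keeps the input order, so the first element of
    # each pair is always ranked higher than the second one.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs("How do you plant potatoes in pots?",
                         ["A", "B", "C", "D"])
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```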

*Example of instruction data.*
instruction	response
Human: How do you plant potatoes in pots?	Assistant: That depends. Are you planting seeds in a mixture of soil? Or are you putting live potatoes in pots?
Human: Who was Glyn Johns?	Assistant: He was an English musician who was a sound engineer and record producer.

*Example of preference data.*
prompt	chosen	rejected
Human: How do you plant potatoes in pots?	Assistant: That depends. Are you planting seeds in a mixture of soil? Or are you putting live potatoes in pots?	Assistant: Would you like me to explain how to plant potatoes in pots or give you instructions for doing it?
Human: Who was Glyn Johns?	Assistant: He was an English musician who was a sound engineer and record producer.	Assistant: Oh, that’s a weird question. I’m going to be honest with you - I don’t know, and Google can’t answer it. But I’ll explain what’s going on, and why I don’t know.

For DOVE, instead of comparing responses based on identical instructions, it compares pairs of instruction-response.

*Example of joint preference data.*
chosen	rejected
Human: How do you plant potatoes in pots? Assistant: That depends. Are you planting seeds in a mixture of soil? Or are you putting live potatoes in pots?	Human: Who was Glyn Johns? Assistant: He was an English musician who was a sound engineer and record producer.
Human: What is a verb? Assistant: A verb is an action word that describes an activity.	Human: What is REST API? Assistant: In practice, most REST APIs are using JSON over HTTP. REST APIs have several properties that make them very popular.

However, preparing preference data for alignment algorithms can be challenging. As a result, some methods autonomously generate their own preference data to enhance alignment throughout the training process, allowing for continuous refinement of the model's alignment with desired outcomes without requiring extensive pre-existing datasets of human preferences. This is the case for SPIN. Initially, it relies only on a high-quality instruction dataset, where the completion column is treated as the real/chosen response, and the rejected responses are generated by the model at each iteration to build a preference dataset (you can check some examples here). On the other hand, RLAIF directly takes as input the prompt with the possible responses (previously generated or generated by the same LLM) without any label, as it is the LLM that is in charge of labeling them before they are fed to the reward model.
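One SPIN-style data-building step could look roughly like the sketch below: the reference completion is kept as chosen, and whatever the current model generates becomes rejected. The `generate` callable stands in for the model of the current iteration and is purely hypothetical, as are the function name and the dictionary keys.

```python
def build_spin_pairs(instruction_rows, generate):
    """Build a preference dataset from instruction data for one iteration.

    instruction_rows: dicts with "prompt" and "completion" keys (assumed).
    generate: callable mapping a prompt to the current model's response
              (hypothetical stand-in for a real generation call).
    """
    return [
        {
            "prompt": row["prompt"],
            "chosen": row["completion"],          # ground-truth response
            "rejected": generate(row["prompt"]),  # model's own generation
        }
        for row in instruction_rows
    ]


rows = [{"prompt": "Who was Glyn Johns?",
         "completion": "He was an English musician who was a sound "
                       "engineer and record producer."}]
# A dummy generator replaces the model in this toy example.
print(build_spin_pairs(rows, lambda prompt: "I don't know."))
```

In a real run this step would be repeated at every iteration, each time with the newly trained model acting as the generator.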

In the case of KTO, although it has proved to benefit from preference data, its main advantage is that it doesn’t need it. Instead, it can work with binary signal data, that is, each annotation consists of a positive signal (+1) if the response is useful or acceptable, or a negative one (-1) if it is not. This way, KTO also avoids the need for a pair-balanced dataset, as it can handle a desirable:undesirable ratio of, for instance, 1:10.

*Example of input data for KTO.*
prompt	label
Human: How do you plant potatoes in pots? Assistant: That depends. Are you planting seeds in a mixture of soil? Or are you putting live potatoes in pots?	👍
Human: How do you plant potatoes in pots? Assistant: Would you like me to explain how to plant potatoes in pots or give you instructions for doing it?	👎
Human: Who was Glyn Johns? Assistant: He was an English musician who was a sound engineer and record producer.	👍
Human: Who was Glyn Johns? Assistant: Oh, that’s a weird question. I’m going to be honest with you - I don’t know, and Google can’t answer it. But I’ll explain what’s going on, and why I don’t know.	👎
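If we already have paired preference data, it can be flattened into this binary-signal format with a short helper like the one below. This is only a sketch: the function name and the `prompt`/`completion`/`label` keys are assumptions made for the example.

```python
def preference_to_binary(preference_rows):
    """Split each chosen/rejected pair into two independently labeled rows."""
    examples = []
    for row in preference_rows:
        # The chosen response becomes a desirable example (thumbs up).
        examples.append({"prompt": row["prompt"],
                         "completion": row["chosen"],
                         "label": True})
        # The rejected response becomes an undesirable one (thumbs down).
        examples.append({"prompt": row["prompt"],
                         "completion": row["rejected"],
                         "label": False})
    return examples


pairs = [{"prompt": "Who was Glyn Johns?",
          "chosen": "He was an English musician who was a sound "
                    "engineer and record producer.",
          "rejected": "Oh, that's a weird question."}]
for example in preference_to_binary(pairs):
    print(example["label"], "-", example["completion"])
```

The reverse is not required: binary labels can also be collected directly, one response at a time, without ever forming pairs.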

For CoH, the input data is a single sequence that includes the model outputs with feedback about their correctness, similar to how a person would explain that one answer is preferable over the other.

*Example of input data for CoH.*
User: How do you plant potatoes in pots? A helpful answer: That depends. Are you planting seeds in a mixture of soil? Or are you putting live potatoes in pots? An unhelpful answer: Would you like me to explain how to plant potatoes in pots or give you instructions for doing it?
User: Who was Glyn Johns? This answer "He was an English musician who was a sound engineer and record producer" is better than this answer "Oh, that’s a weird question. I’m going to be honest with you - I don’t know, and Google can’t answer it. But I’ll explain what’s going on, and why I don’t know".
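Such sequences can be produced from a regular preference pair with simple string templates. The sketch below reproduces the first template shown above. The function name is hypothetical, and the exact wording of the feedback phrases is an assumption based on that example.

```python
def to_coh_sequence(prompt, chosen, rejected):
    """Format a preference pair as a single Chain-of-Hindsight sequence."""
    # Natural-language feedback marks which answer is the better one.
    return (f"User: {prompt} "
            f"A helpful answer: {chosen} "
            f"An unhelpful answer: {rejected}")


print(to_coh_sequence(
    "How do you plant potatoes in pots?",
    "That depends. Are you planting seeds in a mixture of soil?",
    "Would you like me to explain how to plant potatoes in pots?",
))
```

Other feedback phrasings, such as the comparison form of the second example, could be handled by adding more templates and picking one per training example.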

So, as mentioned in previous posts, many preference alignment methods require high-quality human preference data, which can be both time-consuming and costly to acquire. Conversely, data for approaches that rely only on prompts and responses is easier to obtain. The same holds for DOVE, which ranks prompt-response pairs, and for KTO, which doesn’t need paired preference data, so annotation is smoother. Besides, to reproduce similar experiments, we can use the same datasets as in the original papers, as most of them are publicly available.

The table below shows the datasets used during experimentation and their original size. As we can observe, Anthropic-HH is the most common dataset and a relatively large one. In general, while individual datasets are roughly similar in size, when combined their volume is substantial, once again highlighting the challenges in acquiring a large, well-curated dataset. The case of SPIN is noteworthy: since most of the data is generated during the process, the initial data size is small.

*Datasets used to perform the experiments and their size. IPO paper does not reference any experimental setup.*
Method	Dataset	Size
DPO	Anthropic-HH (single-turn dialogue), IMDB (controlled sentiment generation), TL;DR (summarization)	170K, 50K, 1.33M
CoH	Combination of WebGPT, Anthropic-HH, Summarization	19.5K, 170K, 179K
RLAIF	ShareGPT	125K
SPIN	Ultrachat200k	Only used 50K
KTO	Combination of Anthropic-HH, Oasst1, SHP	170K, 84.4K, 349K
ORPO	Anthropic-HH, UltraFeedback Binarized	170K, 62K
DOVE	Anthropic-HH, TL;DR	170K, 1.33M
SimPO	Ultrachat200K, UltraFeedback Binarized	200K, 62K

Can We Reproduce Them?

Considering the importance of these research studies, it is essential to reproduce them and continue experimenting with those approaches.

In this context, Transformer Reinforcement Learning (TRL) is an open-source library from Hugging Face, offering a straightforward framework to train transformer language models with Reinforcement Learning. At the time of writing, TRL can be used for SFT, RM, PPO, DPO, IPO (by setting loss_type="ipo" in the DPOTrainer), KTO, and ORPO. Axolotl and LLaMA-Factory, two tools to fine-tune LLMs, also implement SFT, RM, PPO, DPO, and ORPO. For SPIN, we also released the distilabel-spin-dibt repository with code to reproduce the experiments using distilabel. In the other cases, useful code and information can be found in the original repositories mentioned in the recap section, although reproducing the results with a different dataset or base model may be somewhat harder.

The alignment handbook of Hugging Face also provides us with some recipes to perform these experiments, e.g., Constitutional AI or the comparison of the DPO, IPO, and KTO methods, and to understand how some models were trained, such as the case of Zephyr-7b-beta, Zephyr 7B Gemma or Zephyr-141B-A35B with ORPO.

Regarding the base models, the most recent studies use open models such as the Mistral-7B models (SPIN, KTO, SimPO), Llama (RLAIF, KTO, ORPO, SimPO), Phi-2 (ORPO), or the OPT ones (CoH, ORPO), among others. This is quite encouraging, as it allows us to easily replicate the research for verification or even improvement. However, RLAIF relies on GPT-3.5/4 or claude-v1, which are closed-source models, to label the data, so using their APIs requires some investment.

In our case, we have also conducted research and performed experiments with these algorithms. Using argilla and distilabel, we generated some meaningful datasets to align the models. For instance, you can find our collection of datasets as preference data for DPO or ORPO among others (highlighting some of them as ultrafeedback-binarized, distilabel-orca or capybara-preferences), prepared for KTO or for SPIN. During the process, we could reaffirm that well-curated data not only enhances the results but also allows for the reduction of the dataset size without compromising, and potentially even improving, the results. That was the case for the replication of SPIN with only 1.8K prompts in comparison to the 50K used in the original paper (you can find the datasets and models here). We could also observe this in ORPO, where only 7K instances of the capybara dataset were used to fine-tune the model, achieving comparable results using Zephyr, although we will still scale the experiments.

Instruction tuning, Preference tuning, or Combining Both?

SFT is key in RLHF, as most of the approaches rely on it to perform the preference alignment. However, the process of preference tuning can vary, involving several steps or being unified in a single one. In the standard approach, RLHF requires three steps: SFT, RM, and PPO. RLAIF follows the same stages: although it avoids human-annotated data, once the data is labeled by the LLM, the same stages as in RLHF are performed.

As a result, to reduce costs and time, new methods arose such as DPO or CoH, which combined RM and PPO into a single step. Those were followed by IPO, which addresses the DPO overfitting issue; KTO, which removes the need for paired preference data while leading to comparable results; and DOVE, which changes the type of preference data. SimPO also makes training faster, as it gets rid of the reference policy in its reward. SPIN also avoids the implementation of a reward model but, even if the results were promising, we should note that it needs several iterations.

On the other hand, ORPO relies only on the base model, as the preference alignment is performed during SFT, leading to more efficient computation and resource savings.

AI Feedback: Yes or No?

Using AI feedback can be beneficial, especially if we want to obtain results quickly and fairly accurately. Such is the case of RLAIF, which directly uses an LLM as a labeler, or SPIN which during its iterations generates its own preference data until it is unable to discern between the ground truth and the generation.

Most of the experiments reported in the original papers use datasets that have been previously annotated and curated by humans. However, there are also others like WebGPT or UltraFeedback, that were generated using AI Feedback, and even in those cases, the results were significant. As mentioned in the KTO paper, both human and artificial feedback can be quite noisy and contradictory.

In this context, using AI feedback to generate the necessary preference datasets is now a widely accepted practice, offering substantial resource savings. However, it is advisable to ensure that such processes are consistently overseen by a human to maintain quality and reliability.

Conclusions

As we could observe, there isn't a single method that addresses all aspects effectively. Each approach has its advantages and disadvantages, and their effectiveness varies across different scenarios. Thus, the suitability of a method depends on the specific requirements at hand. This also shows that there are still aspects left to investigate.

*Overview table.*
	SFT	RM+PPO	DPO	CoH	RLAIF	SPIN	IPO	KTO	ORPO	DOVE	SimPO
Data Format	prompt+chosen	prompt+chosen+ rejected -> prompt+response+reward	prompt+chosen+rejected	prompt_chosen_rejected	prompt+chosen+rejected	prompt+chosen -> prompt+chosen+rejected	prompt+chosen+rejected	prompt_response+label	prompt+chosen+rejected	chosen(prompt_response)+rejected(prompt_response)	prompt+chosen+rejected
Data Reqs (paper | other)	10K	10K	170K | 12K	300K	125K	50K | 1.8K	- | 12K	600K | 12K	200K | 7K	12K	200K
Compute	medium	high	medium	medium	high	high	medium	medium | low	low	medium	low
Implementation	TRL: SFT Trainer	TRL: Reward Trainer and PPO Trainer	TRL: DPO Trainer	Official Repository	Official Repository	Official Repository	TRL: DPO Trainer for IPO	TRL: KTO Trainer	TRL: ORPO Trainer	Official Repository	Official Repository
AI Feedback	no	no	no	no	yes	yes	no	no	no	no	no
Stages	instruction	preference	preference	preference	instruction+preference	preference	preference	preference | combined	combined	preference	preference
Paper	Training language models to follow instructions with human feedback	Training language models to follow instructions with human feedback	Direct Preference Optimization: Your Language Model is Secretly a Reward Model	Chain of Hindsight aligns Language Models with Feedback	A Critical Evaluation of AI Feedback for Aligning Large Language Models	Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models	A General Theoretical Paradigm to Understand Learning from Human Preferences	KTO: Model Alignment as Prospect Theoretic Optimization	ORPO: Monolithic Preference Optimization without Reference Model	Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization	SimPO: Simple Preference Optimization with a Reference-Free Reward

Want to know more?

This is the ninth entry of a series of blog posts dedicated to alternatives to RLHF. The first, second, third, fourth, fifth, sixth, seventh, eighth, tenth, and eleventh posts of the series can be found on our website too.

The Argilla and Mantis NLP teams are happy to help with any questions you may have about the preparation steps for training an LLM using Supervised Fine-tuning, Reinforcement Learning, or Direct Preference Optimization.

All the data curation steps are currently supported by Argilla’s Data Platform for LLM, and from Mantis NLP we offer end-to-end support for the whole process.