RLHF and alternatives: DPO and CoH

RLHF and alternatives: DPO and CoH

August 30, 2023

Argilla, MantisNLP


This is a series of blog posts related to alternatives to Reinforcement Learning by Human Feedback, created as a joint effort between Argilla and MantisNLP teams. Make sure you read the first and second posts of series before proceeding to this part.

At the end of our first blog post, we analyzed the efforts required to carry out Supervised Fine-tuning on a pretrained LLM and how important the Instruction data is for this step.

On our second blog post, we described the process of Reinforcement Learning, which requires high-quality Preference Data.

The process may seem a little complex, and very data-greedy. Maybe now you understand why it has been depicted as a Shoggoth monster with a cherry on top. Courtesy of twitter.com/anthrupad

Shoggoth with Smiley Face

However, we start to see some light at the end of the tunnel, as the community starts working on simplifications of the process. Let’s talk about two of the alternatives to RLHF.

Direct Preference Optimization (DPO)

In Direct Preference Optimization: Your Language Model is Secretly a Reward Model[1] the authors introduce a method called Direct Preference Optimization (DPO) for achieving precise control over Large-scale unsupervised Language Models (LLMs). As mentioned above, existing methods rely on Reinforcement Learning from Human Feedback (RLHF) are based on training a Reward Model and using Proximal Policy Optimization (PPO) (as described in our second post), to align the LM’s output with human preferences. But this approach is very complex and unstable.

DPO solves the constrained reward maximization problem by treating it as a classification problem on human preference data. The algorithm is stable, efficient, and computationally lightweight, eliminating the need for reward model fitting, LM sampling, and extensive hyperparameter tuning.

Unlike existing methods that train a Reward Model and then optimizes a Policy based on that RM (Reinforcement Learning by Human Feedback using PPO — left of the figure), DPO directly defines the preference loss as a function of the Policy (right). In short, it removes the requirement of training first a Reward Model.

Direct Preference Optimization

The DPO pipeline has two stages:

  1. Run supervised fine-tuning (SFT) on the dataset(s) of interest.
  2. Run preference learning on the model from step 1, using Preference Data (ideally from the same distribution as the SFT examples)

As you see, Preference Data is still required to train the Policy. But it gets rid of the complexity of training a Reward Model first to then optimize a LM Policy to then fine-tune a LLM.

How is that achieved? The idea is using the very same LLM as a Reward Model. During fine-tuning time,DPO optimizes the policy using a binary cross-entropy objective. It uses the human preference data to determine what responses are preferred and what responses are not. By comparing the model's responses to the preferred ones, it can adjust the policy to improve its performance.

Chain of Hindsight

In the Chain of Hindsight Aligns Language Models with Feedback [5] paper another interesting idea is presented, again using the LLM as a means to optimize the Policy. The key idea this time is that humans are capable of learning from rich and detailed feedback in the form of comparisons.

In this scenario, we use the Preference Data to complement the input. The authors use an example with 2 answers in the PD dataset, one is helpful, another is not. The helpful one is added following “A helpful answer:”, and the unhelpful answer too in a similar way, both after the input.

On the left, preference data. On the right, the concatenation of the input, the helpful answer and the unhelpful.

During training, the model is presented with feedback consisting of the input and the contrastive data (helpful vs unhelpful, good vs bad, etc) as we expressed above. The model is conditioned to predict outputs that better match the good alternative. Loss is applied on the corresponding outputs “{a good answer}” and “{a bad answer}”. At inference time, positive feedback “Good:” to instruct the model to generate the desired outputs.

In conclusion, by applying SFT and PPO on the Human Preference Data in the form of contrastive information, with Chain of Hindsight we can leverage the power of LLM to optimize the Policy without needing to train an external Reward Model.

Direct Preference Optimization: Hands-on.

From the two techniques explained above, we are going to showcase the easy steps of applying Direct Preference optimization to fine-tune a LLM.

In the following link, you will be able to find the new DPO Trainer module, inside the TRL (Transformer Reinforcement Learning) library.

The steps to use the DPO Trainer instead RLHF described in the documentation are the following:

  1. About SFT, they indeed mention that it’s the first requirement, as we commented before(...The first step as always is to train your SFT model, to ensure the data we train on is in-distribution for the DPO algorithm). This is also very important since the data in SFT should be in-distribution, to be able to run DPO (or any other technique).
  2. After, they describe the preference data format. They use Anthropic/hh-rlhf’s format, which uses, as mentioned above, a binary classification of accepted-rejected:

Anthropic’s hh-rlhf dataset, renderer in Hugging Face Dataset viewer

The data and consists of three fields:

  1. prompt, containing the context inputs.
  2. chosen, containing the corresponding chosen responses
  3. rejected, containing the rejected answers.

Fields required by DPO

Finally, they describe with a few lines of code, how you can configure a DPOTrainer class and run the train. Here is what you will need:

  1. model, the fine-tuned version of your model (the result from SFT);
  2. model_ref, the non-fine-tuned version of the model that's being fine-tuned. Usually it’s the original checkpoint you used before SFT.
  3. training_args, same TrainerArguments class object present in transformers library, containing a list of training parameters such as per_device_train_batch_size, max_steps, gradient_accumulation_steps, learning_rate, evaluation_strategy, output_dir, etc.
  4. beta, temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5.
  5. train_dataset, the preference data in the format described above inside a Dataset class object.
  6. tokenizer, the tokenizer instantiated (usually with Autotokenizer) from the model.

Syntax of the DPOTrainer in Hugging Face’s trl

With that, the only command still to run is dpo_trainer.train() and you are ready to go!


fine-tuning a large language model (LLM) for generating text using specific guidelines can be highly useful in various applications. By providing specific guidelines, we can shape the LLM's output to align with our desired objectives, ensuring more accurate and controlled text generation.

This process has been historically very complex, requiring to train first a Reward Model to then optimize the Policy of the LLM.

However, recent approaches, such as Direct Policy Optimization (DPO) and Chain of Hindsight (CoH) have shown how you can use Large Language as Reward Models.

More specifically, we have shown:

  1. How to use Argilla to create the data for the Supervised fine-tuning step and also the Preference Data
  2. How to use Hugging Face’s trl library, more specifically DPOTrainer, to carry out Policy Optimization with the preference data.

Want to know more?

This is the third entry of a series of blog posts dedicated to alternatives to RLHF. The first and second posts of the series can be found on our website too. See the next part of this blog series here.

Argilla and Mantis NLP teams are happy to help with any question you may have about preparation steps for training a LLM using Supervised fine-tuning, Reinforcement Learning or Direct Preference Optimization.

All the data curation steps are currently supported by Argilla’s Data Platform for LLM, and from Mantis NLP we offer end-to-end support of the whole process.