Introducing the new Argilla SDK for simplifying diverse feedback projects
June 5, 2024
This post introduces Argilla’s new experimental Python SDK, a toolkit for ML/AI/NLP engineers to define human feedback tasks and collect feedback data. The new SDK simplifies interactions with the Argilla server, focusing on streamlined approaches to managing your datasets, and aims to improve user experience and functionality. In turn, it should help you create high-quality datasets more efficiently.
Over the last 12 months, Argilla has been adapting its platform away from task-centric datasets that focused on natural language processing tasks like text classification or named entity recognition, toward extensible datasets targeting complex or multiple tasks. In turn, Argilla’s V1 SDK grew to support both dataset paradigms.
Recently, we asked the community for their opinion on the SDK’s development. We spoke to users from different backgrounds, from those deploying Argilla in production to those learning Argilla and human-centric ML for the first time. We also spoke with users tackling everything from classical NLP tasks like classification to LLM projects like Direct Preference Optimization datasets. From those discussions, we’ve learnt two main things about the SDK:
- They loved and needed the switch to extensible tasks through the `FeedbackDataset`.
- The growing functionality was becoming difficult to learn and stay up to date with.
Focusing on extensible datasets
Therefore, we decided that Argilla’s V2 SDK will focus purely on extensible datasets. This allows us to build a simpler, cleaner SDK that focuses on core interactions with the Argilla server and extensible datasets. Moreover, this focus makes the SDK easier to learn and use.
In this post, we will give an overview of the new SDK’s core changes and the motivation behind them. The core tasks in the SDK are:
- Connecting to the Argilla server
- Defining feedback tasks through fields and questions
- Adding and updating records on the server
- Querying and collecting records for downstream tasks
Connecting to the Argilla server
The SDK’s most notable change is in its first line. We’ve moved away from the `init` method that defined a global variable; instead, we use a client object:
```python
import argilla as rg

client = rg.Argilla(
    api_url="https://argilla.example.com",
    api_key="my_token",
)
```
The client accepts configuration parameters or collects environment variables and is used to edit resources like a dataset on the Argilla server. You can have multiple clients or redefine them as you go. There’s no smoke and mirrors.
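For instance, here is a minimal sketch of both styles. It assumes the client reads the standard `ARGILLA_API_URL` and `ARGILLA_API_KEY` environment variables when no arguments are passed:

```python
import os

import argilla as rg

# Assumption: the client falls back to these environment variables
# when no explicit arguments are given.
os.environ["ARGILLA_API_URL"] = "https://argilla.example.com"
os.environ["ARGILLA_API_KEY"] = "my_token"

client = rg.Argilla()  # configured from the environment

# Clients are plain objects, so several can coexist side by side.
staging_client = rg.Argilla(
    api_url="https://staging.argilla.example.com",  # hypothetical second server
    api_key="staging_token",
)
```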
The client also gives you access to the resources that are already on your Argilla server:
```python
my_dataset = client.datasets("my_dataset")
my_workspace = client.workspaces("my_workspace")
my_user = client.users("my_user")
```
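Creating new resources follows the same object-centric pattern. As a rough sketch, assuming `Workspace` and `User` expose the same `create()` lifecycle as datasets (all names here are hypothetical):

```python
# Sketch: setting up a workspace and a user for a feedback team.
# Assumes Workspace and User follow the same create() lifecycle as Dataset.
workspace = rg.Workspace(name="feedback_team")
workspace.create()

user = rg.User(
    username="annotator_1",
    password="a-secure-password",  # hypothetical credentials
)
user.create()
```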
For more information on how feedback teams can be managed, check out the in-depth guide here.
🤔 Why are we doing this? The `init` method saved time, and lines of code, but the global state it defined wasn’t a transparent object containing information about the server. We see the `client` object as a gateway to your Argilla server.
Defining feedback tasks with settings
Since we’re focusing on extensible datasets, we will rename the `FeedbackDataset` to just `Dataset`. But the most significant change in the SDK is the `Settings` class. This class works in tandem with a `Dataset` to define all of the fields, questions, vectors, and metadata for a dataset. You can see `Settings` as your feedback task’s blueprint.
```python
settings = rg.Settings(
    fields=[
        rg.TextField(name="text")
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_1", "label_2", "label_3"]
        )
    ],
    metadata=[
        rg.TermsMetadataProperty(
            name="metadata",
            options=["option_1", "option_2", "option_3"]
        )
    ],
    vectors=[
        rg.VectorField(name="vector", dimensions=10)
    ],
    guidelines="guidelines",
    allow_extra_metadata=True,
)
```
`Settings` keeps everything in one place so you can refine and maintain your dataset configuration, or reuse it across multiple datasets.
```python
dataset = rg.Dataset(
    name="name",
    settings=settings,
)
```
When you’re ready to share your dataset in the Argilla UI, just call `dataset.create()`.
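Because `Settings` is a standalone object, the same blueprint can back more than one dataset. A quick sketch, using hypothetical dataset names:

```python
# One Settings blueprint reused across related datasets
train_dataset = rg.Dataset(name="reviews_train", settings=settings)
eval_dataset = rg.Dataset(name="reviews_eval", settings=settings)

# Publish both to the Argilla UI
train_dataset.create()
eval_dataset.create()
```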
🤔 Why are we doing this? Defining a feedback task is crucial to getting high-quality data. The `Settings` object allows your code to focus on that task and keeps it separate from the dataset’s lifecycle.
Adding and updating records on the server
Adding records to an Argilla server has been refined to work with existing, generic data structures. Argilla still uses an `rg.Record` class for records in a dataset.
The refactored SDK uses a familiar `rg.Record` class to represent records. The class has attributes and parameters for `fields`, `suggestions`, `responses`, `metadata`, and `vectors`. These instantiated records can be passed directly to the revived `log` method 🎉.
```python
records = [
    rg.Record(
        fields={
            "question": "Do you need oxygen to breathe?",
            "answer": "Yes"
        },
    ),
    rg.Record(
        fields={
            "question": "What is the boiling point of water?",
            "answer": "100 degrees Celsius"
        },
    ),
]

dataset.records.log(records)
```
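Beyond `fields`, a single record can carry the other attributes too. The sketch below assumes the `Settings` blueprint from the previous section (a `text` field, a `label` question, a `metadata` property, and a ten-dimensional `vector`) and that suggestions are passed as `rg.Suggestion` objects:

```python
record = rg.Record(
    fields={"text": "The plot was predictable but the acting was great."},
    metadata={"metadata": "option_1"},  # must match a metadata property name
    vectors={"vector": [0.1] * 10},  # must match the vector name and dimensions
    suggestions=[
        # A model-generated pre-annotation for the 'label' question
        rg.Suggestion(question_name="label", value="label_1")
    ],
)

dataset.records.log([record])
```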
Moreover, the new SDK will parse your data structures and instantiate records for you. This means that you can add records as a list of dictionary-like structures:
```python
# ! Here we are defining dictionaries to illustrate the data structure
dataset.records.log(
    records=[
        {
            "question": "What is the capital of France?",  # 'question' matches the `rg.TextField` name
            "answer": "Paris",  # 'answer' matches the `rg.TextQuestion` name
        },
        {
            "question": "What is the capital of Germany?",
            "answer": "Berlin",
        },
    ]
)
```
This lets you leave your data in its native structure without having to define ingestion flows. In fact, the `log` method accepts Hugging Face datasets too, so you can pass one straight in:
```python
from datasets import load_dataset

ds = load_dataset("imdb")
dataset.records.log(ds)
```
To make the process even simpler, the `log` method takes a mapping from your keys to Argilla attributes like fields, questions, metadata, and vectors.
```python
from datasets import load_dataset

# The IMDB dataset's column name is 'label'
ds = load_dataset("imdb")

# The Argilla dataset's question name is 'sentiment'
dataset.records.log(ds, mapping={"label": "sentiment"})
```
If you want to update existing records, you can also use the `log` method with the same attributes.
```python
# Update records in a dataset
dataset.records.log(
    records=[
        {
            "id": "1",  # matches the id used when the record was added
            "answer": "Paris",
        }
    ]
)
```
🤔 Why are we doing this? Data sources come in various states of quality and representation, whilst the feedback task should be presented in a form that UI users understand. By abstracting data ingestion, you can leave your data in its native form and focus on naming that suits the UI user.
Querying and collecting records for downstream tasks
To collect records from Argilla with queries, you can use the `Filter` class to define the conditions and pass them to the `Dataset.records` attribute to fetch records based on those conditions. Conditions use operators like `==`, `>=`, `<=`, or `in`, and can be combined with dot notation to filter records based on metadata, suggestions, or responses. Queries can also be combined with search terms.
```python
# Create a range filter from 10 to 20
range_filter = rg.Filter(
    [
        ("metadata.count", ">=", 10),
        ("metadata.count", "<=", 20)
    ]
)

# Query records with a metadata count between 10 and 20 that match 'paris'
query = rg.Query(filter=range_filter, query="paris")

# Iterate over the results
for record in dataset.records(query=query):
    print(record)
```
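From there, collecting the filtered records for a downstream task can be as simple as iterating and keeping plain Python structures. A sketch, assuming fields and metadata are addressable by name on each record:

```python
# Collect the queried records into a list of plain dictionaries,
# ready for a downstream training or analysis step.
rows = [
    {
        "text": record.fields["text"],  # assumes a field named 'text'
        "count": record.metadata["count"],  # the metadata used in the filter
    }
    for record in dataset.records(query=query)
]
```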
The next steps
Over the next few weeks, we will roll out Argilla 2.0 and the new SDK shown here. Currently, the SDK is still experimental, but you’re welcome to try it out via the instructions below. We’re open to your feedback, issues, contributions, and opinions.
The experimental SDK is developed in this repository. Check out the README or documentation to install it.
What does this mean for my current projects?
Are you using extensible datasets like `FeedbackDataset`?
This will not impact your Argilla server. You can use the new SDK with Argilla server versions after `1.27`. You may choose to update the client code, and you will find the changes to be minimal.
Are you using legacy task-specific datasets like `TextClassificationDataset`?
Task-specific datasets will not be supported in the Argilla server or client after version `1.29` (the current version). We will maintain security and bug fixes for this version, but if you require legacy datasets, you should remain on this version.