It's Veganuary! Named Entity Recognition with Rubrix

Hi everyone! This is Leire, and I am here to talk about two very cool topics: Veganuary and Named Entity Recognition (NER).

In this article, we will learn more about NER and how to handle this task with Rubrix. And since January has just ended, we have chosen a very fitting and interesting topic: Veganuary!

But first of all, I must thank David for his help and contributions, especially with the Hugging Face model. Check out his GitHub profile!

Introduction...

Before getting started, it would be great to explain a little bit about Veganuary and NER:

Named Entity Recognition is a Natural Language Processing task that identifies named entities in a text sequence and classifies them into groups such as names, locations, events, currencies, and many more. If you want to learn more about NER and token classification in general, you can read this very interesting article by the folks at Hugging Face.
Veganuary is a fairly well-known challenge (especially in English-speaking countries) that, according to Wikipedia, "promotes and educates about veganism by encouraging people to follow a vegan lifestyle for the month of January".

...and objective

People post stuff on their social networks all the time. In January (or should I say Veganuary?), people following the challenge usually post the vegan food they eat, along with a caption. So we thought, "we could explore Twitter, see what people say about Veganuary meals and vegan food trends, and turn it into a NER task!"

No sooner said than done! For our rigorous analysis 😉, we planned to extract all food-related mentions from about 10k tweets that people posted during Veganuary, and simply have a look at them. So in this article, we will walk you through our little experiment and present the (slightly surprising?) results we obtained 🤩.

The process

To do this task, four steps are necessary:

Retrieve tweets from Twitter
Preprocess and tokenize tweets
Annotate tweets with Rubrix
Train a Hugging Face model and make predictions

1. Retrieve tweets from Twitter

With a maximum length of 280 characters, tweets seem to be convenient to process. They are also easier to annotate than lengthy documents. We used searchtweets to retrieve all tweets between the 11th and 18th of Veganuary that contained one of the following hashtags:

#veganfood, #veganideas, #veganrecipes, #veganuary, #veganuary2022

To make our life a bit easier 😌, we discarded all retweets and replies. This left us with around 7900 tweets that we used for our little experiment.
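One way to express such a search is as a single query string. The sketch below only builds that string and does not call any API. It assumes the Twitter API v2 search operators `-is:retweet` and `-is:reply`, and the helper name `build_query` is made up for illustration; it is not part of searchtweets, and the filtering could just as well be done after downloading the tweets.

```python
# Hashtags taken from the list above.
HASHTAGS = ["#veganfood", "#veganideas", "#veganrecipes", "#veganuary", "#veganuary2022"]


def build_query(hashtags, exclude_retweets=True, exclude_replies=True):
    """Combine hashtags with OR and optionally exclude retweets and replies."""
    query = "(" + " OR ".join(hashtags) + ")"
    if exclude_retweets:
        query += " -is:retweet"  # assumed Twitter API v2 operator
    if exclude_replies:
        query += " -is:reply"  # assumed Twitter API v2 operator
    return query


print(build_query(HASHTAGS))
```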

Retrieved tweet

2. Preprocess and tokenize tweets

After we obtained the necessary data, Rubrix came in.

We preprocessed the tweets by replacing URLs, users, and emojis so that the model could focus on the actual text of the tweet. For this, we used the awesome and practical pysentimiento library.
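To give a rough idea of what this kind of preprocessing does, here is a deliberately simplified stand-in based on regular expressions. It is not how pysentimiento is implemented: the function name `simple_preprocess` and the placeholder tokens `URL` and `@USER` are our own assumptions, and emoji handling is left out entirely.

```python
import re

# Simplified patterns; real tweets contain many more edge cases.
URL_PATTERN = re.compile(r"https?://\S+")
USER_PATTERN = re.compile(r"@\w+")


def simple_preprocess(text, url_token="URL", user_token="@USER"):
    """Replace URLs and user mentions with fixed placeholder tokens."""
    text = URL_PATTERN.sub(url_token, text)
    text = USER_PATTERN.sub(user_token, text)
    return text


print(simple_preprocess("Thanks @leire! Recipe: https://example.com/tofu"))
```

Replacing these elements with fixed tokens keeps the sentence structure intact while removing strings the model cannot learn much from.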

As we wanted to perform a token classification task (NER is one of them), we tokenized the tweets with the amazing spaCy library. SpaCy ships with lots of pretrained pipelines that support lots of languages and tasks, such as part-of-speech tagging, text classification, or NER.

Finally, we created Rubrix records from the text and the tokens.

import pandas as pd
import rubrix as rb
import spacy
from pysentimiento.preprocessing import preprocess_tweet

# read in the tweets
tweets = pd.read_json('tweets.json')

# we use spaCy to tokenize our data
nlp = spacy.load("en_core_web_sm")

# iterate over tweets, and save them as Rubrix records
records = []
for _, tweet in tweets.iterrows():
    # preprocess tweets (substitute urls, users, emojis)
    text = preprocess_tweet(tweet["text"], lang="en")
    # tokenize the text
    tokens = [token.text for token in nlp(text)]
    # create Rubrix record, and add it to the list
    record = rb.TokenClassificationRecord(text=text, tokens=tokens)
    records.append(record)

After creating the records for our 7899 tweets in a matter of minutes, we are ready to log them to Rubrix:

# log records to a dataset called "veganuary"
rb.log(records=records, name="veganuary")

Et voilà! Plenty of preprocessed and tokenized tweets were uploaded to Rubrix, ready to be annotated! 🍾📝

3. Annotate tweets

With the help of Rubrix, we manually annotated 500 out of 7899 records (actually, 501 🤪) within a couple of hours. To do so, we created the label "FOOD" and then annotated every food entity we found in the records.

This is a very good example:

Screenshot of the annotation

We obtained 501 annotations with interesting results (see below). This not only gave us a better insight into the data, but it was also necessary for training a model to make predictions for the rest of the tweets.

One downside worth pointing out: we dealt with a lot of spam and advertising. Some records looked like spam or ads, which may have biased the results towards "trendy products".

4. Train a Hugging Face model and make predictions

With all the necessary annotations made, we wanted to train a Hugging Face transformer 🤗. As this is a more complex phase, I will explain it briefly.

First of all, you may be wondering: what is a transformer? Transformers are a specific type of model that is often pretrained and hence requires less data to be trained (or rather, in this case, fine-tuned). The amazing Hugging Face Hub provides thousands of these transformer models, pretrained on different languages and tasks.

For training our veganuary model, we chose the twitter-roberta-base model that was pretrained on almost 60 million tweets 🤯.

But before training the transformer, we needed to load the dataset and our annotations from Rubrix. We also had to transform the annotations into another format known as NER tags: while Rubrix expresses annotations simply as text spans, for training we also need to know which tokens these spans cover. If you want to know more about NER tags, we recommend this short and concise Wikipedia article.

import rubrix as rb
from spacy.training import offsets_to_biluo_tags, biluo_to_iob

# Load only the annotated records from Rubrix
tweets = rb.load('veganuary', query="status:Validated")

# transform text spans to ner tags (BIO format)
def spans_to_tags(row):
    # nlp is the spaCy pipeline loaded in the previous snippet
    doc = nlp(row["text"])
    # Rubrix stores (label, start, end); spaCy expects (start, end, label)
    entities = [(entity[1], entity[2], entity[0]) for entity in row['annotation']]
    biluo_tags = offsets_to_biluo_tags(doc, entities)
    return biluo_to_iob(biluo_tags)

tweets["ner_tags"] = tweets.apply(spans_to_tags, axis=1)
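To make the conversion from text spans to NER tags more tangible, here is a small sketch in plain Python that does not rely on spaCy. It is an illustration rather than the code we ran: the helpers `whitespace_tokens` and `spans_to_bio` are hypothetical, tokens are split on whitespace only, and span end offsets are assumed to be exclusive.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace and keep character offsets: (token, start, end)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(tokens, spans):
    """Turn (label, start, end) character spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for label, span_start, span_end in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # a token belongs to the span if it lies completely inside it
            if tok_start >= span_start and tok_end <= span_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


text = "I love vegan mac and cheese"
tokens = whitespace_tokens(text)
tags = spans_to_bio(tokens, [("FOOD", 13, 27)])
for (token, _, _), tag in zip(tokens, tags):
    print(token, tag)
```

The first token of an entity gets the `B-` prefix and the remaining ones get `I-`, while everything outside an entity is tagged `O`.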

With our NER tags ready, we could finally train our model following this very comprehensive tutorial by the Hugging Face team 🤗.

After the training, we used our model in a Hugging Face pipeline to extract all the food mentions in our veganuary tweets. Are you already curious about which food was mentioned the most? 😎
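As a rough sketch of this last step: a Hugging Face token-classification pipeline with an aggregation strategy returns, for each text, a list of dictionaries with keys such as `entity_group`, `score`, `start` and `end`. The hypothetical helper below turns such a list into plain food mentions. The predictions are written by hand for illustration, and the `min_score` threshold is our own assumption, not something we tuned.

```python
def extract_food_mentions(text, entities, label="FOOD", min_score=0.5):
    """Return the text of every predicted entity with the given label."""
    mentions = []
    for entity in entities:
        if entity["entity_group"] == label and entity["score"] >= min_score:
            # slice the original text so casing and spacing are preserved
            mentions.append(text[entity["start"]:entity["end"]])
    return mentions


text = "Tofu scramble and oat milk for breakfast"
# Hand-written example in the shape of aggregated pipeline output.
entities = [
    {"entity_group": "FOOD", "score": 0.98, "word": "Tofu scramble", "start": 0, "end": 13},
    {"entity_group": "FOOD", "score": 0.31, "word": "oat milk", "start": 18, "end": 26},
]

print(extract_food_mentions(text, entities))
print(extract_food_mentions(text, entities, min_score=0.0))
```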

The dataset: results, findings and thoughts

So, without further ado, here comes the list of the TOP 20 food mentions:

Food	Count
meat	488
cheese	136
dairy	116
milk	104
pizza	84
tofu	83
chocolate	77
Meat	75
rice	65
burger	62
vegetables	57
chicken	56
salad	48
fish	47
garlic	46
soup	45
mushrooms	43
burgers	41
pasta	40
cake	39
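Note that the table counts mentions exactly as they were written, so "meat" and "Meat" appear as separate rows. If you prefer to merge such case variants, a minimal sketch could look like this. The helper `merge_case_variants` is hypothetical, the sample counts are copied from the table above, and singular and plural forms (burger, burgers) are deliberately left untouched.

```python
from collections import Counter


def merge_case_variants(counts):
    """Sum the counts of mentions that differ only in letter case."""
    merged = Counter()
    for food, count in counts.items():
        merged[food.lower()] += count
    return merged


# A few rows copied from the table above.
top = {"meat": 488, "cheese": 136, "Meat": 75, "burger": 62, "burgers": 41}
merged = merge_case_variants(top)
print(merged.most_common(3))
```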

The top entries range from meat 🥩 to chicken 🍗, but the ones that follow are also remarkable. First of all, in both the annotation and the prediction phase, meat was the predominant entity.

Why? Maybe it is because lots of vegan products emulate meat (this is a cultural issue, and Beyond Burgers are realistic and yummy anyway), or maybe there is a kind of activism in these tweets. After all, veganism has to do with politics as well as with food. The same applies to entities like cheese 🧀 or milk 🥛. Other results, such as tofu 🍢, vegetables 🍆 or rice 🍚, seem pretty logical, while pizza 🍕, pasta 🍝 or burger 🍔 might show trends in vegan gastronomy.

Also, in both phases we noticed a "battle": meat and tofu were among the most common entities. So it seems that some people are experimenting and adapting traditional cuisine to a vegan one, whereas others stick to vegan staples (e.g. tofu, rice, mushrooms 🍄 ...).

Of course, there are many more aspects to discuss, but we would like to know what you think! 👀

Summary

With this experiment, we killed two birds with one stone 🐦 🤯: we carried out a super interesting NER task, and we learned a lot about Veganuary and vegan food.

With Rubrix's features, we could annotate lots of records and work with predictions easily and in a short period of time.

If you want to reproduce our results, here is the repo 💻 where you can find all the data and notebooks you need! We also open-sourced the data 💾, and created a simple app ✨, so you can extract food mentions from tweets yourself.

I hope you found this interesting. Named entity recognition has a lot to offer!
