Accelerating Red Cross aid for Ukrainian refugees using topic modelling

Impressive language models such as GPT are becoming increasingly capable. Can these so-called Large Language Models (LLMs) also make a difference in the humanitarian field? In this blog, Pippelaar Nina van Diermen looks at how BERTopic can support the Red Cross in processing Telegram messages. What does BERTopic do, how does it work and what can it offer the Red Cross?

Nina van Diermen
Data & AI Consultant
Social Media Listening

Over the past two years, the Red Cross has been monitoring public Telegram channels that Ukrainian refugees use to share information with each other. On these channels, they discuss all kinds of issues they encounter as refugees. Based on these messages, local Red Cross organisations can be assigned specific tasks that respond better to the refugees' needs. These needs are difficult to identify from large amounts of unorganised Telegram messages, so the Red Cross categorises the messages under headings such as shelter or health. Devising a suitable categorisation scheme took a lot of time, however, because the messages had to be sifted through repeatedly. That time could be put to other uses within an organisation such as the Red Cross, which would therefore benefit greatly if this effort could be reduced in the event of a new disaster.

BERTopic

When processing Telegram messages, Topic Modelling has proven to be a time-saving alternative that does not require users to define labels themselves. This method clusters the messages and then provides a concise description of what is being discussed in each cluster, also known as a topic. BERTopic is a Topic Modelling method that utilises the power of LLMs. Broadly speaking, BERTopic consists of four steps: creating embeddings, reducing their dimensionality, clustering, and forming a representation for each topic. These steps are illustrated below.

BERTopic consists of 4 steps

The LLM is used in the first step: creating the embeddings. The embeddings are numerical representations of the messages, which contain the semantic content. The power of an LLM in creating such embeddings is due, on the one hand, to its Transformer structure. Thanks to this structure, the model can understand the meaning of words in their context, even if the words are far apart. On the other hand, this power comes from the large amounts of text data on which the model is pre-trained. During this pre-training, the model acquires a great deal of knowledge about language. The model can then be fine-tuned for more specific tasks. For BERTopic, an LLM is recommended that is fine-tuned to predict semantic similarity between texts.
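
To make "semantic similarity between embeddings" a little more concrete, here is a minimal sketch using only the Python standard library. The three messages and their 4-dimensional vectors are made up for illustration; a real embedding model would produce vectors with hundreds of dimensions. Cosine similarity is one common way to compare embeddings, and it is an assumption here, not something the Red Cross project is confirmed to use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-written for this example (not model output).
embeddings = {
    "Where can I find a doctor?": [0.9, 0.1, 0.0, 0.2],
    "Is there a clinic nearby?": [0.8, 0.2, 0.1, 0.3],
    "Looking for a room to rent": [0.1, 0.9, 0.3, 0.0],
}

doctor, clinic, room = embeddings.values()
sim_health = cosine_similarity(doctor, clinic)  # two health-related messages
sim_mixed = cosine_similarity(doctor, room)     # health vs. housing

print(f"doctor vs clinic: {sim_health:.2f}")
print(f"doctor vs room:   {sim_mixed:.2f}")
```

In this toy setup the two health-related messages score much higher than the health and housing pair, which is the property the later clustering step relies on.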

Ultimately, the numerical representations of the messages are clustered. In general, however, embeddings contain a large number of features. Due to “the curse of dimensionality”, this can be problematic when clustering. In data with many features, the distances between points become more and more alike, so the difference between “near” and “far” loses its meaning, and this is precisely what clustering relies on. To prevent this, the dimensionality of the embeddings is reduced before clustering is applied.
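
The effect can be illustrated with a small experiment, sketched below under simple assumptions: points are drawn uniformly at random from a unit cube (real embeddings are not uniform), and "relative contrast" is a name chosen here for the gap between the farthest and nearest point, divided by the nearest distance. The function name and parameters are hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point
    to n_points random points, all drawn uniformly from [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

for dim in (2, 10, 100, 1000):
    print(f"{dim:>5} dimensions: relative contrast = {relative_contrast(dim):.2f}")
```

As the number of dimensions grows, the printed contrast shrinks: the nearest and farthest neighbours end up at almost the same distance, which leaves a distance-based clustering algorithm with little to work with.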

After clustering, you have a grouping of Telegram messages based on semantic similarity. The messages within a cluster therefore discuss more or less the same thing: a specific topic. To describe each topic, a representation is formed for each cluster, consisting of words that occur frequently in that specific cluster. In this way, no predefined categorisation scheme is needed: BERTopic groups the data and produces a kind of label itself.
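
As a rough sketch of how such a representation could be computed: to my understanding, BERTopic by default scores words with a class-based variant of TF-IDF (c-TF-IDF), which treats all messages in a cluster as one document and weights each word's frequency by log(1 + A / f), where A is the average number of words per cluster and f is the word's frequency across all clusters. The code below is a simplified standard-library version of that idea, not BERTopic's own implementation. The example messages, the stop-word list and the function name are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"a", "for", "is", "the", "my", "where"}

def ctfidf_top_words(clusters, top_n=2):
    """Return the top_n highest-scoring words for each cluster.

    clusters maps a cluster id to a list of messages; all messages in a
    cluster are treated as one long document.
    """
    # Word counts per cluster, using naive lowercase whitespace tokenisation.
    counts = {
        cid: Counter(
            word
            for msg in msgs
            for word in msg.lower().split()
            if word not in STOP_WORDS
        )
        for cid, msgs in clusters.items()
    }
    # Frequency of each word across all clusters.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    avg_words = sum(total.values()) / len(counts)  # average words per cluster

    top_words = {}
    for cid, counter in counts.items():
        scores = {
            word: tf * math.log(1 + avg_words / total[word])
            for word, tf in counter.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top_words[cid] = ranked[:top_n]
    return top_words

# Hypothetical, already-clustered messages (not Red Cross data).
clusters = {
    0: ["help need a doctor for my child",
        "doctor appointment without insurance",
        "help where is a pharmacy"],
    1: ["help looking for a room",
        "room for a family needed",
        "help is the shelter full"],
}
topics = ctfidf_top_words(clusters)
print(topics)
```

In this toy data, "help" occurs just as often as "doctor" in the first cluster, but it ranks lower because it is equally common in the second cluster. That weighting is what lets the top words act as a kind of label for each cluster.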

Application

Finally, BERTopic was also applied to the available Telegram data from the Red Cross to see if it could offer a solution in future situations. Overall, the results are very promising. The following topics were derived from the data:

BERTopic topics based on the available Telegram data of the Red Cross

Although the representations are not always equally informative, most topics seem to separate the messages well. In the case of Ukrainian refugees, the current classification model remains useful for the Red Cross, given its explicit focus on what is important to them. Nevertheless, BERTopic does show what is possible in the event of a new disaster. As an alternative to the standard topic representations, more advanced methods can be chosen to create labels or even entire summaries of the topics. In this way, an overview of the social media discussion about the disaster can be obtained quickly. Although the quality of the grouping is lower, BERTopic saves a lot of time in defining the categorisation scheme. A good balance may be found by iteratively combining your own labels with the results of Topic Modelling.

 
