"Large Language Models and...Ghosting?"
Worry not, I am obviously not talking about ghosting, the act of “abruptly ending communication with someone without explanation”. If LLMs were super sentient, they would have some semblance of courtesy (a debatable topic, I agree, but let’s shelve that for now).
What I am talking about is something quite exciting for all of us interested in the GenAI space, something that could make fine-tuning LLMs even more effective.
And that’s GAtt - the Ghost Attention technique!
With the recent news of Meta releasing LLaMa 2, the next iteration of mama LLaMA, and its availability on HuggingFace, comes the excitement of finally having access to a model that is “as good as” ChatGPT - which means the wall around closed-source models is coming down brick by brick. As someone who believes in the total democratization of anything-tech, and a proud leecher, I couldn’t be more excited!
What’s more - LLaMa 2 is now available on Azure as well! Huh - guess it makes sense to finally get that certification (for legal purposes, this is a joke)
But coming back to GAtt - what is Ghost Attention and how does it work?
The attention mechanism in Transformers is the crux of Large Language Models. By doing away with the recurrence and convolutions that earlier sequence models had heavily leaned on, the transformer architecture effectively revolutionized how attention was used. An attention function essentially maps a query and a set of key-value pairs to an output. In the context of neural machine translation, the queries, keys, and values fed into these attention functions are different projections of the same input sentence.
Intuitively, transformers employ “self-attention”: the mechanism identifies relationships between the different components (words, in this case) of the same sentence.
If you’re familiar with “Attention Is All You Need” by Vaswani et al., then you know that the attention mechanism generally used is scaled dot-product attention, shown below:
Source: Attention is all you need
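To make the figure concrete: scaled dot-product attention computes Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. Here’s a minimal NumPy sketch of that formula - a toy illustration, not production code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

# Toy example: 3 query positions attending over 4 key/value positions
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8)
```

The √d_k scaling keeps the dot products from growing large with dimension, which would otherwise push the softmax into regions with vanishing gradients.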
Vaswani et al. also propose a multi-head attention mechanism, built on top of the single attention function that takes Q, K, and V matrices as input. The goal of multi-head attention, as shown in the image below, is to let the attention function extract information from several representation subspaces, which a single attention head cannot do:
Source: Attention is all you need
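The “several representation subspaces” idea can be sketched in a few lines: project the input, split the projections into heads, run scaled dot-product attention per head, then concatenate and project back. This is a simplified NumPy sketch (single self-attention input, no masking or biases), not the paper’s full implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (seq, d_model); each W: (d_model, d_model)."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Split each projection into n_heads subspaces: (n_heads, seq, d_head)
    split = lambda M: M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Scaled dot-product attention, computed independently per head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                       # (n_heads, seq, d_head)

    # Concatenate the heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o

rng = np.random.default_rng(1)
d_model, seq, n_heads = 16, 5, 4
X = rng.normal(size=(seq, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *W, n_heads=n_heads)
print(out.shape)  # (5, 16)
```

Because each head works in its own d_head-dimensional subspace, different heads can learn to attend to different kinds of relationships (syntax, coreference, position, etc.).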
While quite far from ideal, transformers are our best bet for contextualization. With the release of LLaMa 2, one could say we’re a step closer.
In “Llama 2: Open Foundation and Fine-Tuned Chat Models”, LLaMa 2-CHAT employs a new technique, Ghost Attention (GAtt), which “helps control dialogue flow over multiple turns”.
Consider a dialogue setup, where a specific constraint (or “instruction”) is set to be followed for each discussion turn. For example, we could ask the chatbot to “act like Shakespeare” or “speak only in Italian” or “respond in emojis”. When such instructions are provided, the subsequent response should always adhere to the constraint. But some of their models tend to forget the instruction after a few dialogue turns.
This issue with multi-turn memory can be fixed with Ghost Attention (GAtt).
GAtt is inspired by Context Distillation - a process where the context is distilled into the model during fine-tuning, which in turn helps the attention stay focused across a multi-stage process, giving us dialogue control over multiple turns. In other words, the constraint, or “instruction”, is maintained throughout the conversation.
In simpler terms, it turns out just like that one parent who always talks to you with suspicion no matter what because you got into trouble that one time a long long time ago.
The way GAtt works is that a constraint, say “inst”, is defined so that it is maintained throughout the entirety of a dialogue. An example of an “inst” could be “speak in” or “act as”. It is then synthetically concatenated to all the user messages of a multi-turn dialogue dataset between two entities. We now have a context-dialogue and a sample to fine-tune the model with.
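The data-construction step above can be sketched in a few lines. Note the second step - keeping the instruction only in the first turn of the actual training sample so the model must remember it afterwards - is my reading of the LLaMa 2 paper, and the dialogue format and function names here are illustrative assumptions, not the authors’ code:

```python
def build_gatt_sample(inst, dialogue):
    """Sketch of GAtt data construction (hypothetical helper).

    inst: a constraint string, e.g. "Always answer as Shakespeare."
    dialogue: list of (user_msg, assistant_msg) turns.
    """
    # Step 1: synthetically concatenate the instruction to every user
    # message, so responses can be generated under the constraint.
    augmented = [(f"{inst} {user}", assistant) for user, assistant in dialogue]

    # Step 2 (assumption, per my reading of the paper): for the fine-tuning
    # sample itself, keep the instruction only in the first turn - the model
    # has to carry the constraint through the later turns on its own.
    training = [augmented[0]] + dialogue[1:]
    return augmented, training

augmented, training = build_gatt_sample(
    "Always answer as Shakespeare.",
    [("Hi there!", "Good morrow to thee!"),
     ("How's the weather?", "The heavens weep not; 'tis a fair day.")],
)
print(training[0][0])  # instruction prepended only to the first user message
print(training[1][0])  # later turns carry no explicit instruction
```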
LLaMa 2-CHAT was used to synthetically generate a list of constraints, and the model was fine-tuned on random combinations of constraints from that list. GAtt was observed to stay consistent for 20+ turns, until it reached the maximum context length.
The usage of GAtt is very much in its early stages, but I see it going quite a long way. Future research and iterations on this method will likely help the model even more.
The rise of Large Language Models and their groundbreaking attention mechanisms has ushered in a new era of natural language processing, revolutionizing how machines understand and generate human language. With ongoing research and development, we can expect LLMs to continue evolving and becoming more sophisticated, leading to even more impressive results. Moreover, LLMs have the capacity to become an integral part of our daily lives. Enhanced pre-training techniques, larger and more diverse datasets, and refined model architectures will foster LLMs' ability to grasp nuances, contextual cues, and rare language patterns, leading to a deeper understanding of language and cultural intricacies. Nevertheless, it is imperative to approach these technologies with ethical considerations at the forefront to ensure a responsible and inclusive future for LLMs and the transformative impact they have on our world.