Anomalous Token Fingerprinting: The GPT-4.5 Hypetrain and Reading Between the Words

Introduction

Certain words don't make sense to us; certain words in a certain order, even less so. A curious phenomenon emerged with GPT-3: "anomalous tokens" that cause models to generate strange outputs. Let's explore what these tokens are, how researchers are using them to "fingerprint" models, and what this might mean in the race to unmask the identity of "gpt2-chatbot", which recently appeared on the LMSYS Chatbot Arena.

The Speculative Hype Train

Timeline of Events

What are "anomalous tokens"?

Even without having heard the term before, we can break it down: “anomalous” + “token” = an abnormal word (or part of a word), coupled with fingerprinting, a process of identification through distinct markers. These “distinct markers” are tokens; messaging one to the model and asking it to repeat the token back “…result[s] in a previously undocumented failure mode for GPT-2 and GPT-3 models.” These failure modes show up as erratic and nonsensical replies, repeating back a completely different word, or even getting hostile when asked to repeat the tokens. These anomalous tokens are also referred to as “unspeakable tokens”.

The Running Theory

" … [unspeakable tokens] are words that appeared in the GPT-2 training set frequently enough to be assigned tokens by the GPT-2 tokenizer… but which didn’t appear in the more curated GPT-J and GPT-3 training sets. So the embeddings for these tokens may never have been updated by actual usages of the words during the training of these newer models. This might explain why the models aren’t able to repeat them — they never saw them spoken."

The Research

Using gradient descent, you can tweak a random image to trigger certain responses from a neural network, which lets you see what features it uses to recognize things like goldfish or flamingos. It doesn’t give us the actual objects, but rather the specific details the network looks for, which helps us check whether it’s learning the right things. Turns out, you can use this for text-based prompt generation as well, but with a bit of a twist.
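To make that concrete, here's a minimal sketch of the image version of the idea, assuming a pretrained torchvision classifier: start from random noise and nudge the pixels by gradient descent until the network strongly predicts a chosen class (ImageNet class 1 is "goldfish"). This is just an illustration, not the researchers' code, and the model choice and hyperparameters are arbitrary.

import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
target_class = 1  # ImageNet class index for "goldfish"

# Start from random noise and treat the pixels as learnable parameters
image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logits = model(image)
    loss = -logits[0, target_class]  # maximise the "goldfish" logit
    loss.backward()
    optimizer.step()

# `image` now shows the textures and shapes the network associates with goldfish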

Estimating Embedding Distances and Finding Semantic Relationships

Even though LLM tokens are discrete (unlike the continuous pixel values used for images), they can be mapped to embeddings, giving the previously ethereal and wispy word-ideas the continuous space needed to draw word-meaning relationships between them on a lower-level plane.
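As a rough sketch of what that mapping looks like in practice (my own minimal setup, not the researchers' code): pull GPT-2's embedding matrix via Hugging Face transformers and compare two tokens in that continuous space. The word_embeddings and tokenizer names mirror the ones used in the snippet below.

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
word_embeddings = model.get_input_embeddings().weight  # shape: (50257, 768)

def token_embedding(text):
    # Average the embeddings of the token(s) the text maps to
    ids = tokenizer.encode(text)
    return word_embeddings[ids].mean(dim=0)

a = token_embedding(" cat")
b = token_embedding(" dog")
print("cosine similarity:", torch.cosine_similarity(a, b, dim=0).item())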

Here's a sample of some code used to generate optimal inputs given a desired token output. Full repo here.

from time import time

target_output = " the best player?"
input_len = 2

# word_embeddings, tokenizer and optimise_input are defined in the linked repo
base_input = word_embeddings[tokenizer.encode(target_output)].mean(dim=0)
base_input = base_input.repeat(1, input_len, 1)

tic = time()
oi = optimise_input(base_input=base_input, plt_loss=False, verbose=2,
                    epochs=500, lr_decay=False, return_early=False, lr=0.1,
                    batch_size=20, target_output=target_output, output_len=4,
                    input_len=input_len, w_freq=20, dist_reg=1, perp_reg=0,
                    loss_type='log_prob_loss', noise_coeff=0.75)
toc = time()
tt = toc - tic
print('Time Taken: ', tt)

This code optimizes the input embeddings to trigger a desired output token using gradient descent. By tweaking the input and observing the model’s outputs, we can gain insights into what features and associations the model has learned.

The sets of best inputs + outputs were then mapped to embeddings and run through a k-means clustering algorithm (ELI5: a process of iteratively grouping data points based on feature similarities). In this case, it was used to group semantically related tokens and identify which ones were closest to the overall centroid of the data set in the embedding space. Through this process of centroid-proximate estimation, researchers made a surprising discovery about the nature of these anomalous tokens and what they represent.
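Here's a hedged sketch of what that clustering step could look like (not the authors' exact pipeline): run scikit-learn's k-means over GPT-2's token embeddings, then list the tokens sitting closest to the overall centroid of the vocabulary. The cluster count and distance metric here are arbitrary choices of mine.

import numpy as np
from sklearn.cluster import KMeans
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach().numpy()  # (50257, 768)

# Group semantically related tokens
kmeans = KMeans(n_clusters=100, n_init="auto", random_state=0).fit(emb)

# Distance of every token embedding to the centroid of the whole vocabulary
overall_centroid = emb.mean(axis=0)
dists = np.linalg.norm(emb - overall_centroid, axis=1)

# In the original research, the anomalous tokens tended to sit near this centroid
for token_id in np.argsort(dists)[:10]:
    print(int(token_id), kmeans.labels_[token_id], repr(tokenizer.decode([int(token_id)])))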

What Came of It

The discovery of "SolidGoldMagikarp" and other anomalous tokens revealed that these tokens were closer to the centroid of the entire training data, rather than pointing to their respective cluster centroids as originally thought.

Used as text prompts, these anomalous tokens let us make sense of tokenizer behavior, and therefore narrow down which GPT dataset a model was trained on, by running prompts like these:

[Image: example anomalous-token prompts and the model's responses]
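As a small companion check you can run locally (a sketch, assuming the tiktoken library): see whether a suspect string maps to a single token under the old GPT-2 BPE versus a newer encoding like cl100k_base. Which vocabulary a model "speaks" is one clue to its lineage; " petertodd" is another token reported in the original research.

import tiktoken

suspects = [" hello", " SolidGoldMagikarp", " petertodd"]

for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    for s in suspects:
        ids = enc.encode(s)
        print(f"{name:12s} {s!r:24s} -> {len(ids)} token(s): {ids}")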

Implications and Future Applications

The finding of anomalous token fingerprints has some pretty big implications beyond just figuring out where a model comes from. It could help us check for biases and limitations in models based on the data they were trained on, or see what kind of associations and knowledge they’ve truly learned. As language models start being used more and more in everyday life, this kind of forensic analysis could be key in making sure they’re transparent, fair, and doing what they’re supposed to do.

The Takeaway

Using these “weird tokens”, “glitch tokens”, “aberrant tokens”… whatever you want to call them, gives us a cool peek into how LLMs work behind the scenes. By causing models to generate unexpected or even straight-up rude outputs, they reveal the edges of a model’s learned knowledge and the quirks of its training data. As with the mystery of gpt2-chatbot, you too, dear reader, can use these sleeper-cell activation words to sleuth out the hidden lineages and capabilities of these models. Other strategies include prompt engineering techniques, which I’ll cover in a later article. As research continues in AI, ML, and NLP, I’m sure it’ll help us better audit, refine, and align language models to be safer and more reliable. Neat, right?


Thanks for reading! This is a bastardization of my Medium article, produced by using AI to translate it into a blog post for my website.

Back to Home