AI Content Detectors

Detection of LLM-Generated Text - Binoculars

Binoculars proposes measuring whether the tokens that appear in a string are surprising relative to the baseline perplexity of an LLM acting on the same string. A string might have properties that result in high perplexity when completed by any agent, machine or human, but the expectation is that the next-token choices of humans will have even higher perplexity than those of a machine. By normalizing the observed perplexity by the expected perplexity of a machine acting on the same text, we arrive at a detection metric that is fairly invariant to the prompt.

Formula:

  1. Probability of next token

A string of characters s can be parsed into tokens and represented as a list of token indices ⃗x by a tokenizer T. Let xi denote the token ID of the i-th token, which refers to an entry in the LLM's vocabulary V = {1, 2, ..., n}. Given a token sequence as input, a language model M predicts the next token by outputting a probability distribution over the vocabulary:

M(x_1, ..., x_{i-1}) = (p_1, ..., p_n) ∈ [0, 1]^n,  where p_j = P(x_i = j | x_1, ..., x_{i-1})
  2. Perplexity - Average negative log-likelihood of all tokens

We will abuse notation and abbreviate M(T(s)) as M(s) where the tokenizer is implicitly the one used in training M. For our purposes, we define log PPL, the log-perplexity, as the average negative log-likelihood of all tokens in the given sequence. Formally, let

log PPL_M(s) = -(1/L) Σ_{i=1}^{L} log M(s)_i[x_i]

where L is the number of tokens in s and M(s)_i[x_i] denotes the probability that M assigns to the observed token x_i at position i.
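The definition above can be computed directly from the per-token probabilities a model assigns to the observed tokens. A minimal sketch with toy probability values (the numbers are illustrative only, not outputs of any real model):

```python
import math

def log_ppl(token_probs):
    """Average negative log-likelihood of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy per-token probabilities a model assigned to the observed tokens.
predictable = [0.9, 0.8, 0.9, 0.7]   # model finds the text unsurprising
surprising  = [0.1, 0.2, 0.05, 0.1]  # model finds the text surprising

# Lower log-perplexity means the text looks less surprising to the model.
assert log_ppl(predictable) < log_ppl(surprising)
```

In practice these probabilities would come from a causal LLM's softmaxed logits at each position.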
  3. Cross Perplexity - How surprising the output of one model is to another

Intuitively, log-perplexity measures how “surprising” a string is to a language model. As mentioned above, perplexity has been used to detect LLMs, as humans produce more surprising text than LLMs. This is reasonable, as log PPL is also the loss function used to train generative LLMs, and models are likely to score their own outputs as unsurprising.

Our method also measures how surprising the output of one model is to another. We define the cross-perplexity, which takes two models and a string as its arguments. Let log X-PPLM1,M2 (s) measure the average per-token cross-entropy between the outputs of two models, M1 and M2 , when operating on the tokenization of s.

log X-PPL_{M1,M2}(s) = -(1/L) Σ_{i=1}^{L} M1(s)_i · log M2(s)_i

Note that · denotes the dot product between two vector-valued quantities.
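The cross-perplexity definition can be sketched in a few lines of NumPy. The distributions below are toy values over a 3-token vocabulary, not outputs of real models:

```python
import numpy as np

def log_x_ppl(p1, p2):
    """Average per-token cross-entropy between two models' next-token
    distributions. p1, p2: arrays of shape (L, vocab_size), each row a
    probability distribution over the vocabulary."""
    # Per-position dot product of M1's probabilities with M2's log-probabilities.
    return -np.mean(np.sum(p1 * np.log(p2), axis=1))

# Toy next-token distributions at 2 positions over a 3-token vocabulary.
m1 = np.array([[0.7, 0.2, 0.1],
               [0.6, 0.3, 0.1]])
m2_similar   = m1.copy()
m2_different = np.array([[0.1, 0.2, 0.7],
                         [0.1, 0.3, 0.6]])

# Similar models find each other's predictions less surprising.
assert log_x_ppl(m1, m2_similar) < log_x_ppl(m1, m2_different)
```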

  4. Binoculars Score

We define the Binoculars score B as a sort of normalization or reorientation of perplexity. In particular, we look at the ratio of perplexity to cross-perplexity.

B_{M1,M2}(s) = log PPL_{M1}(s) / log X-PPL_{M1,M2}(s)

Here, the numerator is simply the perplexity, which measures how surprising a string is to M1. The denominator measures how surprising the token predictions of M2 are when observed by M1. Intuitively, we expect a human to diverge from M1 more than M2 diverges from M1, provided the LLMs M1 and M2 are more similar to each other than they are to a human. The Binoculars score is a general mechanism that captures a statistical signature of machine text. In the sections below, we show that for most obvious choices of M1 and M2, Binoculars separates machine and human text much better than perplexity alone. Importantly, it is capable of detecting generic machine text generated by a third model altogether.
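Putting the pieces together, the full score is the ratio of the two quantities above. The sketch below uses hand-made toy distributions rather than real Falcon outputs; in a real detector the threshold separating "machine" from "human" would be calibrated on held-out data:

```python
import numpy as np

def log_ppl(p1, token_ids):
    """Log-perplexity of the observed tokens under M1's distributions (L x V)."""
    return -np.mean(np.log(p1[np.arange(len(token_ids)), token_ids]))

def log_x_ppl(p1, p2):
    """Average per-token cross-entropy between M1's and M2's distributions."""
    return -np.mean(np.sum(p1 * np.log(p2), axis=1))

def binoculars_score(p1, p2, token_ids):
    return log_ppl(p1, token_ids) / log_x_ppl(p1, p2)

# Toy next-token distributions over a 3-token vocabulary for M1 and M2.
p1 = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]])
p2 = np.array([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1]])

machine_tokens = [0, 0, 0]  # high-probability tokens, as a model would pick
human_tokens   = [2, 2, 1]  # less likely choices, as a human might make

# Machine-like text yields a lower Binoculars score than human-like text.
assert binoculars_score(p1, p2, machine_tokens) < binoculars_score(p1, p2, human_tokens)
```

Text is then classified as machine-generated when its score falls below the calibrated threshold.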

 

Different combinations of scoring models, evaluated on datasets:

Screenshot (21)-20240517-092325.png

As shown in the table, using the Falcon-7B family as scoring models gives a competitive edge in performance. So, we use Falcon-7B as the cross-PPL scorer and Falcon-7B-Instruct as the PPL scorer.

Test Results

We tested the accuracy on three different datasets:

  1. 500 texts written by humans

  2. 500 texts generated by AI

  3. 50 texts generated by AI with a prompt asking it to write as if by a human

image-20240417-013015.png

Paper - https://arxiv.org/abs/2401.12070

SynthID-Text

Research paper: Scalable watermarking for identifying large language model outputs | Nature

Introduction

In today's rapidly evolving technological landscape, artificial intelligence (AI) has significantly transformed various sectors, including content creation. While the potential benefits of AI are substantial, challenges arise, especially in distinguishing between content created by humans and that generated by AI. To tackle this issue, Google DeepMind has developed SynthID, an innovative technology aimed at identifying AI-generated text.

One effective strategy for addressing the challenge of distinguishing between human and AI-generated content is text watermarking. This technique embeds a subtle signature within the generated text, allowing for its later identification. A notable approach within this realm is generative watermarking, which integrates the watermark directly into the text generation process. By modifying the probability distribution of the next token, the LLM introduces a statistical signature that can be detected later. This method has the advantage of minimal computational overhead, and detection does not require access to the underlying LLM, making it suitable for large-scale implementation.

How Generative Watermarking Works

The text generation process of an LLM is inherently autoregressive. It assigns probabilities to various vocabulary elements and selects the next token based on these probabilities. Generative watermarking exploits this autoregressive nature by altering the next-token sampling procedure. This adjustment introduces unique context-specific alterations to the generated text, resulting in a distinct statistical signature. During detection, this signature can be analyzed to ascertain if the text originated from the watermarked LLM.
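The generic recipe can be illustrated with a deliberately simplified sketch: derive a pseudorandom seed from the recent context and a secret key, then use it to assign each candidate token a watermarking value that the sampler can favor. The function names (`context_seed`, `g_value`) and the hash-based construction are assumptions for illustration, not SynthID-Text's actual algorithm:

```python
import hashlib

def context_seed(context_tokens, window=4, key="secret-watermark-key"):
    """Derive a pseudorandom seed from the last `window` tokens and a key."""
    payload = key + "|" + "|".join(map(str, context_tokens[-window:]))
    return int(hashlib.sha256(payload.encode()).hexdigest(), 16)

def g_value(token_id, seed):
    """Pseudorandom watermarking bit for (token, seed)."""
    h = hashlib.sha256(f"{seed}:{token_id}".encode()).hexdigest()
    return int(h, 16) & 1

# At each step the sampler can bias its choice toward tokens with g == 1;
# a detector holding the key recomputes g over the text and checks whether
# the average g-value is suspiciously high.
seed = context_seed([5, 17, 42, 8])
assert g_value(7, seed) in (0, 1)
assert g_value(7, seed) == g_value(7, seed)  # deterministic, hence detectable
```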

 

image-20241112-123813.png

 

SynthID-Text represents a novel generative watermarking scheme that enhances existing techniques. It introduces a new sampling algorithm known as Tournament sampling, which can be configured for either non-distortionary (preserving text quality) or distortionary (enhancing watermark detectability) modes. In both configurations, SynthID-Text demonstrates superior detection rates compared to previous methods. The non-distortionary mode has already been successfully integrated into systems like Gemini and Gemini Advanced, showcasing its practical applicability in real-world scenarios.

Watermarking with SynthID-Text

LLMs generate text by considering preceding context, such as responding to a prompt. Given a sequence of tokens, the LLM calculates the probability distribution of the next token based on prior context. The generative watermarking scheme comprises three main components: a random seed generator, a sampling algorithm, and a scoring function. The random seed generator provides a seed for each generation step, while the sampling algorithm uses this seed to select the next token, introducing correlations that are measurable during watermark detection.

The Tournament sampling method is particularly noteworthy. It employs a tournament-like process to select an output token based on scores from multiple watermarking functions. By sampling multiple candidate tokens and iteratively eliminating lower-scoring options, the algorithm identifies a final output token that embeds the watermark.
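The knockout structure of Tournament sampling can be sketched as follows. This is a toy illustration under stated assumptions (binary g-values derived from a hash, 2^m candidates, random tie-breaking); the watermarking functions and configuration in SynthID-Text itself differ:

```python
import hashlib
import random

def g(token_id, seed, layer):
    """Layer-specific pseudorandom watermark score in {0, 1} (toy version)."""
    h = hashlib.sha256(f"{seed}:{layer}:{token_id}".encode()).hexdigest()
    return int(h, 16) & 1

def tournament_sample(candidates, seed, rng):
    """Select one token from 2**m sampled candidates via m knockout layers.
    Each layer pairs up the survivors; the higher g-value wins (ties random)."""
    layer = 0
    while len(candidates) > 1:
        survivors = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga, gb = g(a, seed, layer), g(b, seed, layer)
            if ga != gb:
                survivors.append(a if ga > gb else b)
            else:
                survivors.append(rng.choice((a, b)))
        candidates = survivors
        layer += 1
    return candidates[0]

rng = random.Random(0)
# 2**3 = 8 candidate tokens sampled (with replacement) from the LLM distribution.
candidates = [3, 7, 7, 1, 9, 3, 2, 7]
winner = tournament_sample(candidates, seed=12345, rng=rng)
assert winner in candidates
```

Because winners systematically have higher g-values, the chosen tokens carry a statistical bias that the detector can later measure.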

 

image-20241112-123946.png

 

Watermark Detection

To detect watermarked text, SynthID-Text calculates a score based on the alignment between the generated text and the random seeds used during its creation. A higher score indicates a greater likelihood that the text is watermarked. Factors such as text length and the entropy of the LLM distribution significantly influence detection performance. Longer texts provide more evidence, while higher entropy allows for a broader range of token selections, thereby enhancing watermark strength.
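A simplified picture of detection, reusing the toy hash-based g-values from above: recompute the watermarking values for the observed tokens and compare the mean against what unwatermarked text would yield (about 0.5 for binary values). SynthID-Text's actual scoring is more sophisticated; this mean-score sketch only illustrates the principle:

```python
import hashlib

def g(token_id, seed):
    """Pseudorandom watermark bit, recomputable by anyone holding the seeds."""
    h = hashlib.sha256(f"{seed}:{token_id}".encode()).hexdigest()
    return int(h, 16) & 1

def detection_score(token_ids, seeds):
    """Mean g-value over the text. Unwatermarked text averages about 0.5;
    text whose sampler favored g == 1 tokens scores higher."""
    return sum(g(t, s) for t, s in zip(token_ids, seeds)) / len(token_ids)

seeds = [101, 202, 303, 404, 505]  # one watermarking seed per generation step
# Simulate a watermarking sampler that preferred a g == 1 token at each step:
watermarked = [max(range(50), key=lambda t: g(t, s)) for s in seeds]

assert detection_score(watermarked, seeds) >= 0.8
```

This also shows why text length matters: each additional token contributes one more observation to the score, so longer texts give the detector more evidence.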

Limitations

  1. Model Dependency: SynthID can only identify AI-generated content if it was created using a model that incorporates SynthID's watermarking technology. If a model does not use SynthID, the technology is ineffective. Currently, only models developed by Google integrate SynthID watermarking.

  2. Watermark application is less effective on factual responses, as there is less opportunity to augment generation without decreasing accuracy.

  3. Detector confidence scores can be greatly reduced when AI-generated text is thoroughly rewritten or translated into another language.

In summary, while SynthID and generative watermarking represent significant advancements in the identification of AI-generated text, challenges remain in ensuring comprehensive applicability and effectiveness across diverse contexts. As AI continues to evolve, so too will the methods for distinguishing its outputs from human-created content.