Moderation Layer - Features

Components of Input
Components of Output
- Template-based guardrails
- Model-based guardrails

Components of Input

Prompt and Temperature Parameters

The FM Moderation module requires two input parameters: Prompt and Temperature. The Prompt is the first text provided to the model to generate a response, while the Temperature parameter controls the creativity of the response. A lower temperature value produces more precise and deterministic outputs, while a higher value introduces more creative elements into the response.

Multi-Lingual Support

Multilingual support in FM (content) moderation entails the ability to process prompts and content in various languages. When a prompt is provided in any language, the system automatically detects the language and translates it to English for further processing. This functionality allows users to select options such as Google, Azure, or none, while ensuring seamless backend operations with language detection and translation mechanisms in place. This approach enables efficient and effective moderation across different languages, using automated processes to manage content consistently and accurately.

Emoji Moderation

In FM moderation, emoji support allows users to include emojis to convey sentiments or expressions within sentences. By selecting the "emoji" option and specifying "yes," users show the presence of emojis in their input. Behind the scenes, the system detects these emojis and seamlessly integrates them into the moderation process. This functionality ensures that emojis are recognized and interpreted as part of the input, enabling the moderation system to effectively consider them when generating results or making decisions. This capability significantly enhances the system's ability to understand and respond to content having emojis, thereby improving the accuracy and relevance of moderation outcomes.

Prompt Template

Goal Priority

Goal Priority is nothing but LLM checking involves assessing an LLM prompt for potential harmless and helpful. Importantly, you need to always prioritize safety over helpfulness. That is, if answering the user query could be unsafe or harmful, LLM refuses to answer. Otherwise provides a thorough and precise response, ensuring clarity, specificity, avoiding harmful stereotypes or discriminatory language. By conducting a smooth LLM check, developers can mitigate risks of generating harmful or misleading content, enhancing the overall quality and safety of the LLM's responses.

Self-Remainder

Self Remainder is Nothing but A Bergeron check in the context of LLM prompts is a safety measure designed to prevent jailbreaking, or malicious manipulation, of the model. It involves analyzing the prompt for patterns or keywords associated with harmful or unintended behaviors. This can include detecting prompts that request harmful content, promote violence, or try to exploit vulnerabilities in the model. By showing and mitigating these prompts, Bergeron checks help ensure the LLM operates safely and ethically.

Cove Complexity

Cove Complexity can be considered a synonymous term for The Cain of Verification. It is a rigorous evaluation method for LLMs that subjects generated responses to a series of increasingly complex logical questions. By tailoring the query difficulty to the selected level (simple, medium, or complex), this process effectively probes the LLM's ability to provide right, consistent, and logically sound information. Essentially, it acts as a quality control checkpoint, identifying potential weaknesses in the model's reasoning capabilities and factual knowledge.

LLM Model

LLM explainability is still a significant challenge for both GPT-3 and GPT-4. While GPT-4 generally demonstrates improved performance, understanding the internal decision-making processes of these models is still limited. Both models use as black boxes, making it difficult to pinpoint the exact reasoning behind their outputs. Researchers and engineers are actively exploring techniques like attention visualizations and model introspection to shed light on these complex systems, but a comprehensive understanding of LLM explainability is yet to be achieved.

Components of Output

In FM moderation, two types of outputs are distinguished: template-based guardrails and model-based guardrails.

Template-based guardrails

Where we make use of dynamic and efficient prompt templates through Prompt Engineering that enhance the detection capability of the llms to detect and block the adversarial attacks.

Request Moderation

In the Request Moderation layer, various checks are performed on the input prompt before generating a response. These checks include Prompt Injection, Jailbreak, Fairness and Bias, language critique coherence, language critique fluency, language critique grammar, language critique politeness, evaluator check, context relevance, context conciseness and context reranking. These checks ensure that the input prompt adheres to the defined guidelines and standards.

Prompt Injection Check

This check evaluates the presence of injected content in the input prompt. It calculates an injection confidence score based on the prompt and compares it against a dynamic injection threshold. The result writes down whether the check passes or fails.

Jailbreak Check

The Jailbreak check examines the input prompt for potential attempts to manipulate or bypass the moderation system. It calculates a jailbreak similarity score and compares it against a dynamic jailbreak threshold. The result says whether the check passes or fails.

Fairness and Bias

In the context of request moderation, fairness and bias analysis aims to show potential discriminatory patterns or biases in the system's responses. This analysis typically examines various aspects of the moderation process.

Language Critique Coherence

An evaluation of the logical connection and consistency between distinct parts of an explanation. It figures out how well the ideas are organized and linked together to form a cohesive narrative.

Language Critique Fluency

A measure of the smoothness and naturalness of the language used in an explanation. It assesses the flow of ideas and the overall readability of the text.

Language Critique Grammar

An assessment of the grammatical correctness and adherence to language conventions in an explanation. It evaluates the syntax, punctuation, and overall linguistic accuracy of the text.

Language Critique Politeness

A measure of the appropriateness and respectfulness of the language used in an explanation. It assesses the tone and choice of words to figure out if the explanation is polite and considerate of the audience.

Evaluator Check

Before delving into the output, it's essential to understand the concept. Elevator check request moderation typically refers to a process where given requests for elevator inspections, repairs, or maintenance are reviewed and approved or rejected based on certain criteria.

Infosys Advanced Jailbreak Check

Assuming "elevator check" is a misnomer or typo, and the actual focus is on security, Infosys Advanced Jailbreak Check likely refers to a system that assesses devices or systems for vulnerabilities that could potentially allow unauthorized access or control. This might involve checking for compromised software, weak passwords, or other security risks.

Infosys Random Noise Check

This could refer to a quality control or maintenance procedure where random elevators are checked for unusual noises. The output might include a list of elevators checked, the nature of the noise (if any), and the corresponding actions taken (e.g., maintenance scheduled).

Context Relevance

Evaluates how relevant the retrieved context is to the question specified. Context relevance score measures if the retrieved context has enough information to answer the question being asked. This check is important since a bad context reduces the chances of the model giving a relevant response to the question asked, as well as leads to hallucinations.

Context Conciseness

Evaluates the concise context cited from an original context for irrelevant information. Context conciseness refers to the quality of a reference context generated from retrieved context in terms of being clear, brief, and to the point. A concise context effectively conveys the necessary information without unnecessary elaboration or verbosity.

Context Reranking

Evaluates how efficient the reranked context is compared to the original context. Context Reranking reflects the efficiency of the reranking process applied to the original context in generating the new renamed context used to answer a given question. This operator assesses the degree to which the reranked context enhances the relevance, coherence, and informativeness with respect to the provided question.

Response Moderation

In the Response Moderation layer, the generated response from the Request Moderation layer is further evaluated before being presented to the user. This evaluation includes checks for language critique coherence, language critique fluency, language critique grammar, language critique politeness y. All the checks which are there in Request Moderation layer are same with response Moderation also except for response completeness, response conciseness, response validity and response completeness wrt context.

Response Completeness

Checks whether the response has answered all the aspects of the question specified. Response completeness score measures if the generated response has adequately answered all aspects to the question being asked. This check is important to ensure that the model is not generating incomplete responses.

Response Conciseness

Grades how concise the generated response is or if it has any other irrelevant information for the question asked. Response conciseness score measures whether the generated response holds any other information irrelevant to the question asked.

Response Validity

Checks if the response generated is valid or not. A response is valid if it holds any information. In some cases, an LLM might not generate a response due to reasons like limited knowledge or the asked question not being clear. Response Validity score can be used to name these cases, where a model is not generating an informative response.

Response Completeness Wrt Context

Response completeness with respect to context in response moderation refers to the ability of a moderation system to accurately assess and address the content of a response based on its surrounding context. This involves understanding the nuances of the conversation, the intent of the user, and the potential impact of the response.

Response Comparison

The Response Comparison element compares the generated response from the FM Moderation module with a reference response from Infosys guardrail. This comparison helps ensure consistency and evaluate the effectiveness of the moderation process.

The FM Moderation module ensures that the generated responses are compliant, safe, and aligned with defined guidelines. It provides granular control over the content generated by AI language models, reducing risks associated with inappropriate or harmful outputs.

Response with Infosys RAI guardrails

Infosys RAI guardrails for response comparison likely involve a system that evaluates and compares generated responses against predefined quality standards and ethical guidelines. This process ensures that AI-generated content aligns with Infosys' values, is exact, relevant, and free from biases. By comparing responses to these guardrails, the system can name potential issues, suggest improvements, and keep a prominent level of quality and consistency in the AI outputs.

Results with gpt4

Comparing ChatGPT-4 responses involves analyzing the quality, relevance, and coherence of its outputs across various prompts and contexts.

Model-based guardrails

Model-based guardrails we use pre-trained models trained on extensive billion-parameter datasets to effectively combat adversarial attacks. These models are equipped with sophisticated algorithms capable of discerning and mitigating malicious content by employing various metrics such as scores and thresholds. These metrics are meticulously set based on detection entities or through comparative analysis of text embeddings, ensuring robust detection and response mechanisms against adversarial threats. This approach enables our system to support ambitious standards of security and reliability, safeguarding against potential risks and ensuring the integrity of moderated content.

Request Moderation

In the Request Moderation layer, various checks are performed on the input prompt before generating a response. These checks include Prompt Injection, Jailbreak, Privacy, Profanity, Toxicity, and Restricted Topic checks. These checks ensure that the input prompt adheres to the defined guidelines and standards.

Prompt Injection Check

This check evaluates the presence of injected content in the input prompt. It calculates an injection confidence score based on the prompt and compares it against a dynamic injection threshold. The result writes down whether the check passes or fails.

Jailbreak Check

The Jailbreak check examines the input prompt for potential attempts to manipulate or bypass the moderation system. It calculates a jailbreak similarity score and compares it against a dynamic jailbreak threshold. The result writes down whether the check passes or fails.

Privacy Check

The Privacy check allows configuration of specific entities that should be recognized and protected within the prompt, such as Aadhar numbers, passport details, or PAN numbers. It names the recognized entities and compares them against the configured entities to block. The result writes down whether the check passes or fails.

Profanity Check

The Profanity check names profane words within the input prompt. It reports the profane words named and compares them against a customized threshold. If the number of occurrences of profane words is greater than the threshold, it falls to failed category.

Toxicity Check

The Toxicity check assesses the level of toxicity in the input prompt based on metrics such as toxicity, severe toxicity, obscenity, identity attack, insult, threat, and sexual explicitness. It compares the calculated toxicity scores against a predefined toxicity threshold and at last we can assess whether response passes or fails.

Restricted Topic Check

The Restricted Topic check ensures that certain predefined topics, such as explosives, terrorism, or political subjects, are not included in the input prompt. We can configure what topics needs to be restricted.

Response Moderation

In the Response Moderation layer, the generated response from the Request Moderation layer is further evaluated before being presented to the user. This evaluation includes checks for Text Quality, Text Relevance, and Refusal. All the checks which are there in Request Moderation layer are same with response Moderation also except jailbreak check and Prompt injection check since the response from the LLM may not hold these attacks. Other than it has Privacy Check, Profanity Check, Toxicity Check, Restricted Topic Check.

Privacy Check

The Privacy check allows configuration of specific entities that should be recognized and protected within the prompt, such as Aadhar numbers, passport details, or PAN numbers. It finds the recognized entities and compares them against the configured entities to block. The result shows whether the check passes or fails.

Restricted Topic Check

The Restricted Topic check ensures that certain predefined topics, such as explosives, terrorism, or political subjects, are not included in the input prompt. We can configure what topics needs to be restricted.

Toxicity Check

The Toxicity check assesses the level of toxicity in the input prompt based on metrics such as toxicity, severe toxicity, obscenity, identity attack, insult, threat, and sexual explicitness. It compares the calculated toxicity scores against a predefined toxicity threshold and at last we can assess whether response passes or fails.

Profanity Check

The Profanity check finds profane words within the input prompt. It reports the profane words shown and compares them against a customized threshold. If the number of occurrences of profane words is greater than the threshold, it falls to failed category.

Text Quality

The Text Quality check assesses the readability and grade level of the generated response. Score stands for the grade level using this scale below.

Score	School level	Remarks
100.00-90.00	5th grade	Quite easy to read. Easily understood by an average 11-year-old student.
90.0-80.0	6th grade	Easy to read. Conversational English for consumers.
80.0-70.0	7th grade	Fairly easy to read.
70.0-60.0	8th & 9th grade	Plain English. Easily understood by 13- to 15-year-old students.
60.0-50.0	10th to 12th grade	Fairly difficult to read.
50.0-30.0	College	Difficult to read.
30.0-10.0	College graduate	Exceedingly difficult to read. Best understood by university graduates.
10.0-0.0	Professional	Extremely difficult to read. Best understood by university graduates.

Text Relevance Check

The Text Relevance check measures the relevance of the generated response with respect to the input prompt.

Refusal Check

The Refusal check finds cases where the language model refuses to provide a response due to content that violates norms or guidelines.

Additional Optional Checks added for Request Moderation and Response Moderation under Model-based Guardrails

Invisible Text Check

It is a package-based evaluation check designed to detect and remove non-printable, invisible Unicode characters from text inputs.

Gibberish Check

It is a model-based check designed to identify and filter out gibberish or nonsensical inputs in English language text.

Ban Code Check

It is a model-based check designed to detect code snippet in the prompt.

Sentiment Check

It is a package-based check with the primary objective to gauge the sentiment of a given prompt.