List of Metrics

Metrics Detail & Threshold information

| # | Phase | Type of Guardrails | Metric | Score Range / Output | Threshold** | Description |
|---|-------|--------------------|--------|----------------------|-------------|-------------|
| 1 | Request Moderation | Model Based Guardrails | Prompt Injection Score | 0 to 1 | 0.7 | A score of 0.7 or more indicates the model is highly confident the prompt is injected; a low score (e.g., 0 or 0.1) indicates the model is confident the prompt is legitimate. |
| 2 | Request Moderation | Model Based Guardrails | Jailbreak Score | 0 to 1 | 0.7 | A score of 0.7 or more indicates that the prompt contains a jailbreak attack. |
| 3 | Request Moderation | Model Based Guardrails | Privacy Check | None / List of entities | 1 PII entity | The response is a list of detected PII entities; if even one entity is detected, the prompt is blocked. |
| 4 | Request Moderation | Model Based Guardrails | Profanity Check | None / List of entities | 1 word | If one or more profane words are detected in the prompt, it is blocked. |
| 5 | Request Moderation | Model Based Guardrails | Restricted Topic Check | 0 to 1 | 0.7 | The response contains a score for each restricted-topic label passed in the input; if the score for even one label is 0.7 or more, the model is confident the text falls under that category and the prompt is blocked. |
| 6 | Request Moderation | Model Based Guardrails | Toxicity Check | 0 to 1 | 0.7 | If the final toxicity score is 0.7 or more, the model has classified the prompt as highly toxic and the prompt is blocked. |
| 7 | Request Moderation | Model Based Guardrails | Sentiment Check | -1 to 1 | -0.01 | A score of -0.01 or less means the sentiment of the sentence is negative; the higher the score, the more positive the sentiment. |
| 8 | Request Moderation | Model Based Guardrails | Invisible Text Check | None / List of entities | 1 entity | The response is a list of all entities in the prompt that fall under invisible-text categories; if even one such entity is detected, the prompt is blocked. |
| 9 | Request Moderation | Model Based Guardrails | Gibberish Check | 0 to 1 | 0.7 | If the text in the prompt is labelled "word salad", "noise", or "mild gibberish" with a score of 0.7 or more, the prompt is blocked. |
| 10 | Request Moderation | Model Based Guardrails | Ban Code Check | 0 to 1 | 0.7 | If the model detects code in the prompt with a score of 0.7 or above, the prompt is blocked. |
| 11 | Request Moderation | Template Based Guardrails | Prompt Injection Score | 0 to 1 | 0.6 | If the LLM detects prompt injection in the prompt with a score of 0.6 or more, the prompt is blocked. |
| 12 | Request Moderation | Template Based Guardrails | Jailbreak Score | 0 to 1 | 0.6 | If the LLM detects a jailbreak in the prompt with a score of 0.6 or more, the prompt is blocked. |
| 13 | Request Moderation | Template Based Guardrails | Fairness & Bias Check | Low/Medium/High/Neutral | | Categories are determined by the Fairness Score generated by the LLM, which ranges from 0 to 100. 0: Neutral, 1-30: Low, 31-70: Medium, >70: High. |
| 14 | Request Moderation | Template Based Guardrails | Language Critique Coherence Check | 0 to 1 | 0.6 | If the LLM detects coherence issues in the prompt with a score of 0.6 or more, the prompt is blocked. |
| 15 | Request Moderation | Template Based Guardrails | Language Critique Fluency Check | 0 to 1 | 0.6 | If the LLM detects fluency issues in the prompt with a score of 0.6 or more, the prompt is blocked. |
| 16 | Request Moderation | Template Based Guardrails | Language Critique Grammar Check | 0 to 1 | 0.6 | If the LLM detects grammar issues in the prompt with a score of 0.6 or more, the prompt is blocked. |
| 17 | Request Moderation | Template Based Guardrails | Language Critique Politeness Check | 0 to 1 | 0.6 | If the LLM detects politeness issues in the prompt with a score of 0.6 or more, the prompt is blocked. |
| 18 | Response Moderation | Model Based Guardrails | Privacy Check | None / List of entities | 1 PII entity | The response is a list of detected PII entities; if even one entity is detected, the generated response is blocked. |
| 19 | Response Moderation | Model Based Guardrails | Restricted Topic Check | 0 to 1 | 0.7 | The response contains a score for each restricted-topic label passed in the payload; if the score for even one label is 0.7 or more, the model is confident the response falls under that category and the response is blocked. |
| 20 | Response Moderation | Model Based Guardrails | Toxicity Check | 0 to 1 | 0.7 | If the final toxicity score is 0.7 or more, the model has classified the response as highly toxic and the response is blocked. |
| 21 | Response Moderation | Model Based Guardrails | Text Relevance Check (Similarity Score) | 0 to 100 | NA | Checks the similarity between the given prompt and the generated response. |
| 22 | Response Moderation | Model Based Guardrails | Refusal Check | 0 to 1 | 0.7 | If the score is 0.7 or more, the response is blocked because it is highly similar to the foundation model's refusal responses. |
| 23 | Response Moderation | Model Based Guardrails | Text Quality | Grades | NA | Assesses the readability and grade level of the generated response. For scoring and grading information refer here - https://infosys.atlassian.net/wiki/spaces/IF/pages/975077603/Moderation Layer - Features#[inlineExtension]Text-Quality |
| 24 | Response Moderation | Model Based Guardrails | Profanity Check | None / List of entities | 1 word | If one or more profane words are detected in the response, it is blocked. |
| 25 | Response Moderation | Model Based Guardrails | Sentiment Check | -1 to 1 | -0.01 | A score of -0.01 or less means the sentiment of the response is negative; the higher the score, the more positive the sentiment. |
| 26 | Response Moderation | Model Based Guardrails | Invisible Text Check | None / List of entities | 1 entity | The response is a list of all entities in the generated text that fall under invisible-text categories; if even one such entity is detected, the response is blocked. |
| 27 | Response Moderation | Model Based Guardrails | Gibberish Check | 0 to 1 | 0.7 | If the text in the response is labelled "word salad", "noise", or "mild gibberish" with a score of 0.7 or more, the response is blocked. |
| 28 | Response Moderation | Model Based Guardrails | Ban Code Check | 0 to 1 | 0.7 | If the model detects code in the response with a score of 0.7 or above, the response is blocked. |
| 29 | Response Moderation | Template Based Guardrails | Completeness Check | 0 to 1 | 0.6 | If the LLM detects completeness issues in the response with a score of 0.6 or more, the response is blocked. |
| 30 | Response Moderation | Template Based Guardrails | Conciseness Check | 0 to 1 | 0.6 | If the LLM detects conciseness issues in the response with a score of 0.6 or more, the response is blocked. |
| 31 | Response Moderation | Template Based Guardrails | Language Critique Coherence Check | 0 to 1 | 0.6 | If the LLM detects coherence issues in the response with a score of 0.6 or more, the response is blocked. |
| 32 | Response Moderation | Template Based Guardrails | Language Critique Fluency Check | 0 to 1 | 0.6 | If the LLM detects fluency issues in the response with a score of 0.6 or more, the response is blocked. |
| 33 | Response Moderation | Template Based Guardrails | Language Critique Grammar Check | 0 to 1 | 0.6 | If the LLM detects grammar issues in the response with a score of 0.6 or more, the response is blocked. |
| 34 | Response Moderation | Template Based Guardrails | Language Critique Politeness Check | 0 to 1 | 0.6 | If the LLM detects an impolite tone in the response with a score of 0.6 or more, the response is blocked. |
| 35 | Response Comparison | Template Based Guardrails | Hallucination Score (ThoT) | 0 to 100 | | Values closer to 0 indicate less hallucination. 0-30: Low, 30-50: Medium, >50: High. |
| 36 | Response Comparison | Template Based Guardrails | Uncertainty Score (ThoT, CoT, RE2, LoT, GoT#) | 0 to 100 | | Highly Certain: >=0 and <=30 (low uncertainty); Moderately Certain: >30 and <=70; Less Certain: >70 and <=100 (high uncertainty). |
| 37 | Response Comparison | Template Based Guardrails | Coherence Score (ThoT, CoT, RE2, LoT, GoT#) | 0 to 100 | | Less Coherent: >=0 and <=30; Moderately Coherent: >30 and <=70; Highly Coherent: >70 and <=100. |

**Please note that the thresholds outlined are general recommendations derived from industry best practices. These values are configurable and may be adjusted to align with the unique demands of each application.
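Since the thresholds are configurable, the blocking logic for request moderation can be sketched as a simple comparison against them. The sketch below is illustrative only: the function name, metric keys, and dictionary shapes are assumptions, not the moderation layer's actual API. Entity-based checks (Privacy, Profanity, Invisible Text) block on a single detection, while the score-based model guardrails block at 0.7 or above; the sentiment check (blocked when score <= -0.01) uses a different scale and is omitted for brevity.

```python
# Hypothetical sketch of applying the guardrail thresholds above.
# Names and keys are illustrative assumptions, not the real API.

MODEL_THRESHOLD = 0.7  # model-based guardrails; configurable per application


def moderate_prompt(scores, pii_entities, profane_words):
    """Return a block/allow verdict for a prompt.

    scores: metric name -> 0-to-1 score from the model-based checks
            (prompt injection, jailbreak, toxicity, gibberish, ...).
    pii_entities / profane_words: detected entity lists; even a single
            hit blocks the prompt.
    """
    # Score-based checks: block when any score reaches the threshold.
    reasons = [metric for metric, score in scores.items()
               if score >= MODEL_THRESHOLD]
    # Entity-based checks: one detected entity is enough to block.
    if pii_entities:
        reasons.append("privacy")
    if profane_words:
        reasons.append("profanity")
    return {"blocked": bool(reasons), "reasons": reasons}
```

For example, `moderate_prompt({"prompt_injection": 0.82, "toxicity": 0.1}, [], [])` would block with reason `"prompt_injection"`, while the same scores below 0.7 and empty entity lists would pass.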

# ThoT - Thread of Thoughts, CoT - Chain of Thoughts, RE2 - Re-Read Reasoning, LoT - Logic of Thoughts, GoT - Graph of Thoughts
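The banded 0-to-100 metrics (Fairness & Bias, Hallucination, Uncertainty, Coherence) map raw scores to categories rather than a single block threshold. A minimal sketch of that bucketing, using the band edges from the table (function names are assumptions):

```python
# Illustrative bucketing of 0-100 scores into the bands listed above.
# Band edges follow the table; function names are hypothetical.

def fairness_band(score):
    """Fairness Score: 0 Neutral, 1-30 Low, 31-70 Medium, >70 High."""
    if score == 0:
        return "Neutral"
    if score <= 30:
        return "Low"
    if score <= 70:
        return "Medium"
    return "High"


def uncertainty_band(score):
    """Uncertainty Score: <=30 Highly Certain, <=70 Moderately Certain,
    otherwise Less Certain (higher score means more uncertainty)."""
    if score <= 30:
        return "Highly Certain"
    if score <= 70:
        return "Moderately Certain"
    return "Less Certain"
```

The same pattern applies to the Hallucination and Coherence bands with their respective edges.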