Fairness & Bias - Technical Essentials
Fairness & Bias Evaluations for LLMs [Unstructured Data]
Prompt Templates for F&B evaluation:
Text: [Evaluate texts generated by LLM]
Given an unstructured text, a prompt template has been designed to check the text's fairness and evaluate its bias indicator [High / Medium / Low / Neutral] using GPT-4o and Llama. The prompt template also provides additional information such as the "Affected group", indicating the group of people affected by the context of the sentence, and the type of bias [Historical Bias, Confirmation Bias, etc.]. We are working to extend this prompt template to also generate a neutral version of the given text.
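The sketch below shows what such a prompt template can look like when wired to GPT-4o; the template wording, the evaluate_text_bias helper, and the JSON keys are illustrative assumptions rather than the exact production template.

```python
# Minimal sketch of a text F&B evaluation prompt template (illustrative wording;
# the production template and model configuration differ).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BIAS_EVAL_TEMPLATE = """You are a fairness and bias evaluator.
Analyze the text below and respond in JSON with the keys:
  "bias_indicator": one of ["High", "Medium", "Low", "Neutral"],
  "affected_group": the group of people affected by the context, or "None",
  "bias_type": e.g. "Historical Bias", "Confirmation Bias", or "None".

Text:
{text}
"""

def evaluate_text_bias(text: str) -> str:
    """Hypothetical helper: returns the model's JSON bias analysis for `text`."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": BIAS_EVAL_TEMPLATE.format(text=text)}],
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content

print(evaluate_text_bias("Older employees struggle to learn new software."))
```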
Image: [Evaluate images generated by LLM]
“A picture can speak a thousand words.”
True as this statement is, the context a viewer perceives from a picture can also differ from person to person. With this established, to judge whether a given picture/image is fair or biased, we rely on the input prompt that was given to the LLM to generate that image. The user's input prompt sets the contextual expectation, and the generated picture/image can be validated against that same context. For the template-based approach, we currently leverage GPT-4o's multimodal capabilities for evaluation, and we plan to extend this to Gemini in the future.
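A minimal sketch of this template-based image check, assuming the OpenAI chat completions API with a base64-encoded image; the helper name and the evaluation wording are illustrative.

```python
# Sketch: evaluate a generated image against the prompt that produced it,
# using GPT-4o's multimodal input. Prompt wording is illustrative.
import base64
from openai import OpenAI

client = OpenAI()

def evaluate_image_bias(image_path: str, generation_prompt: str) -> str:
    """Hypothetical helper: asks GPT-4o for a bias indicator for the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"The image below was generated from the prompt: "
                          f"'{generation_prompt}'. Rate its bias indicator "
                          f"[High / Medium / Low / Neutral], name any affected "
                          f"group, and explain briefly.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```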
Fairness & Bias Evaluations for Traditional Models [Structured Data]
Metrics
Based on the selected sensitive/protected attribute in the given dataset, the distribution of positive/favorable outcomes for each group is compared with that of the remaining groups, and the metrics below are calculated (a minimal computation sketch follows the metric list).
a. Pretrain & Post-train Methods:
i. Statistical Parity Difference:
The Statistical Parity Difference metric calculates the difference in the rate (proportion) of favorable outcomes between the unprivileged and privileged groups.

ii. Disparate Impact Ratio:
The Disparate Impact Ratio metric calculates the ratio of the favorable-outcome rate of the unprivileged group to that of the privileged group.

iii. Smooth Empirical Differential:
SED calculates the differential in the probability of favorable and unfavorable outcomes between intersecting groups divided by features. All intersecting groups are equal, so there are no unprivileged or privileged groups. The calculation produces a value between 0 and 1 that is the minimum ratio of Dirichlet smoothed probability for favorable and unfavorable outcomes between intersecting groups in the dataset.
iv. Four Fifths:
This function computes the four-fifths rule (ratio of success rates) between group_unprivileged and group_privileged. The minimum of the ratio taken both ways is returned. A value of 1 is desired; lower values indicate disparity, and the range (0.8, 1) is considered acceptable.
v. Cohen’s D:
This function computes the Cohen's D statistic (normalized statistical parity) between group_unprivileged and group_privileged. A value of 0 is desired. Negative values are unfair towards group_unprivileged; positive values are unfair towards group_privileged. Reference values: 0.2 is considered a small effect size, 0.5 medium, and 0.8 large.
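The sketch below illustrates how these pretrain/post-train metrics can be computed directly from binary outcomes with NumPy; the data are made up, and libraries such as AIF360 or holisticai provide equivalent, more robust implementations.

```python
# Illustrative computation of the metrics above. `y` holds binary favorable
# outcomes (1 = favorable); `priv` marks membership in the privileged group.
import numpy as np

y = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 1])
priv = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=bool)

p_priv = y[priv].mean()      # favorable-outcome rate, privileged group
p_unpriv = y[~priv].mean()   # favorable-outcome rate, unprivileged group

statistical_parity_difference = p_unpriv - p_priv           # 0 is ideal
disparate_impact_ratio = p_unpriv / p_priv                  # 1 is ideal
four_fifths = min(p_unpriv / p_priv, p_priv / p_unpriv)     # >= 0.8 acceptable

# Cohen's d: the rate difference normalized by the pooled standard deviation.
pooled_std = np.sqrt((y[priv].var(ddof=1) + y[~priv].var(ddof=1)) / 2)
cohens_d = (p_unpriv - p_priv) / pooled_std                 # 0 is ideal

print(statistical_parity_difference, disparate_impact_ratio, four_fifths, cohens_d)
```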

Exponentiated Gradient Reduction:
Exponentiated gradient reduction is an in-processing technique that reduces fair classification to a sequence of cost-sensitive classification problems, returning a randomized classifier with the lowest empirical error subject to fair-classification constraints. The user provides the dataset and its sensitive attributes; a new scikit-learn classification model is instantiated and made aware of these sensitive attributes to produce fairer predictions.
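A minimal sketch of this reduction using fairlearn's ExponentiatedGradient wrapped around a scikit-learn estimator; the dataset, sensitive column, and choice of constraint are illustrative.

```python
# Sketch: exponentiated gradient reduction with fairlearn + scikit-learn.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

df = pd.DataFrame({
    "income":   [30, 55, 42, 25, 60, 48, 33, 52],
    "tenure":   [2, 10, 5, 1, 12, 7, 3, 9],
    "gender":   ["F", "M", "F", "F", "M", "M", "F", "M"],  # sensitive attribute
    "approved": [0, 1, 1, 0, 1, 1, 0, 1],
})
X, y, A = df[["income", "tenure"]], df["approved"], df["gender"]

# The reduction wraps the base estimator and enforces the fairness constraint
# through a sequence of reweighted (cost-sensitive) fits.
mitigator = ExponentiatedGradient(LogisticRegression(), constraints=DemographicParity())
mitigator.fit(X, y, sensitive_features=A)
print(mitigator.predict(X))
```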
Model Analysis
Equalized Odds:
The greater of two metrics: true_positive_rate_difference and false_positive_rate_difference. The former is the difference between the largest and smallest of P[h(X) = 1|A = a, Y = 1], across all values 'a' of the sensitive feature(s). The latter is defined similarly, but for P[h(X) = 1|A = a, Y = 0]. The equalized odds difference of 0 means that all groups have the same true positive, true negative, false positive, and false negative rates.
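This is the definition used by fairlearn's equalized_odds_difference metric; the sketch below uses illustrative labels and predictions.

```python
# Sketch: equalized odds difference (max of TPR and FPR differences across groups).
from fairlearn.metrics import equalized_odds_difference

y_true    = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred    = [1, 0, 0, 1, 0, 1, 1, 0]
sensitive = ["F", "F", "F", "F", "M", "M", "M", "M"]

# 0 means all groups share the same true-positive and false-positive rates.
print(equalized_odds_difference(y_true, y_pred, sensitive_features=sensitive))
```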
Model Mitigation (Research)
Threshold optimizer:
Threshold Optimizer is based on the paper Equality of Opportunity in Supervised Learning and is built to satisfy the specified fairness criteria exactly, with no remaining disparity. It requires the sensitive features to be available at deployment time (i.e., for the predict method). For each sensitive-feature value, Threshold Optimizer creates a separate threshold and applies it to the predictions of the user-provided estimator. To decide on the thresholds, it generates all possible thresholds and selects the best combination in terms of the objective and the fairness constraints. The technique involves a large amount of randomness and is still in the research phase.
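A minimal sketch with fairlearn's ThresholdOptimizer, using an illustrative estimator, constraint, and dataset; argument details may vary slightly across fairlearn versions.

```python
# Sketch: post-processing with fairlearn's ThresholdOptimizer.
from sklearn.linear_model import LogisticRegression
from fairlearn.postprocessing import ThresholdOptimizer

X = [[30, 2], [55, 10], [42, 5], [25, 1], [60, 12], [48, 7], [33, 3], [52, 9]]
y = [0, 1, 1, 0, 1, 1, 0, 1]
A = ["F", "M", "F", "F", "M", "M", "F", "M"]  # sensitive feature

base = LogisticRegression().fit(X, y)

# Separate thresholds are learned per sensitive-feature value so that the
# equalized-odds constraint holds on the training data.
postprocessor = ThresholdOptimizer(
    estimator=base,
    constraints="equalized_odds",
    prefit=True,
    predict_method="predict_proba",
)
postprocessor.fit(X, y, sensitive_features=A)

# Sensitive features must also be supplied at prediction time.
print(postprocessor.predict(X, sensitive_features=A))
```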
Extending Basic evaluation mechanisms to useful tools
Generative AI use case evaluation and monitoring tool
Generative AI produces unstructured data. With capabilities like RAG, it is now also being used to make decisions in binary classification tasks. Given the breadth of use cases, we are working on two types of approaches.
Decisive use-case flow:
Decisive solutions, or classification solutions, are those where a decision is made, such as loan approvals, based on a set of features and rules. In these cases, feature importance scores are calculated for each decision and stored. The distribution of success rates and the populations of the sub-groups involved versus the categories are also recorded. Using periodic audits or live dashboards, the success-rate distribution is cross-checked to ensure fairness in the system. When there are indicators of bias, such as the system mostly favoring a single group or a few groups, the audit team cross-checks the results against the recorded explanations and verifies whether they are false alarms or actual bias in the system, which would trigger further investigation.
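A minimal sketch of such an audit, assuming a simple decision log with a group column and a hypothetical four-fifths style alert threshold.

```python
# Sketch: periodic audit of the success-rate distribution per sub-group.
# The decision log, group column, and alert threshold are illustrative.
import pandas as pd

decisions = pd.DataFrame({
    "group":    ["A", "A", "B", "B", "B", "C", "C", "A", "B", "C"],
    "approved": [1, 1, 0, 1, 0, 0, 0, 1, 0, 1],
})

rates = decisions.groupby("group")["approved"].mean()
overall = decisions["approved"].mean()

# Flag groups whose success rate falls below 80% of the overall rate; flagged
# groups go for manual review against the stored feature-importance
# explanations before concluding that actual bias is present.
flagged = rates[rates < 0.8 * overall]
print(rates, flagged, sep="\n")
```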


Generic use-case flow:
If the AI model is not involved in any decision making and instead generates text, images, or other content (e.g., summarization tasks), we will monitor the outputs using our prompt classification tool. This converts the unstructured output into structured output that can be used for fairness analysis, containing the bias analysis, indicators, affected groups, and type of bias; more details can be added based on use-case requirements. The bias distribution of the generated content can then be monitored and audited to keep the generated content in check. Since the bias analysis produced by the prompt template is key, we additionally recommend using chain-of-thought and chain-of-verification prompting to ensure the analysis is as accurate as possible.


Traditional AI use case evaluation and monitoring tool
We are working on a tool that combines all of these metrics with explainability features to give a complete analysis of the use case. Below are the key highlights of this tool.
Many to Many analysis: All categorical columns and their sub-classes are analyzed, and their distributions in terms of representation and success rates are provided to the user for acceptance.
Statistical Analysis of key features: Statistically study the dataset against the ground-truth column to understand the key columns that play a decisive role.
Explainability: Once the model is trained and predictions are made, use global and local explainability to identify the key features, and their weights, that contribute to the outcome.
Review the metrics, key feature weights, and data distribution for a comprehensive F&B analysis of the given use case (an explainability sketch follows below).
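A minimal sketch of the explainability step, assuming the SHAP library and an illustrative scikit-learn model and dataset.

```python
# Sketch: global and local explainability with SHAP for a trained model.
# Dataset and model are illustrative; any estimator works with a suitable explainer.
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Model-agnostic explainer over the prediction function; a subsample keeps it fast.
explainer = shap.Explainer(model.predict, X)
shap_values = explainer(X.iloc[:100])

shap.plots.bar(shap_values)           # global: mean |SHAP| weight per feature
shap.plots.waterfall(shap_values[0])  # local: one prediction's feature breakdown
```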
Proposed flow of actions:

Prompt classification Tool
We accumulate a huge number of prompts from open-source datasets, hackathons, synthetic generators, etc. They form the essential building block for benchmarking, fine-tuning of Small Language Models, and so on. We need to give structure to these prompts, i.e., capture their properties/features, to understand their overall distribution in a given dataset. The purpose of this tool is to create the respective features for each prompt and classify them, thereby converting the collection into structured data. We can then generate pivot tables and graphs (see the sketch below) to understand the over- and under-represented types of prompts, which helps to enhance both the benchmarking process and the fine-tuning of Small Language Models.
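A minimal sketch of this step with pandas, assuming prompts that have already been classified into illustrative bias_indicator and bias_type columns.

```python
# Sketch: turning classified prompts into pivot tables / distribution plots.
# The classified columns are the structured output of the prompt classification
# step; the values here are illustrative.
import pandas as pd

prompts = pd.DataFrame({
    "prompt":         ["p1", "p2", "p3", "p4", "p5", "p6"],
    "bias_indicator": ["High", "Low", "Neutral", "Medium", "High", "Low"],
    "bias_type":      ["Historical Bias", "Confirmation Bias", "None",
                       "Historical Bias", "Gender Bias", "None"],
})

# Pivot: how many prompts of each bias type fall into each severity bucket.
print(pd.crosstab(prompts["bias_type"], prompts["bias_indicator"]))

# Distribution plots of the kind shown in the sample graphs below.
prompts["bias_indicator"].value_counts().plot(kind="bar", title="Bias Indicator Distribution")
```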
Sample Graph - Bias Indicator Distribution:
The graph below provides insight into the dataset by showing the distribution of bias severity.

Sample Graph - Bias Type Distribution:
The graph below provides insight into the dataset by showing the distribution of bias types.

Review of Insurance policy documents using Agentic Framework [Use-case based]
Disclaimer:
The POC below was done on an experimental basis. The tool is designed to help the Legal team of the client [the insurance company], who can cross-check each identified possible bias and its closely associated local law, and call out action items on a need basis. The model generates a list of suspect passages from the policy document based on contextual overlap with the government's law; this still needs more fine-tuning and advanced techniques to be accurate.
This research is about reviewing a document such as an insurance policy and performing a bias evaluation on it. We are planning to implement the solution using an agentic framework, as it involves several components and decision making. Below are the steps planned to be included in the solution. The flow is planned to be sequential and will be enhanced on a need basis.
Current Implementation of Insurance Policy Fairness Analysis
Convert PDF to Markdown: We convert the PDF documents into Markdown format to better analyze the content while preserving the structure of the text and tables.
Chunking by Chapter: We create chunks of Markdown based on chapter names for comprehensive analysis.
PII Entity Detection: Each chunk is checked for any personally identifiable information (PII) entities.
LLM Processing: Each chunk, along with a detailed prompt template, is passed to the language model (LLM) for analysis.
Generate Bias Report: A detailed report is generated, summarizing any biases detected. (A minimal sketch of this pipeline follows the list.)
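The sketch below walks through the same steps; the specific libraries (pymupdf4llm for PDF-to-Markdown, Presidio for PII detection) and the chapter-splitting rule are assumptions for illustration, not necessarily the components used in the implementation.

```python
# Sketch of the policy-review pipeline (library choices are assumptions).
import re
import pymupdf4llm                              # assumed PDF -> Markdown converter
from presidio_analyzer import AnalyzerEngine    # assumed PII detector
from openai import OpenAI

client = OpenAI()
pii_analyzer = AnalyzerEngine()

markdown = pymupdf4llm.to_markdown("policy.pdf")        # 1. Convert PDF to Markdown

# 2. Chunk by chapter: split on top-level Markdown headings (illustrative rule).
chunks = re.split(r"\n(?=# )", markdown)

report = []
for chunk in chunks:
    pii = pii_analyzer.analyze(text=chunk, language="en")   # 3. PII entity detection
    analysis = client.chat.completions.create(              # 4. LLM processing
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Identify possible biases and related local-law "
                              "overlaps in this policy chapter:\n" + chunk}],
    ).choices[0].message.content
    report.append({"pii": [e.entity_type for e in pii], "bias": analysis})

# 5. `report` is then summarized into the detailed bias report.
```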
Zero-shot text classifier to detect stereotypes
facebook/bart-large-mnli, a zero-shot text classification model, has been fine-tuned to recognize and detect stereotype and non-stereotype sentences. This provides a non-LLM-based, budget-friendly option for customers to choose in our RAI toolkit. We have fine-tuned this model to classify a given unstructured text into the labels [Stereotype, Non-Stereotype, negated-Stereotype].
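Our fine-tuned checkpoint is internal, but the sketch below shows the base facebook/bart-large-mnli model used in zero-shot mode with the same label set, as a rough stand-in.

```python
# Sketch: zero-shot stereotype detection with the base facebook/bart-large-mnli.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "Women are not good at mathematics."
labels = ["Stereotype", "Non-Stereotype", "negated-Stereotype"]

result = classifier(text, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # top label and its score
```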
Small Language Model as F&B evaluator
With the evolution of small language models, we are attempting to fine-tune them to leverage their NLP capabilities, which form the foundation for F&B analysis. Unlike the foundation models, these models' generative capabilities could be leveraged to generate the decisions and their explanations as well. At present, due to the limited parameter size, the key challenge is performing a creative analysis that identifies even very minor bias present in the prompt. The prompt classification tool should help us with this fine-tuning process to achieve answers of a sophistication similar to what we get from LLMs like GPT-4o and Gemini. Models under research: [Phi-3-mini, TinyLlama, MobiLlama].
We followed the recipe below for fine-tuning Phi-3.5-mini (a minimal sketch follows the list):
Dataset Creation: We created a dataset for supervised fine-tuning with the assistance of GPT.
Model Quantization: The model was quantized to 4-bit using bitsandbytes to reduce its size.
Supervised Fine-Tuning: We performed supervised fine-tuning by freezing the majority of the parameters and training only the adapters for faster fine-tuning.
Hardware Used: The fine-tuning was conducted on an Azure VM with a 16 GB GPU, which provided the necessary computational resources for efficient training.
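A minimal sketch of this recipe using bitsandbytes 4-bit quantization, PEFT LoRA adapters, and TRL's SFTTrainer; the checkpoint id, dataset file, and hyperparameters are illustrative, and the exact trainer arguments vary across trl versions.

```python
# Sketch of the fine-tuning recipe: 4-bit quantization (bitsandbytes),
# LoRA adapters (peft), supervised fine-tuning (trl). All names and
# hyperparameters are illustrative, not the production values.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

model_id = "microsoft/Phi-3.5-mini-instruct"   # assumed checkpoint name

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Only the LoRA adapter weights are trained; the quantized base stays frozen.
lora = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                  task_type="CAUSAL_LM")

# Hypothetical GPT-assisted SFT dataset with a "text" field per sample.
dataset = load_dataset("json", data_files="bias_sft_dataset.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,
    args=SFTConfig(output_dir="phi35-bias-evaluator", num_train_epochs=1),
)
trainer.train()
```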
The Phi-3.5-mini-4k-instruct model was fine-tuned to generate bias analysis similar to GPT-4o's. It has limitations in detecting biases in long-context samples as well as very subtle, minute biases.
Limitations of Phi-3.5-mini that we have observed:
Ineffective Bias Analysis in Long Contexts: The model is less effective at providing bias analysis for long-context samples, owing to its limited training data compared with the hyperscalers' models.
Increased Response Time: The response time is slower than expected, which may impact user experience in real-time applications. It takes more than 10 seconds in a GPU environment and more than 30 seconds in a CPU environment.
Limited Generalization: The model may struggle with generalizing insights from less common or niche topics, resulting in incomplete analyses.
The cost of a GPU environment is higher than the cost of paying for tokens from API services for LLMs like GPT, which also give more accurate responses and support long contexts. We have put this research on hold, reserving it for specific use-case requests where we would customize it for a particular use case.
Open-source multimodal-based image analysis
Qwen2.5-VL - We tried quantizing this model to 6 GB and using it for image fairness evaluation. The accuracy of the model is too low, and its contextual analysis and object detection are very poor as well.
Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
Lack of Audio Support: The current model does not comprehend audio information within videos.
Reference: https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#limitations