Grafana and Elasticsearch Integration

This document provides a comprehensive technical overview of integrating Grafana with Elasticsearch for data visualization and monitoring purposes.

About Grafana

Grafana is an open-source analytics and interactive visualization web application that provides charts, graphs, and alerts for monitoring and observability. It connects to various data sources including Elasticsearch, allowing users to create rich, interactive dashboards for data analysis and real-time monitoring.

Elasticsearch Data Source Configuration

Elasticsearch Endpoint URL:

The primary connection point that defines where Grafana communicates with the Elasticsearch cluster. This URL serves as the gateway for all data retrieval operations and must be accessible from the Grafana instance.

Index Configuration:

The index name specification is critical as it determines which dataset Grafana will query. For time-based logging systems, indices often follow patterns like agentic-foundry-tool-log or use time-based naming conventions. The index name must exactly match the Elasticsearch configuration to ensure proper data retrieval.

Field Mapping Configuration:

Message Field Name: Specifies the field containing the primary log content or data payload within the Elasticsearch documents. This field typically contains the actual log messages, event descriptions, or data content that will be displayed in visualizations.
Level Field Name: Defines the field used for log severity classification, containing values such as INFO, WARN, ERROR, or DEBUG. This field enables filtering and aggregation based on event severity levels.

Visualization Capabilities

Panel Types and Data Representation

Grafana supports multiple visualization types when working with Elasticsearch data:

Time Series Visualizations:

Ideal for displaying log volumes, error rates, and performance metrics over time. These visualizations automatically handle time-based data from Elasticsearch indices.

Table Visualizations:

Provide tabular representation of log data, allowing users to view individual records, search through messages, and analyze detailed event information.

Graph Visualizations:

Enable trend analysis, comparative views, and statistical representations of data patterns extracted from Elasticsearch queries.

Query Processing and Data Retrieval

Grafana automatically constructs Elasticsearch queries based on the configured data source parameters. The system handles:

Lucene Query Syntax: Grafana translates user inputs into proper Elasticsearch query syntax
Aggregation Operations: Supports various aggregation types including terms, date histograms, and metric aggregations
Time Range Filtering: Automatically applies time-based filters to queries based on dashboard time selections

Advanced Dashboard Features

Dynamic Filtering and Variables

Template Variables: Enable dynamic dashboard behavior by creating parameterized queries. Common variable types include:

session_id: For tracking specific user sessions
action_id: For filtering based on specific actions or events
action_on: For categorizing actions by target objects or systems

Interactive Filtering: Dashboards support real-time filtering capabilities allowing users to:

Apply date range filters to narrow down time periods
Filter by severity levels (ERROR, DEBUG, INFO, WARN)
Search within specific fields or message content
Apply multiple filter conditions simultaneously

Severity-Based Data Segmentation

The dashboard architecture supports granular severity-based filtering with four primary classification levels:

Error Level: Captures critical system failures, exceptions, and error conditions that require immediate attention.

Debug Level: Contains detailed diagnostic information useful for troubleshooting and development purposes.

Info Level: Provides general informational messages about system operations and normal processing activities.

Warn Level: Indicates potential issues or unusual conditions that don't constitute errors but warrant monitoring.

Technical Architecture Benefits

Real-Time Data Processing

The Grafana-Elasticsearch integration provides near real-time data visualization capabilities, automatically refreshing dashboards as new data arrives in Elasticsearch indices. This enables continuous monitoring and immediate visibility into system behavior.

Scalability and Performance

The integration leverages Elasticsearch's distributed architecture and query optimization capabilities, allowing dashboards to handle large volumes of log data efficiently. Grafana's query caching and optimization features further enhance performance for frequently accessed visualizations.

Flexibility and Customization

The system supports extensive customization options including custom query builders, variable-driven dashboards, and conditional formatting. Users can create complex analytical views tailored to specific monitoring requirements and operational needs.

6. LLM Usage & Cost Monitoring Dashboard

A pre-built Grafana Dashboard JSON is available for monitoring LLM (Language Model) usage and costs across the platform. The dashboard is organized into two main sections:

LLM Call Analysis

Tracks all individual API calls made to language models, sourced from the token_usage_logs table.

Metric	Description
Total Cost	Aggregate cost across all LLM API calls
Total Tokens Used	Combined prompt, completion, and cached token counts
Performance by Agent	Cost and token usage broken down per agent
Performance by Model	Cost and token usage broken down per model
Detailed Call Logs	Per-call records with timestamps, tokens, and cost

Query Level Analysis

Monitors user queries and their aggregate LLM costs, sourced from the query_token_usage table.

Metric	Description
Avg LLM Calls per Query	Average number of model calls triggered per user query
Cost per Query	Average and total cost at the query level
Query-Level Token Usage	Token consumption aggregated per query

Available Filters:

Agent — Filter metrics by a specific agent
Model — Filter by LLM model name
Status — Filter by query or call status
User ID — Filter metrics for a specific user