Grafana and Elasticsearch Integration
This document provides a comprehensive technical overview of integrating Grafana with Elasticsearch for data visualization and monitoring purposes.
About Grafana
Grafana is an open-source analytics and interactive visualization web application that provides charts, graphs, and alerts for monitoring and observability. It connects to various data sources including Elasticsearch, allowing users to create rich, interactive dashboards for data analysis and real-time monitoring.
Elasticsearch Data Source Configuration
Elasticsearch Endpoint URL:
The base URL (for example, http://localhost:9200) that Grafana uses to reach the Elasticsearch cluster. All data retrieval requests are sent to this URL, so it must be reachable from the Grafana server.
Index Configuration:
The index name determines which dataset Grafana queries, so it is critical to get right. Indices may use a fixed name such as agentic-foundry-tool-log, or a time-based naming convention (for example, one index per day). The configured name or pattern must match the indices that actually exist in Elasticsearch. Otherwise, Grafana cannot retrieve any data.
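The sketch below illustrates a time-based convention by generating one index name per day for a date range. It assumes a hypothetical daily scheme `<prefix>-YYYY.MM.DD`, and the prefix `app-logs` is made up. In Grafana, such a scheme is usually entered as a pattern like `[app-logs-]YYYY.MM.DD` with the pattern interval set to Daily, or matched with a wildcard such as `app-logs-*`. Check the data source settings of your Grafana version.

```python
from datetime import date, timedelta


def daily_index_names(prefix: str, start: date, end: date) -> list[str]:
    """Return one index name per day from start to end (inclusive).

    Assumes a hypothetical daily naming scheme "<prefix>-YYYY.MM.DD".
    """
    if end < start:
        raise ValueError("end must not be before start")
    names = []
    day = start
    while day <= end:
        # strftime yields zero-padded fields, e.g. "2024.03.09"
        names.append(f"{prefix}-{day.strftime('%Y.%m.%d')}")
        day += timedelta(days=1)
    return names


# A three-day range crossing a month boundary spans three daily indices.
print(daily_index_names("app-logs", date(2024, 3, 30), date(2024, 4, 1)))
```

A fixed index name such as agentic-foundry-tool-log needs none of this. The pattern only matters when data is split across many dated indices.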
Field Mapping Configuration:
- Message Field Name: Specifies the field containing the primary log content or data payload within the Elasticsearch documents. This field typically contains the actual log messages, event descriptions, or data content that will be displayed in visualizations.
- Level Field Name: Defines the field used for log severity classification, containing values such as INFO, WARN, ERROR, or DEBUG. This field enables filtering and aggregation based on event severity levels.
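The settings above can be expressed as a single data source definition. The sketch below builds one as a Python dict. It assumes Grafana's provisioning key names as we understand them: `jsonData.index`, `timeField`, `logMessageField`, and `logLevelField`. Older Grafana versions stored the index in a top-level `database` field instead. The endpoint, the field names `message`, `level`, and `@timestamp`, and the display name are all hypothetical.

```python
import json


def build_es_datasource(url: str, index: str, message_field: str,
                        level_field: str, time_field: str = "@timestamp") -> dict:
    """Build a Grafana Elasticsearch data source definition (sketch).

    Key names follow Grafana's provisioning format as we understand it;
    verify them against the docs for your Grafana version.
    """
    return {
        "name": "Elasticsearch Logs",   # hypothetical display name
        "type": "elasticsearch",
        "access": "proxy",              # the Grafana server makes the requests
        "url": url,                     # must be reachable from the Grafana server
        "jsonData": {
            "index": index,             # older versions: top-level "database"
            "timeField": time_field,    # assumed time field name
            "logMessageField": message_field,
            "logLevelField": level_field,
        },
    }


payload = build_es_datasource(
    url="http://elasticsearch:9200",    # hypothetical endpoint
    index="agentic-foundry-tool-log",
    message_field="message",            # hypothetical field names
    level_field="level",
)
print(json.dumps(payload, indent=2))
```

The resulting JSON could be sent to Grafana's `POST /api/datasources` HTTP API or written as YAML in a provisioning file. Confirm the exact keys against your version's documentation before relying on it.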
Visualization Capabilities
Panel Types and Data Representation
Grafana supports multiple visualization types when working with Elasticsearch data:
Time Series Visualizations:
Ideal for displaying log volumes, error rates, and performance metrics over time. These visualizations plot the buckets of a date histogram aggregation across the dashboard's selected time range.
Table Visualizations:
Provide tabular representation of log data, allowing users to view individual records, search through messages, and analyze detailed event information.
Graph Visualizations:
Enable trend analysis, comparative views, and statistical representations of data patterns extracted from Elasticsearch queries.
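Time series panels are typically fed by an Elasticsearch `date_histogram` aggregation. The sketch below shows the shape of the buckets in a search response and how they map to plot points. Each `key` is the bucket start in epoch milliseconds, and each `doc_count` is the number of matching documents. The aggregation name `logs_over_time` and the sample values are hypothetical; Grafana performs this conversion internally.

```python
from datetime import datetime, timezone


def buckets_to_series(response: dict, agg_name: str) -> list[tuple[datetime, int]]:
    """Turn date_histogram buckets into (timestamp, count) points."""
    buckets = response["aggregations"][agg_name]["buckets"]
    # Each bucket "key" is the bucket start as epoch milliseconds (UTC).
    return [
        (datetime.fromtimestamp(b["key"] / 1000, tz=timezone.utc), b["doc_count"])
        for b in buckets
    ]


# Hypothetical response fragment with two one-minute buckets.
sample = {
    "aggregations": {
        "logs_over_time": {
            "buckets": [
                {"key": 1700000000000, "doc_count": 42},
                {"key": 1700000060000, "doc_count": 7},
            ]
        }
    }
}
for ts, count in buckets_to_series(sample, "logs_over_time"):
    print(ts.isoformat(), count)
```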
Query Processing and Data Retrieval
Grafana builds each Elasticsearch request from the data source settings (such as the index and time field) and the inputs in each panel's query editor. This covers:
- Lucene Query Syntax: The query field accepts Lucene syntax (for example, level:ERROR), which Grafana embeds in the request body sent to Elasticsearch
- Aggregation Operations: Supports various aggregation types including terms, date histograms, and metric aggregations
- Time Range Filtering: Automatically applies time-based filters to queries based on dashboard time selections
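The three pieces listed above fit together in one search body. The following minimal sketch builds such a body. It is similar in spirit to what Grafana sends, but it is not Grafana's exact output:

- The Lucene string goes into a `query_string` clause.
- The dashboard time range becomes a `range` filter.
- A `date_histogram` aggregation buckets the matches.

The sketch assumes a time field named `@timestamp` and Elasticsearch 7.2 or later, which is required for `fixed_interval`.

```python
import json


def build_log_count_query(lucene: str, start_ms: int, end_ms: int,
                          time_field: str = "@timestamp",
                          interval: str = "1m") -> dict:
    """Sketch of a search body: Lucene filter + time range + date histogram."""
    return {
        "size": 0,  # only aggregations are needed, not the raw hits
        "query": {
            "bool": {
                "filter": [
                    # Time range from the dashboard picker, as epoch milliseconds.
                    {"range": {time_field: {"gte": start_ms, "lte": end_ms,
                                            "format": "epoch_millis"}}},
                    # The Lucene string typed into the query editor.
                    {"query_string": {"query": lucene, "analyze_wildcard": True}},
                ]
            }
        },
        "aggs": {
            "logs_over_time": {
                "date_histogram": {
                    "field": time_field,
                    "fixed_interval": interval,
                    # Keep empty buckets so gaps between data points show as zeros.
                    "min_doc_count": 0,
                }
            }
        },
    }


body = build_log_count_query("level:ERROR", 1700000000000, 1700003600000)
print(json.dumps(body, indent=2))
```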
Advanced Dashboard Features
Dynamic Filtering and Variables
Template Variables: Enable dynamic dashboard behavior by parameterizing queries. Common variables include:
- session_id: For tracking specific user sessions
- action_id: For filtering by specific actions or events
- action_on: For categorizing actions by target object or system
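For a variable such as session_id, Grafana's Elasticsearch data source accepts a JSON variable query of the form `{"find": "terms", "field": ...}`, which lists the distinct values of a field. The optional `size` key is assumed here. The sketch below builds that query string. Separately, it shows how a multi-value selection can be rendered in Lucene as `("a" OR "b")`, which matches what Grafana's `lucene` variable format produces. The helper `lucene_multi_value` is hypothetical and written only for illustration. Grafana does its own interpolation when a panel query contains `$session_id`.

```python
import json


def terms_variable_query(field: str, size: int = 100) -> str:
    """JSON query string for a Grafana Elasticsearch query variable."""
    return json.dumps({"find": "terms", "field": field, "size": size})


def lucene_multi_value(values: list[str]) -> str:
    """Hypothetical helper mirroring the lucene format: ("a" OR "b")."""
    # Escape backslashes and double quotes so each value stays one quoted phrase.
    quoted = ['"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"' for v in values]
    return "(" + " OR ".join(quoted) + ")"


print(terms_variable_query("session_id"))
# A panel query such as  session_id:$session_id  expands roughly like this:
print("session_id:" + lucene_multi_value(["abc123", "def456"]))
```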
Interactive Filtering: Dashboards support real-time filtering capabilities allowing users to:
- Apply date range filters to narrow down time periods
- Filter by severity levels (ERROR, DEBUG, INFO, WARN)
- Search within specific fields or message content
- Apply multiple filter conditions simultaneously
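Applying several filters at once maps naturally onto an Elasticsearch `bool` query whose `filter` clauses must all match. The sketch below combines a date range, a set of severity levels, and a phrase search on the message field. The field names `@timestamp`, `level`, and `message` are assumptions. With default dynamic mapping, exact-value `terms` filters usually need the `.keyword` sub-field.

```python
from typing import Optional


def build_filtered_query(start: str, end: str, levels: list[str],
                         text: Optional[str] = None,
                         level_field: str = "level",
                         message_field: str = "message",
                         time_field: str = "@timestamp") -> dict:
    """Combine several dashboard filters into one bool query (sketch).

    Every clause sits in filter context, so all conditions must match
    and none of them affects relevance scoring.
    """
    clauses = [
        {"range": {time_field: {"gte": start, "lte": end}}},
    ]
    if levels:
        # terms matches any listed value; on a text field, target a keyword
        # sub-field such as "level.keyword" instead.
        clauses.append({"terms": {level_field: levels}})
    if text:
        clauses.append({"match_phrase": {message_field: text}})
    return {"query": {"bool": {"filter": clauses}}}


q = build_filtered_query("2024-03-01T00:00:00Z", "2024-03-02T00:00:00Z",
                         ["ERROR", "WARN"], text="connection timeout")
print(q)
```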
Severity-Based Data Segmentation
The dashboard architecture supports granular severity-based filtering with four primary classification levels:
Error Level: Captures critical system failures, exceptions, and error conditions that require immediate attention.
Debug Level: Contains detailed diagnostic information useful for troubleshooting and development purposes.
Info Level: Provides general informational messages about system operations and normal processing activities.
Warn Level: Indicates potential issues or unusual conditions that don't constitute errors but warrant monitoring.
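The sketch below shows how the four levels can be counted. It defines a `terms` aggregation on the level field and folds the response into a fixed ERROR/WARN/INFO/DEBUG map, filling in zero for any level with no documents. It assumes a keyword field named `level.keyword` and an aggregation named `by_level`, both hypothetical. It also upper-cases keys in case the data mixes `error` and `ERROR`.

```python
SEVERITY_LEVELS = ("ERROR", "WARN", "INFO", "DEBUG")

# Request body: count documents per distinct level value (no hits returned).
level_agg_request = {
    "size": 0,
    "aggs": {"by_level": {"terms": {"field": "level.keyword", "size": 10}}},
}


def severity_counts(response: dict, agg_name: str = "by_level") -> dict[str, int]:
    """Map each of the four levels to its document count (0 if absent)."""
    counts = {level: 0 for level in SEVERITY_LEVELS}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        key = str(bucket["key"]).upper()  # normalise e.g. "error" -> "ERROR"
        if key in counts:
            counts[key] += bucket["doc_count"]
    return counts


# Hypothetical response: no DEBUG documents in the selected range.
sample = {"aggregations": {"by_level": {"buckets": [
    {"key": "INFO", "doc_count": 120},
    {"key": "error", "doc_count": 3},
    {"key": "WARN", "doc_count": 9},
]}}}
print(severity_counts(sample))
```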
Technical Architecture Benefits
Real-Time Data Processing
The Grafana-Elasticsearch integration provides near real-time visualization. Dashboards can auto-refresh at a configured interval (for example, every 30 seconds), so newly indexed documents appear shortly after they arrive in Elasticsearch. This enables continuous monitoring and timely visibility into system behavior.
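Auto-refresh is controlled by the dashboard JSON. The fragment below uses the `refresh` and `time` fields of Grafana's dashboard model. It also includes a small hypothetical helper that estimates how often panels re-query Elasticsearch. The estimate is deliberately rough: it assumes each of eight hypothetical panels issues its queries once per refresh, for each open dashboard.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def refresh_seconds(interval: str) -> int:
    """Parse a refresh interval such as "30s", "1m" or "1h" into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", interval)
    if not match:
        raise ValueError(f"unsupported interval: {interval!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


dashboard_fragment = {
    "refresh": "30s",                          # auto-refresh interval
    "time": {"from": "now-15m", "to": "now"},  # rolling 15-minute window
}

# Rough load estimate: each panel re-runs its queries on every refresh.
panels = 8  # hypothetical panel count
per_minute = panels * 60 / refresh_seconds(dashboard_fragment["refresh"])
print(f"~{per_minute:.0f} panel refreshes per minute per open dashboard")
```

Shorter intervals give fresher data but multiply the query load on the cluster, so the refresh interval is a trade-off between freshness and load.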
Scalability and Performance
The integration leverages Elasticsearch's distributed architecture, including sharding and parallel query execution across nodes, allowing dashboards to handle large volumes of log data efficiently. Query caching, available in Grafana Enterprise and Grafana Cloud, can further reduce load and latency for frequently viewed panels.
Flexibility and Customization
The system supports extensive customization options including custom query builders, variable-driven dashboards, and conditional formatting. Users can create complex analytical views tailored to specific monitoring requirements and operational needs.
6. LLM Usage & Cost Monitoring Dashboard
A pre-built Grafana dashboard JSON is available for monitoring LLM (large language model) usage and costs across the platform. The dashboard is organized into two main sections:
LLM Call Analysis
Tracks all individual API calls made to language models, sourced from the token_usage_logs table.
| Metric | Description |
|---|---|
| Total Cost | Aggregate cost across all LLM API calls |
| Total Tokens Used | Combined prompt, completion, and cached token counts |
| Performance by Agent | Cost and token usage broken down per agent |
| Performance by Model | Cost and token usage broken down per model |
| Detailed Call Logs | Per-call records with timestamps, tokens, and cost |
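The panels in this section are driven by aggregate SQL over token_usage_logs. The sketch below reproduces the "Performance by Agent" aggregation against an in-memory SQLite table so that it runs standalone. The column names (`agent_name`, `model_name`, `prompt_tokens`, `completion_tokens`, `cached_tokens`, `cost`) and the sample rows are hypothetical. The real database engine behind the table is not specified here. In a Grafana SQL panel, the same query would normally also restrict rows to the dashboard time range, for example with the `$__timeFilter(...)` macro on a timestamp column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE token_usage_logs (
        agent_name TEXT, model_name TEXT,
        prompt_tokens INTEGER, completion_tokens INTEGER, cached_tokens INTEGER,
        cost REAL
    )
""")
conn.executemany(
    "INSERT INTO token_usage_logs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("planner", "model-a", 1200, 300, 0, 0.012),
        ("planner", "model-b", 800, 200, 100, 0.004),
        ("executor", "model-a", 500, 150, 50, 0.006),
    ],
)

# "Performance by Agent": call count, combined tokens and cost per agent.
PER_AGENT_SQL = """
    SELECT agent_name,
           COUNT(*)                                               AS calls,
           SUM(prompt_tokens + completion_tokens + cached_tokens) AS total_tokens,
           ROUND(SUM(cost), 6)                                    AS total_cost
    FROM token_usage_logs
    GROUP BY agent_name
    ORDER BY total_cost DESC
"""
for row in conn.execute(PER_AGENT_SQL):
    print(row)
```

Swapping `agent_name` for `model_name` in the `SELECT` and `GROUP BY` gives the "Performance by Model" view.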
Query Level Analysis
Monitors user queries and their aggregate LLM costs, sourced from the query_token_usage table.
| Metric | Description |
|---|---|
| Avg LLM Calls per Query | Average number of model calls triggered per user query |
| Cost per Query | Average and total cost at the query level |
| Query-Level Token Usage | Token consumption aggregated per query |
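Query-level metrics follow the same pattern over query_token_usage. This sketch assumes, hypothetically, that each row represents one user query and stores its LLM call count, token total, and cost. If the real table stores per-call rows instead, the averages would need a `GROUP BY` on the query identifier first. Column names and sample values are made up, and SQLite is used only so the example runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE query_token_usage ("
    " query_id TEXT, llm_calls INTEGER, total_tokens INTEGER, total_cost REAL)"
)
conn.executemany(
    "INSERT INTO query_token_usage VALUES (?, ?, ?, ?)",
    [("q1", 3, 4000, 0.020), ("q2", 5, 9000, 0.045), ("q3", 1, 1200, 0.004)],
)

# One row per query, so plain AVG/SUM give the per-query metrics directly.
SUMMARY_SQL = """
    SELECT AVG(llm_calls)    AS avg_calls_per_query,
           AVG(total_cost)   AS avg_cost_per_query,
           SUM(total_cost)   AS total_cost,
           AVG(total_tokens) AS avg_tokens_per_query
    FROM query_token_usage
"""
avg_calls, avg_cost, total_cost, avg_tokens = conn.execute(SUMMARY_SQL).fetchone()
print(f"{avg_calls:.2f} calls/query, ${avg_cost:.4f} avg, ${total_cost:.4f} total")
```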
Available Filters:
- Agent: Filter metrics by a specific agent
- Model: Filter by LLM model name
- Status: Filter by query or call status
- User ID: Filter metrics for a specific user
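Each filter is optional, so the queries behind these panels must treat an unselected filter (Grafana's "All" option) as "no restriction". One common SQL idiom for this is `(:param IS NULL OR column = :param)`, sketched below with SQLite named parameters. The column names and the use of `None` to mean "All" are assumptions. Grafana SQL panels can also expand multi-value variables into an `IN (...)` list.

```python
import sqlite3
from typing import Optional

FILTERED_SQL = """
    SELECT COUNT(*), COALESCE(SUM(cost), 0)
    FROM token_usage_logs
    WHERE (:agent  IS NULL OR agent_name = :agent)
      AND (:model  IS NULL OR model_name = :model)
      AND (:status IS NULL OR status     = :status)
      AND (:user   IS NULL OR user_id    = :user)
"""


def filtered_totals(conn: sqlite3.Connection, agent: Optional[str] = None,
                    model: Optional[str] = None, status: Optional[str] = None,
                    user: Optional[str] = None) -> tuple:
    """Return (call count, total cost); None means "All" for that filter."""
    params = {"agent": agent, "model": model, "status": status, "user": user}
    return conn.execute(FILTERED_SQL, params).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token_usage_logs (agent_name TEXT, model_name TEXT,"
             " status TEXT, user_id TEXT, cost REAL)")
conn.executemany("INSERT INTO token_usage_logs VALUES (?, ?, ?, ?, ?)", [
    ("planner", "model-a", "success", "u1", 0.010),
    ("planner", "model-b", "error", "u2", 0.002),
    ("executor", "model-a", "success", "u1", 0.005),
])
print(filtered_totals(conn))                              # no filters applied
print(filtered_totals(conn, agent="planner", user="u1"))  # two filters combined
```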