Caliber Documentation

The comprehensive A-Z guide for configuring, monitoring, and evaluating your AI agents with Caliber.

1. What does Caliber do?

Caliber is a specialized AI evaluation and monitoring platform. As Large Language Models (LLMs) and autonomous agents become deeply integrated into software systems, tracking their real-world performance becomes difficult.

Caliber provides a centralized dashboard to observe AI payload latencies, track accuracy scores, flag toxic responses, and strictly monitor regressions involving Personally Identifiable Information (PII). It gives engineers the analytic power to know *exactly* how their agents behave over time.

2. Core Concepts

Evaluations (Evals)

An "Eval" is a single recorded interaction between your system and an AI model. It logs the prompt, the model's response, execution time, and qualitative metrics generated by our analyzers.
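Concretely, an Eval record might look like the sketch below. Only the prompt, response, execution time, and analyzer-generated metrics are documented; the exact field names here are illustrative assumptions, not Caliber's actual schema.

```javascript
// A hypothetical Eval record. Field names beyond prompt, response,
// latency, and scores are assumptions for illustration.
const evalRecord = {
  prompt: "What is the capital of France?",
  response: "The capital of France is Paris.",
  latency_ms: 420,            // execution time
  scores: {                   // qualitative metrics from the analyzers
    accuracy: 0.97,
    safety: 1.0,
    pii_isolation: 1.0
  },
  category: "Q&A",
  created_at: "2024-05-01T12:00:00Z"
};
```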

Metrics Engine

We automatically score each Eval across multiple dimensions: Accuracy (factual correctness), Latency (speed), Safety (toxicity checks), and PII Isolation (ensuring no protected user data is leaked).
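As a rough sketch, downstream logic could compare these per-dimension scores against minimum thresholds. The `flagFailures` helper and the threshold values below are hypothetical, not part of Caliber's API.

```javascript
// Hypothetical helper: return the metric dimensions whose scores fall
// below configured minimums. Threshold values are illustrative only.
const thresholds = { accuracy: 0.8, safety: 0.9, pii_isolation: 1.0 };

function flagFailures(scores, mins) {
  return Object.keys(mins).filter((dim) => scores[dim] < mins[dim]);
}

const failing = flagFailures(
  { accuracy: 0.95, safety: 0.85, pii_isolation: 1.0 },
  thresholds
);
// failing → ["safety"]
```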

3. Platform Features

Configuration & Setup

Navigate to the Config page to establish your environment variables, target models, and safety thresholds. This tells Caliber how strictly to grade your specific application's outputs.
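A configuration payload along these lines might capture those settings. The structure below is a hypothetical sketch of the documented concepts (environment, target models, safety thresholds), not Caliber's actual config schema.

```javascript
// Hypothetical configuration object. All field names are assumptions
// chosen to mirror the settings described above.
const caliberConfig = {
  environment: "production",
  targetModels: ["gpt-4-turbo", "claude-3-opus"],
  thresholds: {
    maxLatencyMs: 2000,   // flag Evals slower than this
    minAccuracy: 0.85,    // minimum acceptable factual-correctness score
    maxToxicity: 0.1,     // safety ceiling
    allowPii: false       // PII isolation is strict
  }
};
```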

Data Portability (CSV/JSON Export)

Found an anomaly? In the Evaluations table, you can export your filtered view to CSV or JSON formats. This portability allows your data science teams to perform deeper offline analysis or build custom historical reports.
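For instance, an exported JSON payload could be flattened into CSV for offline tooling. The `toCsv` helper below is a minimal sketch, assuming the export is an array of flat records; quoting is simplified for illustration.

```javascript
// Minimal sketch: convert an array of exported Eval records (assumed
// flat) into a CSV string with quoted fields.
function toCsv(rows) {
  const headers = Object.keys(rows[0]);
  const escape = (v) => `"${String(v).replace(/"/g, '""')}"`;
  const lines = rows.map((r) => headers.map((h) => escape(r[h])).join(","));
  return [headers.join(","), ...lines].join("\n");
}

const csv = toCsv([
  { prompt: "Summarize.", latency_ms: 850, accuracy: 0.92 },
  { prompt: "Classify.", latency_ms: 310, accuracy: 0.88 }
]);
// First line: prompt,latency_ms,accuracy
```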

Granular Search & Filtering

The Evaluations ledger features robust multi-parameter filtering. Drill down by date ranges, score thresholds, or specifically isolate categories to pinpoint exactly where models are failing.
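Conceptually, multi-parameter filtering combines independent predicates. The sketch below mirrors that idea client-side; the filter shape is hypothetical, not a Caliber API.

```javascript
// Hypothetical filter mirroring the ledger's controls: date range,
// minimum score, and category. Unset parameters are skipped.
function filterEvals(evals, { from, to, minAccuracy, category }) {
  return evals.filter((e) =>
    (!from || e.created_at >= from) &&
    (!to || e.created_at <= to) &&
    (minAccuracy === undefined || e.scores.accuracy >= minAccuracy) &&
    (!category || e.category === category)
  );
}

const hits = filterEvals(
  [
    { created_at: "2024-05-01", scores: { accuracy: 0.9 }, category: "Summarization" },
    { created_at: "2024-05-02", scores: { accuracy: 0.6 }, category: "Summarization" }
  ],
  { from: "2024-05-01", minAccuracy: 0.8, category: "Summarization" }
);
// hits.length → 1
```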

Role-Based UI (Admin vs. Viewer)

Caliber enforces strict visual and functional boundaries depending on the user's role. Admins have full write-access to system configs and manual evaluation creation. Viewers are bound to read-only states, preventing accidental or malicious configuration overrides.
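The role boundary can be thought of as a permission table. The mapping below is an illustrative sketch of the Admin/Viewer split described above, not Caliber's internal access model.

```javascript
// Illustrative permission map for the two documented roles.
const PERMISSIONS = {
  admin:  { readEvals: true, writeConfig: true,  createEval: true },
  viewer: { readEvals: true, writeConfig: false, createEval: false }
};

// Check whether a role may perform an action; unknown roles get nothing.
function can(role, action) {
  return Boolean(PERMISSIONS[role] && PERMISSIONS[role][action]);
}
// can("viewer", "writeConfig") → false
```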

Trend Analysis Dashboard

The Dashboard aggregates metric streams into rich, chronological charts. Visualize latency spikes over a 30-day window or track the success rate of your PII-redaction models in a clean, visual format.
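Under the hood, a chart like the 30-day latency view boils down to bucketing Evals by day and averaging. A minimal sketch of that aggregation (assumed for illustration, not Caliber's actual query):

```javascript
// Sketch: bucket Evals by calendar day and average their latency,
// the kind of series a 30-day latency chart would plot.
function dailyAvgLatency(evals) {
  const buckets = {};
  for (const e of evals) {
    const day = e.created_at.slice(0, 10); // "YYYY-MM-DD"
    (buckets[day] ||= []).push(e.latency_ms);
  }
  return Object.fromEntries(
    Object.entries(buckets).map(([day, ms]) => [
      day,
      ms.reduce((a, b) => a + b, 0) / ms.length
    ])
  );
}

const series = dailyAvgLatency([
  { created_at: "2024-05-01T10:00:00Z", latency_ms: 800 },
  { created_at: "2024-05-01T11:00:00Z", latency_ms: 600 },
  { created_at: "2024-05-02T09:00:00Z", latency_ms: 500 }
]);
// series["2024-05-01"] → 700
```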

4. How to Ingest Data

To push data into Caliber, have your backend services call our ingestion endpoint after each AI transaction completes:

// POST /api/evals/ingest
const res = await fetch('https://caliber-ai.vercel.app/api/evals/ingest', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: "Summarize the user profile.",
    response: "The user is John Doe, residing in NYC.",
    latency_ms: 850,
    category: "Summarization",
    metadata: { model: "gpt-4-turbo" }
  })
});
if (!res.ok) throw new Error(`Ingestion failed with status ${res.status}`);

The ingestion engine automatically calculates PII risk factors, assigns accuracy benchmarks based on your configuration parameters, and surfaces the data on your Dashboard in real time.