So far, we’ve explored models, agents, LLM applications, and their use cases. When moving toward production, however, additional concerns arise: performance, regulatory requirements, scalable deployment, and ongoing monitoring.
This chapter focuses on evaluation and observability, covering key aspects of governance and lifecycle management for operational AI systems, including generative models. Offline evaluation helps assess model capabilities in controlled settings, while production observability provides continuous insight into real-world performance. Together, they support reliable and effective LLM operation across the model lifecycle.
We introduce tools and examples for both evaluation and observability, and we also cover deployment strategies for LLM-based applications. This includes an overview of common tooling and practical deployment examples using FastAPI and Ray Serve.
We begin with an introduction to MLOps for LLMs and generative models, outlining what it is and what it includes.
Introduction
As discussed throughout this tutorial, large language models (LLMs) have gained widespread adoption due to their ability to generate human-like text across use cases such as creative writing, chatbots, and decision support. Moving these models from research to real-world production, however, introduces significant technical, operational, and ethical challenges.
This chapter focuses on productionizing generative AI responsibly. We cover practical considerations including inference and serving requirements, optimization techniques, and critical concerns such as data quality, bias, transparency, and compliance. At scale, architectural and infrastructure decisions directly impact reliability, cost, and user experience, while rigorous testing, auditing, and ethical safeguards are essential for trustworthy deployment.
Deploying applications composed of LLMs, agents, and tools introduces several key challenges:
- Data Quality and Bias: Training data can embed biases that surface in outputs. Careful data curation and output monitoring are essential.
- Ethical and Compliance Considerations: LLMs may generate harmful or misleading content. Review processes, safety guidelines, and regulatory compliance (e.g., HIPAA in healthcare) are required.
- Resource Requirements: Training and serving LLMs demand substantial compute resources, making efficient infrastructure critical.
- Drift and Performance Degradation: Continuous monitoring is needed to detect data drift and declining performance.
- Lack of Interpretability: LLMs often behave as black boxes, requiring interpretability tools to improve transparency.
Taking an LLM into production requires careful planning around scalability, monitoring, testing, and unintended behaviors. Techniques such as fine-tuning, safety interventions, and defensive design help create applications that are helpful, harmless, and honest. With appropriate safeguards, generative AI has the potential to deliver substantial value across industries including healthcare, education, and customer service.
Several recurring patterns help address these challenges:
- Evaluations: Task-appropriate benchmarks and metrics are essential to measure capabilities, regressions, and alignment.
- Retrieval Augmentation: External knowledge reduces hallucinations and provides up-to-date context.
- Fine-tuning: Task-specific tuning improves performance, with methods such as adapters reducing overhead.
- Caching: Reusing outputs lowers latency and cost, though cache validity must be managed carefully.
- Guardrails: Syntactic and semantic validation improves reliability and output structure.
- Defensive UX: Interfaces should anticipate inaccuracies through disclaimers, attribution, and user feedback.
- Monitoring: Tracking metrics, behaviors, and user satisfaction enables early detection of issues.
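The caching pattern above is simple to prototype. Below is a minimal, illustrative sketch of an exact-match prompt cache; fake_llm is a hypothetical stand-in for a real model call, and real systems add TTLs and semantic (embedding-based) matching on top of this idea.

```python
import hashlib

class PromptCache:
    """Minimal exact-match cache for LLM outputs, keyed on a hash of the prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt: str, generate) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = generate(prompt)
        self._store[key] = result
        return result

# Hypothetical stand-in for a real (expensive) LLM call:
def fake_llm(prompt: str) -> str:
    return f"answer to: {prompt}"

cache = PromptCache()
cache.get_or_generate("What is MLOps?", fake_llm)
cache.get_or_generate("What is MLOps?", fake_llm)  # second call is served from cache
```

Note the cache-validity caveat from the list above: exact-match caching is only safe when the same prompt should always yield the same answer, which rules out sampled (high-temperature) generations.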
Chapter 5 introduced safety-aligned approaches such as Constitutional AI. Ethical guidelines and human review processes remain critical to prevent the dissemination of harmful or misleading content. Continuous evaluation is required not only for legal and reputational reasons, but also to detect data drift and capability loss.
LLMs are computationally intensive, often requiring GPUs or TPUs for deployment. As model size grows, training and inference costs increase significantly, making distributed techniques such as data and model parallelism essential. Efficient storage, retrieval, and inference optimization—through compression, quantization, or hardware-specific tuning—are also key considerations (some discussed in Chapter 8).
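To make the quantization idea concrete, here is a toy sketch of symmetric int8 post-training quantization on a handful of weights. Production toolchains (per-channel scales, calibration, hardware-specific kernels) are far more involved; this shows only the core arithmetic.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid scale of 0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.03, -1.27, 0.5, 0.98]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by half the scale:
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing the int8 values plus a single float scale cuts memory roughly 4x versus float32, at the cost of the bounded rounding error computed above.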
Interpretability remains especially important in high-stakes domains such as healthcare, finance, and law. Techniques like attention visualization, feature attribution, and explanation generation help improve transparency and accountability.
With thoughtful preparation, generative AI can transform many domains. This chapter serves as a practical guide to the missing pieces needed to build impactful and responsible LLM applications, covering data curation, model development, infrastructure, monitoring, and transparency. Before continuing, we introduce key terminology.
Terminology
MLOps focuses on reliably and efficiently deploying and maintaining machine learning models in production, combining DevOps practices with machine learning workflows to meet business and regulatory requirements.
LLMOps is a specialized subset of MLOps addressing the unique challenges of fine-tuning and deploying large language models, such as models with hundreds of billions of parameters.
LMOps is a broader term encompassing operational practices for both large and smaller language models, reflecting the expanding landscape of generative models.
FOMO (Foundational Model Orchestration) highlights challenges specific to foundational models, including multi-step workflows, orchestration, and integration with external tools.
ModelOps emphasizes governance and lifecycle management of deployed AI and decision models.
AgentOps extends these ideas to autonomous or semi-autonomous agents, focusing on behavior management, tool usage, environment control, and coordination between agents.
While these terms reflect rapid evolution in the field, MLOps remains the most established and widely adopted concept. For consistency, we use MLOps throughout the remainder of this chapter.
Before productionizing any model or agent, evaluation is the first step. We therefore begin with evaluation methods, focusing on those provided by LangChain.
How to evaluate your LLM apps?
Evaluating LLMs—either standalone or as part of an agent or chain—is a critical step in the machine learning lifecycle. Evaluation ensures models behave correctly and produce reliable, efficient, and useful outputs. The goal is to understand strengths and weaknesses, improve accuracy, reduce errors, and ultimately maximize real-world impact.
Evaluation is typically performed offline during development, where models are tested under controlled conditions. This includes hyperparameter tuning, benchmarking against baselines or peer models, and regression testing. Offline evaluation provides an essential first signal before deployment, even though it cannot fully capture production behavior.
Evaluations help determine how well an LLM generates outputs that are relevant, accurate, and helpful. LangChain offers several evaluation approaches, including output comparison, pairwise string evaluation, string and embedding distances, and criteria-based scoring. Results can be aggregated to identify preferred models or prompts, and statistical measures such as confidence intervals or p-values can be used to assess robustness.
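As an illustration of the aggregation step, the sketch below computes a win rate with a normal-approximation confidence interval over hypothetical pairwise results; this is not a LangChain API, just the underlying statistics.

```python
import math

def win_rate_ci(wins: int, total: int, z: float = 1.96):
    """Win rate of model A over model B with a normal-approximation 95% CI."""
    p = wins / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

# Suppose model A was preferred in 70 of 100 pairwise comparisons:
rate, (lo, hi) = win_rate_ci(70, 100)
```

If the interval excludes 0.5, the preference for one model is unlikely to be noise; with few comparisons the interval widens and the verdict stays inconclusive.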
LangChain provides multiple evaluators for LLM outputs. A common pattern is pairwise comparison, where an evaluator model chooses between two outputs for the same input. The aggregated results indicate an overall preference. Other evaluators score outputs against criteria such as correctness, relevance, or conciseness, even without reference labels.
By default, LangChain uses GPT-4 as the evaluation model, but this can be customized (e.g., ChatOpenAI or ChatAnthropic) when loading evaluators.
Comparing two outputs
Pairwise evaluation requires:
- An evaluator
- A dataset of inputs
- Two or more LLMs, chains, or agents
The typical workflow is:
- Create the evaluator using load_evaluator() (e.g., pairwise_string).
- Select a dataset of evaluation inputs.
- Define models to compare, including their configurations.
- Generate responses for each model, usually in batches.
- Evaluate pairs, often with randomized ordering to reduce bias.
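The randomized-ordering step can be sketched as follows; judge is a hypothetical evaluator callable that sees the pair in shuffled order, and the verdict is mapped back to the original labels so that position bias averages out over many comparisons.

```python
import random

def judge_pair(judge, output_a: str, output_b: str, rng: random.Random):
    """Present the pair in random order to reduce position bias, then map the
    evaluator's verdict back to the original labels."""
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    verdict = judge(first, second)  # returns "first" or "second"
    if verdict == "first":
        return "B" if swapped else "A"
    return "A" if swapped else "B"

# Toy judge that always prefers the longer answer (hypothetical):
def longer_wins(x, y):
    return "first" if len(x) >= len(y) else "second"

rng = random.Random(0)
winners = [judge_pair(longer_wins, "short", "a much longer answer", rng)
           for _ in range(20)]
```

Because the mapping undoes the shuffle, the toy judge's consistent preference survives regardless of presentation order, while a judge with pure position bias would average out to roughly 50/50.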
Example from the LangChain documentation:
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("labeled_pairwise_string")
evaluator.evaluate_string_pairs(
    prediction="there are three dogs",
    prediction_b="4",
    input="how many dogs are in the park?",
    reference="four",
)
Example output:
{
    'reasoning': 'Both responses are relevant to the question asked...',
    'value': 'B',
    'score': 0
}
The result includes a binary score (1 if the first prediction is preferred, 0 if the second) along with reasoning explaining the evaluator’s choice. In this example, the evaluator prefers prediction B (“4”), which matches the reference “four”. References can be omitted, leaving the pair to be judged solely by an LLM, but doing so introduces risk if the evaluator itself is wrong.
Comparing against criteria
LangChain also supports criteria-based evaluation, where outputs are scored against predefined or custom rubrics. Common criteria include conciseness, relevance, correctness, coherence, helpfulness, and controversiality.
The CriteriaEvalChain evaluates outputs with or without reference labels:
- Without references, the evaluator scores outputs directly against the criteria.
- With references, outputs are compared to ground truth labels to assess compliance.
Custom criteria can be defined as a dictionary of descriptions:
custom_criteria = {
    "simplicity": "Is the language straightforward and unpretentious?",
    "clarity": "Are the sentences clear and easy to understand?",
    "precision": "Is the writing precise, with no unnecessary words or details?",
    "truthfulness": "Does the writing feel honest and sincere?",
    "subtext": "Does the writing suggest deeper meanings or themes?",
}
evaluator = load_evaluator("pairwise_string", criteria=custom_criteria)
evaluator.evaluate_string_pairs(
    prediction="Every cheerful household shares a similar rhythm of joy; but sorrow, in each hous",
    prediction_b="Where one finds a symphony of joy, every domicile of happiness resounds in harm",
    input="Write some prose about families.",
)
This approach enables nuanced qualitative comparisons, as reflected in evaluator reasoning such as:
{'reasoning': 'Response A is simple, clear, and precise...'}
In addition to custom criteria, LangChain includes predefined principles, such as those inspired by Constitutional AI, which are useful for evaluating ethical, harmful, or sensitive content. Principle-based evaluation allows teams to systematically assess alignment and safety alongside task performance.
String and semantic comparisons
LangChain supports both string-based and semantic comparison methods for evaluating LLM outputs.
String distance metrics such as Levenshtein and Jaro provide quantitative similarity scores between predicted and reference strings. These metrics are simple, fast, and useful for basic unit tests and accuracy checks.
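To show what a string distance evaluator actually computes, here is a standard Levenshtein implementation with a normalized score in [0, 1]; LangChain delegates this computation to a library, so the sketch below is for intuition only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Scale to [0, 1]: 0 means identical, 1 means maximally different."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```

The classic example: turning "kitten" into "sitting" takes three edits, so the raw distance is 3 and the normalized distance is 3/7.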
For deeper evaluation, embedding distance evaluators measure semantic similarity by comparing vector representations of generated and reference texts. These embeddings can be computed using models such as GPT-based embeddings or Hugging Face’s SentenceTransformers, often producing more meaningful results than pure string distance.
Example from the documentation:
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("embedding_distance")
evaluator.evaluate_strings(
    prediction="I shall go",
    reference="I shan't go",
)
This returns a score such as:
0.0966466944859925
Different embedding models can be selected via the embeddings parameter in load_evaluator(). In addition to embedding-based evaluators, LangChain also provides traditional string distance evaluators for comparing predicted outputs against references or inputs.
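Under the hood, an embedding distance score is typically one minus the cosine similarity of the two embedding vectors. The dependency-free sketch below uses toy three-dimensional vectors in place of real model embeddings to show the arithmetic.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; near 0 for semantically close embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (real models produce hundreds of dimensions):
close = cosine_distance([1.0, 0.2, 0.0], [0.9, 0.3, 0.1])
far = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Nearly parallel vectors yield a distance close to 0, while orthogonal vectors yield exactly 1 — which is why the "I shall go"/"I shan't go" pair above scores low despite the opposite meanings: the surface wording dominates the embedding.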
LangChain further supports agent trajectory evaluation, where the evaluate_agent_trajectory() method evaluates not just the final output, but also the sequence of steps taken by an agent.
Benchmark dataset with LangSmith
LangSmith enables systematic evaluation of model performance against datasets. To get started, create an account at: https://smith.langchain.com
After obtaining an API key, set it as an environment variable and enable tracing:
import os
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "My Project"
This configuration logs LangChain runs to LangSmith. If no project is specified, runs are logged to the default project.
Logging a run
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI()
llm.predict("Hello, world!")
Runs can be viewed in the LangSmith UI or listed programmatically:
from langsmith import Client
client = Client()
runs = list(client.list_runs())  # list_runs() returns an iterator
print(runs)
Each run includes inputs and outputs:
print(f"inputs: {runs[0].inputs}")
print(f"outputs: {runs[0].outputs}")
Creating a dataset
Datasets can be created from existing runs or manually defined inputs:
questions = [
    "A ship's parts are replaced over time until no original parts remain. Is it still the same s",
    "If someone lived their whole life chained in a cave seeing only shadows, how would they reac",
    "Is something good because it is natural, or bad because it is unnatural? Why can this be a f",
    "If a coin is flipped 8 times and lands on heads each time, what are the odds it will be tail",
    "Present two choices as the only options when others exist. Is the statement \"You're either",
    "Do people tend to develop a preference for things simply because they are familiar with them",
    "Is it surprising that the universe is suitable for intelligent life since if it weren't, no",
    "If Theseus' ship is restored by replacing each plank, is it still the same ship? What is ide",
    "Does doing one thing really mean that a chain of increasingly negative events will follow?",
    "Is a claim true because it hasn't been proven false? Why could this impede reasoning?",
]
shared_dataset_name = "Reasoning and Bias"
ds = client.create_dataset(
    dataset_name=shared_dataset_name,
    description="A few reasoning and cognitive bias questions",
)
for q in questions:
    client.create_example(inputs={"input": q}, dataset_id=ds.id)
Running a model on the dataset
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
llm = ChatOpenAI(model="gpt-4", temperature=0.0)

def construct_chain():
    return LLMChain.from_string(
        llm,
        template="Help out as best you can.\nQuestion: {input}\nResponse: ",
    )
A constructor function ensures a fresh chain for each input.
Defining evaluators
from langchain.smith import RunEvalConfig
evaluation_config = RunEvalConfig(
    evaluators=[
        RunEvalConfig.Criteria(
            {"helpfulness": "Is the response helpful?"}
        ),
        RunEvalConfig.Criteria(
            {"insightful": "Is the response carefully thought out?"}
        ),
    ]
)
Running evaluation
from langchain.smith import run_on_dataset
results = run_on_dataset(
    client=client,
    dataset_name=shared_dataset_name,
    llm_or_chain_factory=construct_chain,
    evaluation=evaluation_config,
)
An asynchronous version, arun_on_dataset(), is also available.
Evaluator feedback can be inspected directly in the LangSmith UI:

Clicking an evaluation reveals the prompt used by the evaluator, including the original model output and evaluation criteria. For example, an insightfulness evaluator may conclude:
The submission provides a clear and concise explanation of the appeal to nature fallacy…
Therefore, the submission does meet the criterion of being insightful and carefully thought out.
LangSmith can also assist with few-shot prompting by identifying high-quality examples from datasets. Additional examples are available in the LangSmith documentation.
This concludes evaluation. Once we are satisfied with agent performance, we can move on to deployment.
How to deploy your LLM apps?
As LLM adoption grows across industries, effective production deployment becomes essential. Deployment frameworks and services help address scalability, reliability, and operational complexity. Productionizing LLM applications requires familiarity with the broader generative AI ecosystem, including:
- Models and LLM-as-a-Service: Hosted APIs or self-managed models.
- Reasoning heuristics: RAG, Tree-of-Thought, and related techniques.
- Vector databases: Retrieval of relevant context for prompts.
- Prompt engineering tools: Enable in-context learning without costly fine-tuning.
- Pre-training and fine-tuning: Domain- or task-specific optimization.
- Prompt logging, testing, and analytics: Tools to understand and improve model behavior.
- Custom LLM stacks: Tooling for deploying open-source model solutions.
Earlier chapters covered models, reasoning heuristics, vector databases, and fine-tuning. This chapter focuses on logging, monitoring, and deployment tooling.
LLMs can be consumed via external providers (e.g., OpenAI, Anthropic), where infrastructure is managed for you, or via self-hosted open-source models, which can reduce cost, latency, and privacy risks.
Several frameworks offer end-to-end deployment capabilities. For example:
- Chainlit enables ChatGPT-like UIs with LangChain agents.
- BentoML packages models as scalable microservices with OpenAPI and gRPC endpoints.
- Steamship provides managed endpoints, horizontal scaling, persistent state, and multi-tenancy.
- Azure ML Online Endpoints support enterprise-grade deployment.
Deployment services and frameworks
| Name | Description | Type |
|---|---|---|
| Streamlit | Build and deploy Python web apps | Framework |
| Gradio | Model interfaces hosted on Hugging Face | Framework |
| Chainlit | ChatGPT-like conversational apps | Framework |
| Apache Beam | Data processing and orchestration | Framework |
| Vercel | Deploy and scale web apps | Cloud Service |
| FastAPI | High-performance Python API framework | Framework |
| Fly.io | App hosting with autoscaling and CDN | Cloud Service |
| DigitalOcean App Platform | Managed app deployment | Cloud Service |
| Google Cloud | Container hosting (e.g., Cloud Run) | Cloud Service |
| Steamship | ML infrastructure platform | Cloud Service |
| langchain-serve | Serve LangChain agents as APIs | Framework |
| BentoML | Model serving and deployment | Framework |
| OpenLLM | Run and serve open-source LLMs in production | Framework |
| Databutton | No-code ML app deployment | Cloud Service |
| Azure ML | Managed MLOps service | Cloud Service |
All tools are well documented and support common LLM use cases. We’ve already seen Streamlit and Gradio deployments, including hosting on Hugging Face Hub.
Core deployment requirements
LLM applications typically require:
- Scalable infrastructure for compute-intensive workloads
- Low-latency inference
- Persistent storage for conversations and state
- APIs for integration
- Monitoring and logging for metrics and behavior
Cost management is a major challenge. Common strategies include self-hosting, autoscaling, batching requests, spot instances, and independent scaling of components.
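Of the strategies above, request batching is the easiest to sketch: group pending prompts and make one model invocation per batch instead of one per prompt. Here echo_model is a hypothetical stand-in for a batched LLM endpoint.

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list of pending requests."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def serve_all(prompts, model_call, batch_size=8):
    """One model invocation per batch instead of one per prompt."""
    outputs = []
    calls = 0
    for batch in batched(prompts, batch_size):
        outputs.extend(model_call(batch))
        calls += 1
    return outputs, calls

# Hypothetical stand-in for a batched LLM endpoint:
def echo_model(batch):
    return [p.upper() for p in batch]

outputs, calls = serve_all([f"prompt {i}" for i in range(20)], echo_model,
                           batch_size=8)
```

Twenty prompts become three calls, which matters when billing or GPU scheduling is per invocation; the trade-off is added latency for requests that wait for a batch to fill.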
Flexibility is critical in a rapidly evolving ecosystem. Avoid tight coupling to a single vendor by using modular architectures, Infrastructure as Code (Terraform, CloudFormation, Kubernetes), and CI/CD pipelines to automate testing and deployment.
LangChain integrates well with frameworks such as Ray Serve, BentoML, OpenLLM, Modal, and Jina. Next, we demonstrate deployment using FastAPI.
FastAPI webserver
FastAPI is a popular, high-performance Python framework for building APIs. Lanarky is a lightweight open-source library built on top of FastAPI that simplifies LLM app deployment, providing REST and WebSocket endpoints plus a browser-based testing UI with minimal code.
A REST API enables applications to communicate over HTTP using standard methods (GET, POST, etc.), typically exchanging JSON.
Using an example from the Lanarky documentation, we implement a chatbot webserver.
Setup and imports
from fastapi import FastAPI
from lanarky.testing import mount_gradio_app
from langchain import ConversationChain
from langchain.chat_models import ChatOpenAI
from lanarky import LangchainRouter
from starlette.requests import Request
from starlette.templating import Jinja2Templates
Ensure environment variables are set as described in Chapter 3.
Create the LLM chain
def create_chain():
    return ConversationChain(
        llm=ChatOpenAI(
            temperature=0,
            streaming=True,
        ),
        verbose=True,
    )

chain = create_chain()
Initialize the app and templates
app = mount_gradio_app(FastAPI(title="ConversationChainDemo"))
templates = Jinja2Templates(directory="webserver/templates")
Define routes
@app.get("/")
async def get(request: Request):
    return templates.TemplateResponse("index.html", {"request": request})
Add LangChain routes
langchain_router = LangchainRouter(
    langchain_url="/chat", langchain_object=chain, streaming_mode=1
)
langchain_router.add_langchain_api_route(
    "/chat_json", langchain_object=chain, streaming_mode=2
)
langchain_router.add_langchain_api_websocket_route("/ws", langchain_object=chain)
app.include_router(langchain_router)
Run the server
uvicorn webserver.chat:app --reload
The app is available at:
http://127.0.0.1:8000
The --reload flag restarts the server automatically on code changes.

This setup provides a REST API, web UI, and WebSocket interface. While Uvicorn does not handle load balancing itself, it integrates easily with tools like Nginx or HAProxy for horizontal scaling, improved latency, and fault tolerance.
Next, we explore scalable deployment with Ray.
Ray
Ray is a flexible framework for scaling generative AI workloads across clusters. It supports low-latency serving, distributed training, and large-scale batch inference. Key capabilities include:
- Distributed training with Ray Train
- Scalable serving with Ray Serve
- Parallel batch inference with Ray Data
- End-to-end workflow orchestration
Using LangChain and Ray, we build a simple semantic search engine over Ray documentation (based on examples from the Anyscale blog and langchain-ray).
Indexing documents
# Imports for the example; LocalHuggingFaceEmbeddings comes from the
# langchain-ray example project, and the splitter settings are illustrative.
from langchain.document_loaders import RecursiveUrlLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

loader = RecursiveUrlLoader("https://docs.ray.io/en/master/")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)
chunks = text_splitter.create_documents(
    [doc.page_content for doc in docs],
    metadatas=[doc.metadata for doc in docs],
)
embeddings = LocalHuggingFaceEmbeddings('multi-qa-mpnet-base-dot-v1')
db = FAISS.from_documents(chunks, embeddings)
To accelerate indexing, embeddings can be computed in parallel:
import numpy as np
import ray

@ray.remote(num_gpus=1)
def process_shard(shard):
    embeddings = LocalHuggingFaceEmbeddings('multi-qa-mpnet-base-dot-v1')
    return FAISS.from_documents(shard, embeddings)

shards = np.array_split(chunks, 8)
futures = [process_shard.remote(shard) for shard in shards]
results = ray.get(futures)

db = results[0]
for result in results[1:]:
    db.merge_from(result)
Save the index:
db.save_local(FAISS_INDEX_PATH)
Serving with Ray Serve
embedding = LocalHuggingFaceEmbeddings('multi-qa-mpnet-base-dot-v1')
db = FAISS.load_local(FAISS_INDEX_PATH, embedding)

@serve.deployment
class SearchDeployment:
    def __init__(self):
        self.db = db
        self.embedding = embedding

    def __call__(self, request):
        query = request.query_params["query"]
        results = self.db.max_marginal_relevance_search(query)
        return format_results(results)
deployment = SearchDeployment.bind()
serve.run(deployment)
Query the service:
import requests
query = "What are the different components of Ray and how can they help with LLMs?"
response = requests.post(
    "http://localhost:8000/", params={"query": query}
)
print(response.text)
Ray also provides a powerful monitoring interface:

The dashboard exposes metrics and system state. Metrics collection is straightforward using counters, gauges, and histograms, and can integrate with Prometheus or Grafana for time-series visualization. The full example also supports FastAPI-based serving.
This concludes deployment. As LLM applications scale and become business-critical, observability and monitoring are essential to ensure ongoing reliability, performance, and correctness. The next section focuses on monitoring strategies and key metrics for LLM systems.
How to observe your LLM apps?
Offline evaluations rarely cover all real-world scenarios LLMs may face in production. Observability addresses this gap by enabling continuous, real-time monitoring to capture anomalies that offline tests miss. It involves logging, tracking, tracing, and alerting to ensure system health, optimize performance, and detect issues like model drift early. LLMs are increasingly critical in domains like health, e-commerce, and education.
Tracking, Tracing, and Monitoring
These three concepts are essential in software operations:
- Tracking & Tracing: Keep detailed historical records for debugging and analysis.
- Monitoring: Real-time oversight to detect issues and maintain system functionality.
All three fall under observability, with monitoring focused on metrics like CPU, memory, network latency, and application response times. Effective monitoring also includes alerts for anomalies.
Goals of Monitoring LLMs
Monitoring provides insights into model performance and behavior through live data, enabling:
- Preventing Model Drift: Detect early when models degrade due to shifts in input or user behavior.
- Performance Optimization: Track inference times, throughput, and resource usage.
- A/B Testing: Compare model variants to guide improvements.
- Debugging: Quickly identify and resolve unforeseen runtime issues.
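A minimal illustration of the drift-detection goal above: a two-sample z-test for a shift in the mean of a monitored metric between a reference window and live traffic. Production systems typically use richer tests (PSI, Kolmogorov–Smirnov) across many features; this shows only the core idea.

```python
import math

def mean_shift_z(reference, live):
    """Two-sample z statistic for a shift in the mean of a monitored metric."""
    n1, n2 = len(reference), len(live)
    m1 = sum(reference) / n1
    m2 = sum(live) / n2
    v1 = sum((x - m1) ** 2 for x in reference) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in live) / (n2 - 1)
    return (m2 - m1) / math.sqrt(v1 / n1 + v2 / n2)

def drifted(reference, live, threshold=3.0):
    """Flag drift when the mean shift exceeds the z threshold."""
    return abs(mean_shift_z(reference, live)) > threshold

stable = drifted([1.0, 1.1, 0.9, 1.0] * 25, [1.0, 1.05, 0.95, 1.0] * 25)
shifted = drifted([1.0, 1.1, 0.9, 1.0] * 25, [2.0, 2.1, 1.9, 2.0] * 25)
```

The metric could be anything logged per request — latency, output length, a toxicity score — and the reference window would come from a period of known-good behavior.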
Key Monitoring Considerations
- Metrics: Prediction accuracy, latency, throughput, etc.
- Frequency: Critical models may need near real-time monitoring.
- Logging: Comprehensive logs to trace anomalies.
- Alerting: Notify on performance drops or anomalous behavior.
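The alerting item can be sketched as a rolling error-rate check over the last N requests; a real deployment would route the fired alert to a pager or incident channel rather than returning a boolean.

```python
from collections import deque

class ErrorRateAlert:
    """Fires when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.window = window
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)

    def record(self, ok: bool) -> bool:
        """Record one request outcome; return True if the alert should fire."""
        self.outcomes.append(ok)
        if len(self.outcomes) < self.window:
            return False  # not enough data for a stable rate yet
        error_rate = self.outcomes.count(False) / len(self.outcomes)
        return error_rate > self.threshold

alert = ErrorRateAlert(window=10, threshold=0.2)
fired = [alert.record(ok) for ok in [True] * 7 + [False] * 3]
```

The fixed-size deque gives a sliding window for free, so old failures age out instead of poisoning the rate forever.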
Relevant Metrics
| Metric | Purpose |
|---|---|
| Inference Latency | Ensures fast, responsive outputs |
| QPS (Queries/sec) | Assesses scalability |
| TPS (Tokens/sec) | Evaluates efficiency & compute needs |
| Token Usage | Monitors resource consumption & costs |
| Error Rate | Maintains output quality |
| Resource Utilization | Optimizes CPU/GPU/memory usage |
| Model Drift | Detects output changes over time |
| Out-of-Distribution Inputs | Flags unexpected queries |
| User Feedback | Tracks satisfaction and validates performance |
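QPS and TPS from the table above can be derived from a simple request log of (timestamp, token count) records, as in this sketch:

```python
def throughput(log):
    """Compute queries/sec and tokens/sec from (timestamp_sec, tokens) records."""
    if len(log) < 2:
        raise ValueError("need at least two records")
    timestamps = [t for t, _ in log]
    window = max(timestamps) - min(timestamps)
    qps = len(log) / window
    tps = sum(tokens for _, tokens in log) / window
    return qps, tps

# 10 requests spaced 0.5 s apart, 100 tokens each (so the log spans 4.5 s):
log = [(i * 0.5, 100) for i in range(10)]
qps, tps = throughput(log)
```

In practice these would be computed over fixed reporting intervals (say, per minute) and exported as time series to a dashboard.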
Data scientists should also check for staleness, bias, or sudden feature importance changes using tools like LIME and SHAP, as offline metrics like AUC may not reflect real-world impact. Direct business metrics (e.g., clicks, purchases) are often more meaningful.
Effective monitoring ensures reliable LLM deployment, builds user trust, and maximizes system efficiency. Always verify privacy and data protection policies, especially when using cloud platforms.
Next, we’ll explore monitoring the trajectory of an agent.
Tracking and Tracing LLM Agents
Tracking records information about operations within an application. In ML projects, this includes parameters, hyperparameters, metrics, and outcomes across experiments—helping document progress and changes over time.
Tracing is a specialized form of tracking, recording the execution flow, especially in distributed systems. It creates a detailed “breadcrumb trail” for each request, helping identify latency or failures by showing exactly where they occur.
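The “breadcrumb trail” idea can be illustrated with a small tracer that records nested, timed spans via a context manager; real tracing systems (e.g., OpenTelemetry) add trace IDs, sampling, and cross-service propagation on top of this pattern.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Records finished spans as (name, depth, duration_sec) breadcrumbs."""

    def __init__(self):
        self.spans = []
        self._depth = 0

    @contextmanager
    def span(self, name: str):
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            duration = time.perf_counter() - start
            self._depth -= 1
            self.spans.append((name, self._depth, duration))

tracer = Tracer()
with tracer.span("handle_request"):       # outer span for the whole request
    with tracer.span("retrieve_context"):  # nested step
        pass
    with tracer.span("llm_call"):          # nested step
        pass
```

Inner spans finish (and are recorded) before the outer one, so the durations immediately reveal which step of a slow request dominated the latency.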
Tracking agent trajectories can be complex due to their broad actions and generative outputs. LangChain simplifies this with trajectory tracking and evaluation. By setting return_intermediate_steps=True when initializing an agent or LLM, you can capture detailed traces.
Example: Ping a website using an LLM agent:
import subprocess
from urllib.parse import urlparse

from langchain.tools import tool
from pydantic import HttpUrl

@tool
def ping(url: HttpUrl, return_error: bool) -> str:
    """Ping the fully specified url. Must include https://"""
    hostname = urlparse(str(url)).netloc
    completed_process = subprocess.run(
        ["ping", "-c", "1", hostname], capture_output=True, text=True
    )
    output = completed_process.stdout
    if return_error and completed_process.returncode != 0:
        return completed_process.stderr
    return output
from langchain.agents import AgentType, initialize_agent
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0)
agent = initialize_agent(
    llm=llm,
    tools=[ping],
    agent=AgentType.OPENAI_MULTI_FUNCTIONS,
    return_intermediate_steps=True,
)
result = agent("What's the latency like for https://langchain.com?")
The agent reports the latency, and result["intermediate_steps"] shows all actions taken, providing full visibility into the agent’s behavior.
Observability Tools for LLMs
Many tools integrate with LangChain to enhance tracking, tracing, and monitoring:
| Tool | Features |
|---|---|
| Argilla | Human-in-the-loop data curation for fine-tuning |
| Portkey | Metrics, tracing, caching, retries |
| Comet.ml | Experiment tracking, model comparison, optimization |
| LLMonitor | Cost, usage analytics, tracing, evaluation |
| DeepEval | Relevance, bias, toxicity metrics; model drift testing |
| Aim | Logs inputs, outputs, serialized state for visual debugging |
| Splunk ML Toolkit | Production model observability |
| ClearML | Automates training pipelines, research → production |
| IBM Watson OpenScale | AI health monitoring and risk mitigation |
| DataRobot MLOps | Detects performance issues before they impact users |
| Datadog APM | Captures requests, latency, errors, token/cost usage |
| Weights & Biases (W&B) | Tracks metrics, fine-tuning, prompt comparisons |
| Langfuse | Open-source, self-hostable; monitors latency, cost, scores |
Most are easy to integrate. For instance:
# W&B tracing (requires the wandb package to be installed)
import os
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

# Langfuse tracing
from langfuse.callback import CallbackHandler
chain.run(..., callbacks=[CallbackHandler()])
Many tools support on-premise deployment, essential for privacy-sensitive applications. With these platforms, you can visualize agent execution, detect loops, analyze latency, and share traces with collaborators for improvements.
LangSmith: Debugging, Evaluating, and Monitoring LLMs
LangSmith, developed by LangChain AI, is a framework for debugging, testing, evaluating, and monitoring LLM applications. It’s designed for MLOps, helping developers take LLMs from prototype to production by providing tools to optimize latency, cost, and hardware efficiency. Its intuitive interface also lowers the barrier for non-software developers.
Key Features:
- Log traces from LangChain agents, chains, and components
- Create datasets to benchmark model performance
- Configure AI-assisted evaluators to grade models
- View metrics, visualizations, and feedback to iterate and improve
LangSmith covers the full MLOps workflow for LLMs, from debugging to optimization, and integrates tightly with LangChain to enhance development experience.
Metrics and Dashboard
LangSmith provides a rich monitoring dashboard with graphs for key statistics, which can be broken down by time intervals:
| Category | Metrics |
|---|---|
| Statistics | Trace Count, LLM Call Count, Trace Success Rates, LLM Call Success Rates |
| Latency | Trace Latency, LLM Latency, LLM Calls per Trace |
| Tokens | Total Tokens, Tokens per Trace, Tokens per LLM Call, Tokens/sec |
| Streaming | % Traces w/ Streaming, % LLM Calls w/ Streaming, Trace Time-to-First-Token, LLM Time-to-First-Token |
Example: Tracing a benchmark dataset run captures detailed execution steps of LLMs:

Deployment and Alternatives
- LangSmith itself is not open-source, but supports self-hosting for privacy-conscious organizations.
- Alternatives with overlapping features include: Langfuse, Weights & Biases, Datadog APM, Portkey, and PromptWatch.
- LangSmith is highlighted for its comprehensive evaluation and monitoring features and tight integration with LangChain.
Next, we’ll explore PromptWatch.io for prompt tracking in production LLM environments.
PromptWatch: Tracking Prompts and Outputs
PromptWatch enables detailed tracking of prompts and generated outputs for LLMs in production, providing visibility into inputs, outputs, execution, and costs.
Example: Setting up PromptWatch with LangChain:
from langchain import LLMChain, OpenAI, PromptTemplate
from promptwatch import PromptWatch
from config import set_environment
set_environment() # Sets API keys in the environment
prompt_template = PromptTemplate.from_template("Finish this sentence {input}")
my_chain = LLMChain(llm=OpenAI(), prompt=prompt_template)
with PromptWatch() as pw:
    my_chain("The quick brown fox jumped over")
Using PromptWatch.io, developers can:
- Track all aspects of LLM chains: actions, retrieved documents, inputs/outputs, execution time, and tool details
- Analyze and troubleshoot using a visual, user-friendly interface
- Optimize prompt templates and monitor costs
- Conduct unit testing and version control for prompt templates
Summary
Deploying LLMs in production is complex but manageable with careful consideration of:
- Data quality, bias, and ethics
- Regulatory compliance and interpretability
- Resource management and ongoing monitoring
Key takeaways from LLM evaluation and monitoring:
- Evaluation: Compare models using criteria like string matching, semantic similarity, and performance metrics to ensure outputs are accurate and relevant
- Monitoring: Essential for tracking LLM performance, detecting anomalies, and maintaining reliability
- Tools:
- LangSmith: Tracks, benchmarks, and optimizes LLMs with automated evaluators and visual dashboards
- PromptWatch: Provides complete visibility into prompts, outputs, and chain execution
Effective monitoring and evaluation help maintain trust, efficiency, and performance of LLM applications in production environments.