Observability

LangSmith and Langfuse: Beginner to Advanced Monitoring

When building AI agents or LLM applications, you need to understand how the system behaves. LangSmith and Langfuse are tools for monitoring prompts, tracing workflows, debugging errors, and evaluating AI responses. They help developers improve the reliability and performance of AI applications.

Best for

Readers who want a practical, role-based learning guide with clear progression from fundamentals to advanced implementation.

Not ideal for

Visitors looking for a short definition page without examples, sections, or a guided learning path.

Why Monitoring AI Systems Is Important

AI systems often involve multiple steps such as prompt creation, tool calls, database queries, and reasoning chains.

Without monitoring tools, it becomes difficult to understand why the AI produced a certain response.

Observability platforms like LangSmith and Langfuse allow developers to track each step of an AI workflow.

User Question
      ↓
    Prompt
      ↓
   LLM Model
      ↓
 Tool / API Call
      ↓
 LangSmith / Langfuse Trace
      ↓
 Debug + Improve Prompt
      ↓
 Better Final Answer
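
To make the flow above concrete, here is a minimal sketch of the kind of data a trace holds, using only the Python standard library. The `Trace` and `Step` classes, the `record` helper, and `fake_llm` are hypothetical names invented for this illustration. They are not the LangSmith or Langfuse API; those platforms capture comparable data for you.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Step:
    """One unit of work inside a trace (prompt build, LLM call, tool call)."""
    name: str
    input_text: str
    output: str = ""
    error: str = ""
    latency_ms: float = 0.0


@dataclass
class Trace:
    """All steps recorded while answering one user question."""
    question: str
    steps: list = field(default_factory=list)

    @contextmanager
    def record(self, name: str, input_text: str):
        """Time a step and keep its input, output, and any error."""
        step = Step(name=name, input_text=input_text)
        start = time.perf_counter()
        try:
            yield step
        except Exception as exc:
            step.error = repr(exc)
            raise
        finally:
            step.latency_ms = (time.perf_counter() - start) * 1000
            self.steps.append(step)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"(answer to: {prompt})"


trace = Trace(question="What is an AI agent?")
with trace.record("build_prompt", trace.question) as step:
    step.output = f"Explain simply: {trace.question}"
prompt = trace.steps[-1].output
with trace.record("llm_call", prompt) as step:
    step.output = fake_llm(prompt)

for s in trace.steps:
    print(f"{s.name}: {s.latency_ms:.2f} ms -> {s.output}")
```

Reading a trace then amounts to walking `trace.steps` in order and asking which step produced the first surprising output or the largest latency.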

What Is LangSmith?

LangSmith is a platform built by the LangChain team to help developers debug and evaluate LLM applications.

It records every step of the chain or agent workflow, allowing developers to inspect prompts, outputs, and reasoning.

This helps identify prompt issues, tool failures, and performance problems.

What Is Langfuse?

Langfuse is an open-source observability tool for LLM applications.

It helps track prompts, responses, token usage, latency, and workflow traces.

Developers can monitor production AI systems and understand how users interact with their models.

Simple Example Code

This simplified example shows a basic LangChain call. Once tracing is configured as described in the following sections, calls like this one are recorded as traces.

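from langchain_openai import ChatOpenAI  # pip install langchain-openai

llm = ChatOpenAI()

question = "Explain AI agents in simple language"

# invoke() returns a message object; the generated text is in .content
response = llm.invoke(question).content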

print(response)

Intermediate: Setting Up LangSmith Tracing

LangSmith traces every step of your LangChain application automatically once you set the environment variables. No changes to your chain code are needed.

Each trace shows: the full prompt sent to the model, token counts, latency, tool calls, intermediate steps, and the final output.

This is the fastest way to understand why your agent gave a wrong or unexpected answer.

import os
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

# ── Step 1: Set LangSmith environment variables ──
# (Set these in your .env file or terminal before running)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "StudyAssistant-Dev"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"  # From smith.langchain.com
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# ── Step 2: Build a chain as normal (LangSmith traces it automatically) ──
prompt = ChatPromptTemplate.from_template(
    "You are a study assistant. Explain {topic} clearly in 3 bullet points."
)

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
parser = StrOutputParser()

chain = prompt | llm | parser

# ── Step 3: Run the chain ──
result = chain.invoke({"topic": "neural networks"})
print(result)

# After running, go to https://smith.langchain.com to see:
# - The full prompt that was sent
# - The model's token usage and cost
# - Latency for each step
# - The exact output at each stage of the chain

# ── Tagging runs for easier filtering ──
from langchain_core.tracers import LangChainTracer

tracer = LangChainTracer(project_name="StudyAssistant-Dev")

# Run with metadata tags for filtering in the UI
result = chain.invoke(
    {"topic": "photosynthesis"},
    config={
        "callbacks": [tracer],
        "tags": ["science", "beginner"],
        "metadata": {"user_id": "student_42", "session": "session_001"}
    }
)
print(result)
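
Because tracing is switched on purely through environment variables, a missing or misspelled variable can leave you with no traces and no error message. The sketch below checks the variables used in the example above before the app starts. `missing_tracing_env` is a hypothetical helper, not part of LangSmith, and treating any flag value other than "true" (case-insensitive) as a mistake is an assumption of this sketch.

```python
import os

# Variables the LangSmith example above relies on.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")


def missing_tracing_env(env=None) -> list:
    """Return a list of problems with the tracing variables (empty if none)."""
    env = os.environ if env is None else env
    problems = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    # Assumption: the flag should read "true", as in the example above.
    flag_set = "LANGCHAIN_TRACING_V2" not in problems
    if flag_set and env["LANGCHAIN_TRACING_V2"].strip().lower() != "true":
        problems.append("LANGCHAIN_TRACING_V2 (must be 'true')")
    return problems


example_env = {
    "LANGCHAIN_TRACING_V2": "True",
    "LANGCHAIN_API_KEY": "",  # forgotten key
    "LANGCHAIN_PROJECT": "StudyAssistant-Dev",
}
print(missing_tracing_env(example_env))
```

Calling `missing_tracing_env()` with no argument checks the real process environment, which makes it easy to fail fast at startup.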

Intermediate: Setting Up Langfuse Tracing

Langfuse is an open-source alternative to LangSmith. It can be self-hosted or used as a cloud service. It supports LangChain, OpenAI SDK, and custom integrations.

The Langfuse Python SDK provides a LangChain callback handler that sends trace data to your Langfuse dashboard as the chain runs.

Use Langfuse when you need more control over your data, want lower cost, or plan to self-host your observability infrastructure.

# Written against the Langfuse Python SDK v2, where the LangChain
# handler lives in langfuse.callback.
from langfuse import Langfuse
from langfuse.callback import CallbackHandler
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
import os

# ── Step 1: Initialize Langfuse client ──
# Get keys from https://cloud.langfuse.com (or your self-hosted instance)
langfuse = Langfuse(
    public_key=os.environ.get("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.environ.get("LANGFUSE_SECRET_KEY"),
    host="https://cloud.langfuse.com"  # or your self-hosted URL
)

# ── Step 2: Create a callback handler for LangChain ──
langfuse_handler = CallbackHandler(
    public_key=os.environ.get("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.environ.get("LANGFUSE_SECRET_KEY"),
    session_id="student-session-001",
    user_id="student-42",
    trace_name="study-question"
)

# ── Step 3: Build chain as normal ──
prompt = ChatPromptTemplate.from_template("Explain {topic} simply.")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = prompt | llm | StrOutputParser()

# ── Step 4: Run with Langfuse callback ──
result = chain.invoke(
    {"topic": "machine learning"},
    config={"callbacks": [langfuse_handler]}
)
print(result)

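# ── Step 5: Log custom events (scores, user feedback) ──
# Attach the score to the trace the chain just produced. Calling
# langfuse.trace() here would create a new, unrelated trace instead.
langfuse.score(
    trace_id=langfuse_handler.get_trace_id(),
    name="user-rating",
    value=0.9,
    comment="Clear and helpful"
)

# The handler buffers events in its own client, so flush it as well
langfuse_handler.flush()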

# Flush to ensure all data is sent
langfuse.flush()
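
The score logged above is a number between 0 and 1, but user feedback usually arrives as a thumbs up/down or a star rating. A minimal sketch of one way to normalize it before logging follows. `feedback_to_score` and the two feedback kinds are hypothetical choices for this illustration, not part of the Langfuse SDK.

```python
def feedback_to_score(kind: str, value) -> float:
    """Map raw user feedback onto a 0.0-1.0 score.

    kind="thumbs": value is True (up) or False (down).
    kind="stars":  value is an integer from 1 to 5.
    """
    if kind == "thumbs":
        return 1.0 if value else 0.0
    if kind == "stars":
        if not 1 <= value <= 5:
            raise ValueError(f"stars must be 1-5, got {value}")
        return (value - 1) / 4  # 1 star -> 0.0, 5 stars -> 1.0
    raise ValueError(f"unknown feedback kind: {kind}")


print(feedback_to_score("thumbs", True))  # 1.0
print(feedback_to_score("stars", 4))      # 0.75
```

Keeping every score on the same scale makes it possible to filter traces with a single threshold later.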

Intermediate: Analyzing Traces to Improve Prompts

Once traces are flowing into LangSmith or Langfuse, the real work begins: reading them to find problems and improve your system.

Look for: prompts that produce vague answers, tool calls that return errors, high-latency steps, and responses where the model repeated itself or went off-topic.

This systematic trace review process is how experienced teams continuously improve their AI applications.

Weekly Trace Review Process

1. Filter traces by low quality score or user complaints
2. Open each trace and read the full prompt sent to model
3. Identify: Was the prompt clear? Did tools work?
4. Find the most common failure pattern this week
5. Update the prompt or tool to fix the root cause
6. Re-run test cases with new prompt
7. Deploy if test scores improve
8. Monitor next week's traces for regression
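
Step 4 of the process above can be sketched with a simple counter, assuming reviewers attach one failure label to each bad trace during steps 2 and 3. The labels and the `reviewed` sample data are made up for illustration, and `top_failures` is a hypothetical helper.

```python
from collections import Counter

# Sample review notes: one failure label per low-quality trace (made-up data).
reviewed = [
    {"trace_id": "t1", "failure": "vague_prompt"},
    {"trace_id": "t2", "failure": "tool_error"},
    {"trace_id": "t3", "failure": "vague_prompt"},
    {"trace_id": "t4", "failure": "off_topic"},
    {"trace_id": "t5", "failure": "vague_prompt"},
]


def top_failures(rows, n=3):
    """Return the n most common failure labels with count and share of rows."""
    counts = Counter(row["failure"] for row in rows)
    total = sum(counts.values())
    return [(label, count, count / total) for label, count in counts.most_common(n)]


for label, count, share in top_failures(reviewed):
    print(f"{label}: {count} traces ({share:.0%})")
```

The label at the top of the list is the root cause to fix first in step 5.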

from langfuse import Langfuse
from datetime import datetime, timedelta

langfuse = Langfuse(
    public_key="your-public-key",
    secret_key="your-secret-key"
)

# ── Fetch recent traces for analysis ──
# Written against the Langfuse Python SDK v2 (fetch_traces / fetch_trace).
def get_low_quality_traces(min_score: float = 0.6, days_back: int = 7):
    """Find traces with quality scores below threshold."""

    # Get traces from the past week
    traces = langfuse.fetch_traces(
        limit=100,
        from_timestamp=datetime.now() - timedelta(days=days_back)
    )

    low_quality = []
    for summary in traces.data:
        # The list endpoint returns score IDs only, so load the full
        # trace to read score names and values (one request per trace).
        trace = langfuse.fetch_trace(summary.id).data
        scores = [
            s.value for s in (trace.scores or [])
            if s.name == "quality" and s.value is not None
        ]
        if scores and min(scores) < min_score:
            low_quality.append({
                "id": trace.id,
                "name": trace.name,
                "score": min(scores),
                "input": str(trace.input)[:100],
                "output": str(trace.output)[:100],
                # Langfuse reports trace latency in seconds
                "latency_ms": (summary.latency or 0) * 1000
            })

    return sorted(low_quality, key=lambda x: x["score"])

# ── Analyze latency hotspots ──
def find_slow_traces(max_latency_ms: int = 3000):
    """Find traces where latency is too high."""
    traces = langfuse.fetch_traces(limit=50)

    # Convert seconds to milliseconds before comparing with the threshold
    slow = [
        {"id": t.id, "latency_ms": t.latency * 1000, "name": t.name}
        for t in traces.data
        if t.latency and t.latency * 1000 > max_latency_ms
    ]
    return sorted(slow, key=lambda x: x["latency_ms"], reverse=True)

# Run analysis
print("Low quality traces:")
low_q = get_low_quality_traces(min_score=0.7)
for t in low_q[:5]:
    print(f"  [{t['score']:.1f}] {t['name']}: {t['input'][:50]}")
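
A fixed threshold tells you which traces are slow, but not how typical they are. The sketch below summarizes a list of latencies with nearest-rank percentiles, so you can compare the median with the tail. The `latencies_ms` values are made-up sample data, and `percentile` is a hypothetical helper rather than a Langfuse or LangSmith function.

```python
import math


def percentile(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds, e.g. collected from recent traces
latencies_ms = [420, 510, 480, 3900, 450, 530, 610, 470, 2950, 495]

print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("mean:", sum(latencies_ms) / len(latencies_ms))
```

In this sample the mean is more than double the median, which is the usual sign that a few slow traces, not the typical request, deserve attention.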

Advanced: Automated Evaluation with LangSmith

Instead of manually reviewing traces, you can set up evaluators that score every response automatically using an LLM-as-judge pattern.

Define evaluation criteria (correctness, helpfulness, safety). An evaluator LLM reads the question and response, then scores it 0-1 against each criterion.

Run the evaluator on datasets. Compare scores before and after prompt changes to track improvement objectively.

from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

client = Client()
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# ── Step 1: Create a dataset of test cases in LangSmith ──
dataset_name = "StudyAssistant-EvalSet"

# Only create dataset if it doesn't already exist
if not any(d.name == dataset_name for d in client.list_datasets()):
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="Evaluation set for study assistant prompts"
    )
    
    # Add test cases (input + expected reference output)
    examples = [
        {
            "inputs": {"topic": "photosynthesis"},
            "outputs": {"answer": "Photosynthesis uses sunlight, water, CO2 to make glucose and oxygen in plants."}
        },
        {
            "inputs": {"topic": "Newton's laws"},
            "outputs": {"answer": "Newton's three laws describe inertia, force=mass*acceleration, and action-reaction."}
        },
        {
            "inputs": {"topic": "DNA"},
            "outputs": {"answer": "DNA stores genetic information using base pairs A-T and C-G in a double helix."}
        }
    ]
    client.create_examples(dataset_id=dataset.id, examples=examples)
    print(f"Created dataset with {len(examples)} examples")

# ── Step 2: Define your chain to evaluate ──
chain_prompt = ChatPromptTemplate.from_template(
    "You are a science tutor. Explain {topic} clearly in 2-3 sentences."
)
chain = chain_prompt | llm | StrOutputParser()

def run_chain(inputs: dict) -> dict:
    result = chain.invoke(inputs)
    return {"answer": result}

# ── Step 3: Set up evaluators ──
evaluators = [
    LangChainStringEvaluator("cot_qa"),          # Correctness evaluator
    LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
]

# ── Step 4: Run the evaluation ──
eval_results = evaluate(
    run_chain,
    data=dataset_name,
    evaluators=evaluators,
    experiment_prefix="science-tutor-v1",
    metadata={"model": "gpt-3.5-turbo", "prompt_version": "v1"}
)

print("Evaluation complete. View results at https://smith.langchain.com")
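
To compare scores before and after a prompt change, you need the per-example scores from two experiments side by side. The sketch below assumes you have already exported them as plain dictionaries mapping each criterion to a list of 0-1 scores. The data is made up, and `compare_experiments` is a hypothetical helper, not a LangSmith API.

```python
from statistics import mean

# Made-up per-example scores for two experiments on the same dataset
baseline = {"correctness": [1, 0, 1, 1], "helpfulness": [1, 1, 0, 1]}
candidate = {"correctness": [1, 1, 1, 1], "helpfulness": [1, 0, 0, 1]}


def compare_experiments(before: dict, after: dict):
    """Return per-criterion mean scores and the criteria that got worse."""
    report = {}
    for criterion in sorted(before.keys() & after.keys()):
        old, new = mean(before[criterion]), mean(after[criterion])
        report[criterion] = {"before": old, "after": new, "delta": new - old}
    regressed = [c for c, r in report.items() if r["delta"] < 0]
    return report, regressed


report, regressed = compare_experiments(baseline, candidate)
for criterion, r in report.items():
    print(f"{criterion}: {r['before']:.2f} -> {r['after']:.2f} ({r['delta']:+.2f})")
print("Regressed:", regressed)
```

Here the new prompt improves correctness but hurts helpfulness, so a rule such as "deploy only if nothing regresses" would hold it back for another iteration.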

Advanced: Cost Monitoring and Budget Controls

Every LLM API call costs money. Without monitoring, costs can spike unexpectedly as your application scales or if a bug causes infinite retry loops.

Track token usage per user session, per feature, and per model. Set soft and hard limits. Alert when costs exceed thresholds.

Langfuse and LangSmith both show token usage per trace. You can also track costs in your own database for more granular control.

import os
from langfuse import Langfuse
from datetime import datetime, timedelta
from collections import defaultdict

langfuse = Langfuse(
    public_key=os.environ.get("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.environ.get("LANGFUSE_SECRET_KEY")
)

# Token cost estimates ($ per 1000 tokens, adjust to current pricing)
COST_PER_1K = {
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-4-turbo": {"input": 0.01, "output": 0.03},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Calculate estimated cost in USD for a single call."""
    if model not in COST_PER_1K:
        return 0.0
    rates = COST_PER_1K[model]
    cost = (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])
    return round(cost, 6)

def get_daily_cost_report(days_back: int = 7):
    """Generate a cost breakdown report by day and model."""
    # Token usage and model names are stored on generation observations,
    # not on traces, so query generations directly.
    # Written against the Langfuse Python SDK v2; page through results
    # (page=2, 3, ...) if you have more generations than one page holds.
    generations = langfuse.fetch_observations(
        type="GENERATION",
        limit=100,
        from_start_time=datetime.now() - timedelta(days=days_back)
    )

    daily_costs = defaultdict(lambda: defaultdict(float))
    total_tokens = defaultdict(int)

    for gen in generations.data:
        if not gen.usage:
            continue

        day = gen.start_time.strftime("%Y-%m-%d")
        model = gen.model or "unknown"
        input_t = getattr(gen.usage, "input", 0) or 0
        output_t = getattr(gen.usage, "output", 0) or 0

        cost = estimate_cost(model, input_t, output_t)
        daily_costs[day][model] += cost
        total_tokens[model] += (input_t + output_t)
    
    # Print report
    print("=== Daily Cost Report ===")
    grand_total = 0
    for day in sorted(daily_costs.keys()):
        day_total = sum(daily_costs[day].values())
        grand_total += day_total
        print(f"{day}: ${day_total:.4f}")
        for model, cost in daily_costs[day].items():
            print(f"   {model}: ${cost:.4f}")
    
    print(f"\nGrand Total ({days_back}d): ${grand_total:.4f}")
    
    # Budget alert
    DAILY_BUDGET = 1.00  # $1 per day threshold
    if grand_total / days_back > DAILY_BUDGET:
        print(f"WARNING: Average daily cost ${grand_total/days_back:.4f} exceeds ${DAILY_BUDGET} budget!")
    
    return daily_costs

get_daily_cost_report(days_back=7)
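
The report above looks backward. Soft and hard limits can also be enforced before each call, as in the sketch below: a soft limit triggers a warning, and a hard limit refuses the call. `BudgetGuard` and the dollar amounts are hypothetical and illustrative only. In a real application the per-call cost could come from a function like `estimate_cost` above.

```python
from collections import defaultdict


class BudgetGuard:
    """Track spend per user and report when soft or hard limits are crossed."""

    def __init__(self, soft_limit_usd: float, hard_limit_usd: float):
        if soft_limit_usd > hard_limit_usd:
            raise ValueError("soft limit must not exceed hard limit")
        self.soft = soft_limit_usd
        self.hard = hard_limit_usd
        self.spent = defaultdict(float)

    def check(self, user_id: str, next_call_cost_usd: float) -> str:
        """Return 'ok', 'warn', or 'block' for a call about to be made."""
        projected = self.spent[user_id] + next_call_cost_usd
        if projected > self.hard:
            return "block"  # refuse the call; its cost is not recorded
        self.spent[user_id] = projected
        return "warn" if projected > self.soft else "ok"


guard = BudgetGuard(soft_limit_usd=0.50, hard_limit_usd=1.00)
for cost in (0.30, 0.30, 0.30, 0.30):
    print(guard.check("student_42", cost))
```

The four calls print ok, warn, warn, block: the second call crosses the soft limit and the fourth would cross the hard limit. Checking before the call, rather than after, is what stops a retry loop from running up the bill.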

Project Milestones by Level

Beginner Project: Set up LangSmith on a simple LangChain app. Run it 20 times with different inputs. Review the traces and write down 3 observations about prompt quality or latency.

Intermediate Project: Set up Langfuse on a RAG application. Add user session IDs and question tags to every trace. Build a weekly trace review habit: each Friday, find the 3 lowest-quality traces and update the prompt to fix the root cause.

Advanced Project: Build an automated evaluation pipeline with 30+ test cases in a LangSmith dataset. Run it before every prompt deployment. Set budget alerts in Langfuse. Create a dashboard showing weekly average quality score, average latency, and daily cost trend.

Frequently Asked Questions

Do beginners need LangSmith or Langfuse?

Beginners do not need them immediately. They become important when building larger AI applications and agents.

What is observability in AI systems?

Observability means monitoring and understanding how an AI system behaves internally, including prompts, outputs, and workflow steps.

When should a team adopt advanced observability?

Adopt advanced observability when workflows affect real users, costs increase, or multiple team members ship prompt and logic updates regularly.