The world of AI is buzzing with a new paradigm: agentic AI. Instead of simply responding to a single prompt, these AI systems can reason, plan, use tools, and even critique their own work to accomplish complex, multi-step tasks. They represent a significant leap from the non-agentic models we’ve grown accustomed to.

As this technology matures, building effective and reliable agents has become a critical skill. AI luminary Andrew Ng, through his recently published Coursera courses and his YouTube talks, has laid out a clear, practical framework for developing these intelligent systems. This playbook distills his key principles into actionable best practices for anyone looking to build the next generation of AI.

1. Design is an Art: Master the Workflow

The first and most crucial lesson is that there’s no single “correct” way to design an agent. Crafting the agent’s workflow is more of an art than a science, but it’s guided by a core principle: be concrete.

Instead of feeding a model a massive, complex prompt and hoping for the best, break the task down into a logical sequence of smaller, well-defined steps. This “chain of thought” or planning phase is where the magic happens. A typical agentic pattern involves:

  • Planning: The agent outlines the steps it needs to take to achieve a goal.

  • Action: The agent executes a step, which might involve writing code, searching the web, or using a specific tool.

  • Observation: The agent analyzes the result of its action.

  • Reflection: The agent critiques its own progress and refines the plan.

By defining a clear goal for each step, you create a more robust and predictable system.
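
To make the pattern concrete, here is a minimal sketch of that loop in Python. The `llm()` helper, the prompt wording, and the tool-call format are hypothetical stand-ins for whatever model API and instructions you actually use; the point is the structure, not the specifics.

```python
# A minimal sketch of the plan / act / observe / reflect loop. `llm()` is a
# hypothetical helper that wraps your chat-completion API and returns the
# model's text; tools are plain Python callables keyed by name.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat-completion API")

def run_agent(goal: str, tools: dict, max_steps: int = 10) -> str:
    # Planning: outline the steps before doing anything.
    plan = llm(f"Break this goal into small, concrete steps:\n{goal}")
    history = [f"GOAL: {goal}", f"PLAN:\n{plan}"]

    for _ in range(max_steps):
        # Action: decide the next tool call (or declare the task done).
        decision = llm(
            "Given the history below, reply with either "
            "'CALL <tool>: <input>' or 'DONE: <final answer>'.\n\n"
            + "\n".join(history)
        )
        if decision.startswith("DONE:"):
            return decision.removeprefix("DONE:").strip()

        # Observation: run the tool and record what happened.
        name, _, tool_input = decision.removeprefix("CALL").partition(":")
        result = tools[name.strip()](tool_input.strip())
        history.append(f"ACTION: {decision}\nRESULT: {result}")

        # Reflection: critique progress and revise the remaining plan.
        plan = llm("Critique the progress so far and revise the remaining steps:\n"
                   + "\n".join(history))
        history.append(f"REVISED PLAN:\n{plan}")

    return "Stopped after reaching the step limit."
```

Each step has a single, well-defined job, which is what makes the overall behavior easier to inspect and debug later.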

2. Reflection: The Engine of Improvement

If there’s one secret weapon in the agentic AI toolkit, it’s reflection. The performance jump from a model like GPT-3.5 to GPT-4 was enormous and required vast resources and training data. However, as Ng points out, you can achieve staggering performance gains from a single model simply by implementing a reflection step.

Here’s how it works:

  • The agent completes a task and generates an initial output.

  • The agent (or a separate “critic” agent) is then prompted to evaluate that output against a set of criteria. Did it follow all instructions? Is the code efficient?

  • The agent refines its initial work based on this self-generated critique.

This simple loop of “do, then critique and redo” forces the model to catch its own errors and iteratively improve its output, leading to a much higher-quality result without needing a more powerful base model.
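
As a rough sketch, the loop can be as simple as the following, reusing the hypothetical `llm()` helper from the earlier sketch; a separate "critic" model could stand in for the critique call if you want the feedback to come from a different vantage point.

```python
# "Do, then critique and redo": generate a draft, critique it, rewrite it.
# Assumes the hypothetical llm() helper from the earlier sketch.

def generate_with_reflection(task: str, rounds: int = 2) -> str:
    draft = llm(f"Complete this task:\n{task}")
    for _ in range(rounds):
        critique = llm(
            "Critique the draft below against the task. Did it follow every "
            "instruction? Is the work correct and efficient? List concrete problems.\n\n"
            f"TASK:\n{task}\n\nDRAFT:\n{draft}"
        )
        draft = llm(
            "Rewrite the draft so it fixes every problem in the critique.\n\n"
            f"TASK:\n{task}\n\nDRAFT:\n{draft}\n\nCRITIQUE:\n{critique}"
        )
    return draft
```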

3. Tool Use: Give Your Agent Superpowers

An LLM’s knowledge is limited to its training data. To perform real-world tasks, it needs tools. Agentic frameworks allow models to access custom tools that dramatically expand their capabilities, such as:

  • Reading and writing to a calendar.

  • Searching a private knowledge base.

  • Converting PDFs to text.

  • Executing code (the most powerful tool of all).

However, with great power comes great responsibility. It is absolutely critical to run agents with powerful tools—especially those that can interact with file systems or external APIs—in a secure sandbox environment. This prevents catastrophic accidents, like an agent deleting an entire directory while trying to complete a simple task.
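
As an illustration of the sandboxing concern, here is a minimal sketch of a code-execution tool that at least confines model-written Python to a separate process, a scratch directory, and a timeout. This is not a substitute for a real sandbox (containers, VMs, or a dedicated execution service); the function name is illustrative.

```python
import subprocess
import sys
import tempfile

def run_python_tool(code: str, timeout_s: int = 10) -> str:
    """Execute model-written Python in a separate process with a timeout and a
    throwaway working directory. A real deployment should add a container or
    other OS-level sandbox on top of this."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                cwd=workdir,            # keep file writes inside a scratch dir
                capture_output=True,
                text=True,
                timeout=timeout_s,      # stop runaway or looping code
            )
        except subprocess.TimeoutExpired:
            return "ERROR: code timed out"
    return proc.stdout if proc.returncode == 0 else f"ERROR:\n{proc.stderr}"

# The agent sees tools as a simple name -> callable mapping.
tools = {"run_python": run_python_tool}
```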

4. Evaluation: The Hardest, Most Important Part

How do you know if your agent is actually any good? Evaluation is notoriously difficult but is the bedrock of a successful project. Ng proposes a two-axis framework to think about this:

Axis 1: The Judge

  • Evaluation with Code (Objective): For tasks with a clear right or wrong answer (e.g., solving a math problem, writing code that must compile), you can write scripts to automatically verify the output.

  • LLM as Judge (Subjective): For more nuanced tasks (e.g., judging the quality of a marketing email), you can use another LLM to act as a judge. The key here is to provide the judge LLM with a concrete rubric. Don’t just ask, “Is this good?” Ask, “On a scale of 1-5, how well does this email meet the following criteria: a) clear call-to-action, b) professional tone, c) adherence to brand voice?”
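
As a rough illustration, a rubric-driven judge call might look like the sketch below, again using the hypothetical `llm()` wrapper from the earlier sketches; the criteria mirror the email example above, and the JSON output format is simply one convenient convention.

```python
import json

# Rubric-driven LLM-as-judge. Assumes the hypothetical llm() helper.
RUBRIC = """Score the email below on a 1-5 scale for each criterion:
a) clear call-to-action
b) professional tone
c) adherence to brand voice
Reply with JSON only, e.g. {"call_to_action": 4, "tone": 5, "brand_voice": 3}."""

def judge_email(email: str) -> dict:
    raw = llm(f"{RUBRIC}\n\nEMAIL:\n{email}")
    return json.loads(raw)   # in practice, validate and retry on malformed JSON
```

Returning structured scores per criterion, rather than a single "good/bad" verdict, makes it much easier to see which dimension regresses as you iterate.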

Axis 2: The Ground Truth

  • Per-Example Ground Truth: You have a dataset where you know the “correct” answer for every example.

  • No Per-Example Ground Truth: You don’t have perfect answers, so you rely on rubrics, principles, or comparisons.

The most effective approach is to start by manually inspecting the agent’s outputs. Create a small dataset of examples (even just 10-20) to track your progress and guide your improvements.
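
One lightweight way to turn those manual inspections into a repeatable check is a tiny script like the sketch below; `agent()`, the examples, and the substring check are placeholders for your own entry point, data, and verification logic.

```python
# A tiny evaluation set with per-example ground truth and a code-based check.
# `agent(task) -> str` stands in for whatever entry point your agent exposes.

EVAL_SET = [
    {"task": "What is 17 * 24?", "expected": "408"},
    {"task": "Convert 'hello' to uppercase.", "expected": "HELLO"},
    # ... grow this to 10-20 examples drawn from real failures
]

def run_evals(agent) -> float:
    passed = 0
    for example in EVAL_SET:
        output = agent(example["task"])
        ok = example["expected"] in output      # objective, scriptable check
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {example['task']}")
    return passed / len(EVAL_SET)
```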

The Iterative Development Loop

Building an agent is not a one-shot process. The key is to embrace a rapid, iterative loop:

  • Build: Get a simple, end-to-end version of your agent working as quickly as possible.

  • Analyze: This is the most critical phase.

    • Examine Traces: Look at the agent’s step-by-step reasoning. Where did it go wrong?

    • Error Analysis: Categorize the failures. Is the agent failing at the planning stage? Is a specific tool unreliable?

    • Compute Metrics: Use your evaluation sets to get a quantitative measure of performance.

  • Improve: Based on your analysis, prioritize what to fix. Your options include:

    • Prompt Engineering: Tweak the instructions for a specific step.

    • Swap the Model: Different models have different strengths. Some are better at following complex instructions, while others are more creative. Design your system to make swapping models easy (see the sketch after this list).

    • Fine-Tune: For specialized tasks, you may need to fine-tune a model on your specific data.
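
One way to keep that model swap cheap is to route every call through a single seam, as in the minimal sketch below; `call_model` and the model names are placeholders, not real identifiers.

```python
# Keep the model choice behind one seam so swapping is a one-line change.
# `call_model` is a placeholder for your provider's chat-completion call.

def call_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider of choice")

# Map each workflow step to a model; the names below are placeholders.
MODEL_FOR_STEP = {
    "plan":     "strong-instruction-following-model",
    "critique": "cheap-fast-model",
    "code":     "code-specialized-model",
}

def llm_for(step: str, prompt: str) -> str:
    return call_model(MODEL_FOR_STEP[step], prompt)
```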

A final piece of advice from Ng: don’t worry about latency and cost at the beginning. Getting the quality right is the hardest part. Speeding it up and making it cheaper is a good problem to have later on. By focusing on workflow, reflection, and rigorous evaluation, you can build truly intelligent agents that go far beyond the capabilities of simple chatbots.