How to Correctly Build AI Features in .NET

In the previous post, we focused on the foundations: what an LLM is (tokens, embeddings, transformer layers), and why provider abstraction matters.

In this post, we shift from understanding to building. I show you how to turn Microsoft.Extensions.AI into a production-ready layer: clear contracts, composable middleware pipelines, caching, OpenTelemetry, function (tool) invocation, streaming, evaluation, and practical patterns versus anti-patterns. By the end, you can move from “I can call a model” to “I can ship a resilient, observable, future-proof AI feature set” without tying yourself to a single SDK.

Microsoft.Extensions.AI: Architectural Overview

Before diving into specific interfaces or patterns, it’s critical to grasp where Microsoft.Extensions.AI sits within the modern .NET AI stack.

The stack is layered by responsibility:

  • Your .NET application orchestrates business logic and user interaction.
  • Semantic Kernel optionally provides prompt orchestration, planning, and vector memory, but delegates actual language model operations to the layers below.
  • Microsoft.Extensions.AI supplies the key contracts (IChatClient, IEmbeddingGenerator) and the composable pipeline (middleware, DI, telemetry).
  • Provider SDKs implement those contracts, wrapping specific model APIs (OpenAI, Azure OpenAI, Ollama, etc).
  • Models are the actual LLMs or embedding models running in the cloud or on-prem.

This architecture means all AI feature code in your app can target a single set of interfaces and patterns, fostering interoperability, testability, and maintainability across the rapidly evolving AI landscape.

IChatClient Interface

The cornerstone of Microsoft.Extensions.AI is the IChatClient interface, an abstraction for any service that exposes chat completion capabilities ranging from cloud LLMs to self-hosted model endpoints. By implementing this contract, a provider enables seamless, composable, and type-safe chat AI integration within .NET apps.

Key Methods

  • Task<ChatResponse> GetResponseAsync(IEnumerable<ChatMessage> messages, ...)
    • Asynchronously obtains a complete model response for a set of chat messages (see the usage sketch after this list).
  • IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(IEnumerable<ChatMessage> messages, ...)
    • Yields incremental output as it streams from the model—critical for conversational, real-time UIs.
  • object GetService(Type serviceType, object serviceKey) / T GetService<T>(object serviceKey)
    • Allows retrieval of provider-specific “side-car” services—e.g., metrics, extension APIs.
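
To make these members concrete, here is a minimal usage sketch; chatClient is assumed to be any IChatClient instance obtained from DI or a provider package:

C#
// Ask for a complete (non-streaming) response
var response = await chatClient.GetResponseAsync(
    new[] { new ChatMessage(ChatRole.User, "Summarize the latest release notes.") });

// ChatResponse exposes the assistant's combined text output
Console.WriteLine(response.Text);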

Anatomy of a Chat

A chat interaction is multi-modal and structured. Each message is a ChatMessage with:

  • Role: (User, Assistant, Tool, etc.)
  • Contents: a collection of polymorphic AIContent (text, images, audio, function calls, etc.)

Example (multi-message, multi-modal chat):

C#
var history = new List<ChatMessage>
{
    new(ChatRole.User,
    [
        new TextContent("Describe this image, please."),
        new ImageContent(new Uri("https://example.com/myImage.jpg"))
    ]),
    new(ChatRole.Assistant, "That looks like a mountain landscape.")
};

Supported Content Types (AIContent Hierarchy)

Content Type          | Description                                    | Example Construction
TextContent           | Simple text                                    | new TextContent("Hello world.")
ImageContent          | Images via URL or byte[]                       | new ImageContent(uri, "image/png")
AudioContent          | Audio payloads                                 | new AudioContent(uri, "audio/wav")
UsageContent          | Token usage and cost reporting                 | new UsageContent(new UsageDetails { ... })
FunctionCallContent   | Function calls invoked by models               | new FunctionCallContent("fx12", ...)
FunctionResultContent | Results returned to models from function calls | new FunctionResultContent("fx12", ...)

ℹ️ Use Contents to enable multi-modal and structured conversations, including tool invocation and chaining. This unlocks far more potential than plain text-based chat.
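
For instance, the polymorphic contents of a response can be handled with pattern matching; this sketch assumes response is a ChatResponse returned by GetResponseAsync:

C#
// Inspect every content item across the response's messages
foreach (var content in response.Messages.SelectMany(m => m.Contents))
{
    switch (content)
    {
        case TextContent text:
            Console.WriteLine($"Text: {text.Text}");
            break;
        case UsageContent usage:
            Console.WriteLine($"Total tokens: {usage.Details.TotalTokenCount}");
            break;
        case FunctionCallContent call:
            Console.WriteLine($"Tool requested: {call.Name}");
            break;
    }
}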

Implementation Example

Below is a simplified custom chat client implementation:

C#
public sealed class SampleChatClient(Uri endpoint, string modelId) : IChatClient
{
    // Basic metadata describing this provider, its endpoint, and its default model
    public ChatClientMetadata Metadata { get; } = new("sample", endpoint, modelId);

    public async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // Implement the call to your model endpoint
        var result = await CallYourModelAsync(messages, cancellationToken);
        return new ChatResponse(new ChatMessage(ChatRole.Assistant, result));
    }

    public async IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        await foreach (var token in StreamFromModelApiAsync(messages, cancellationToken))
            yield return new ChatResponseUpdate(ChatRole.Assistant, token);
    }

    // IServiceProvider-style feature lookup for advanced cases
    public object? GetService(Type serviceType, object? serviceKey = null) =>
        serviceKey is null && serviceType.IsInstanceOfType(this) ? this : null;

    public void Dispose() { }
}

Implementing IChatClient provides immediate compatibility with all dependent libraries and orchestration systems that consume this contract, such as Semantic Kernel.

Thread Safety and Best Practices

  • Thread-Safe: All IChatClient implementations should be thread-safe for concurrent operations.
  • Options Mutation: Because arguments (like ChatOptions) may be mutated during processing, never share option instances across concurrent calls; build a fresh instance per request (see the sketch below).
  • Disposal: IChatClient inherits IDisposable, but never dispose an instance that other components may still be using; let the DI container own its lifetime.
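
As a small illustration of the options-mutation rule, build a fresh ChatOptions per request instead of sharing one mutable instance (the values below are illustrative):

C#
// A new ChatOptions instance per call; never reuse a shared, mutable one concurrently
ChatOptions CreateOptions() => new()
{
    Temperature = 0.2f,
    MaxOutputTokens = 500
};

var response = await chatClient.GetResponseAsync(messages, CreateOptions());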

IEmbeddingGenerator Interface

Embeddings are vector representations of text (or other content), which are crucial for semantic search, RAG, clustering, and other applications. In Microsoft.Extensions.AI, the IEmbeddingGenerator<TInput, TEmbedding> interface provides a robust abstraction for embedding services.

  • TInput: The kind of input accepted (e.g., string).
  • TEmbedding: The type of embedding produced (must inherit from Embedding).
C#
public interface IEmbeddingGenerator<in TInput, TEmbedding> : IDisposable
    where TEmbedding : Embedding
{
    Task<GeneratedEmbeddings<TEmbedding>> GenerateAsync(
        IEnumerable<TInput> values,
        EmbeddingGenerationOptions? options = null,
        CancellationToken cancellationToken = default);
}

Integrating a Provider: Sample with OllamaSharp

Suppose you have a local LLM endpoint running via Ollama (or LM Studio). Here’s how you might consume it via Microsoft.Extensions.AI:

C#
var httpClient = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
var ollamaClient = new OllamaApiClient(httpClient);
ollamaClient.SelectedModel = "nomic-embed-text";

// OllamaApiClient implements IEmbeddingGenerator<string, Embedding<float>>,
// so it can be consumed through the Microsoft.Extensions.AI abstraction
IEmbeddingGenerator<string, Embedding<float>> generator = ollamaClient;

var embeddingOptions = new EmbeddingGenerationOptions { Dimensions = 384 };
var texts = new[] { "hello world", "semantic search in .NET" };

var embeddings = await generator.GenerateAsync(texts, embeddingOptions);

ℹ️ All IEmbeddingGenerator implementations must be thread-safe and must not mutate shared options unless guaranteed by construction. Consider this when using option pooling or per-request customization.

Dependency Injection Patterns in Microsoft.Extensions.AI

Like other libraries in the Microsoft.Extensions family, Microsoft.Extensions.AI is built for robust DI integration, mirroring patterns from ASP.NET Core and Entity Framework Core.

Standard Registration

C#
// Register an IChatClient from the OpenAI provider package
// (exact helper/extension names vary by provider package and version)
services.AddChatClient(
    new OpenAIClient(openAiApiKey).GetChatClient("gpt-4o").AsIChatClient());

// Register an IEmbeddingGenerator<string, Embedding<float>> backed by Azure OpenAI
services.AddEmbeddingGenerator(
    new AzureOpenAIClient(endpoint, new AzureKeyCredential(azureKey))
        .GetEmbeddingClient("text-embed")
        .AsIEmbeddingGenerator());

With registration in IServiceCollection, you can request typed instances anywhere via constructor injection:

C#
public class MyService
{
    private readonly IChatClient _chat;

    public MyService(IChatClient chat)
    {
        _chat = chat;
    }
}

Key Best Practices

  • Use DI everywhere: Avoid static factories or singletons created outside the container’s control.
  • Testability: By targeting the abstractions, you can swap in mocks or alternate provider implementations for integration testing.
  • Multi-client management: Register multiple clients for different providers or models under different service keys for modular scaling (see the sketch below).
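
A rough sketch of keyed registrations, assuming the AddKeyedChatClient extension from Microsoft.Extensions.AI; the keys and the BuildFastClient/BuildQualityClient helpers are illustrative:

C#
// Register two differently tuned clients under distinct service keys
services.AddKeyedChatClient("fast", sp => BuildFastClient(sp));       // e.g., a small local model
services.AddKeyedChatClient("quality", sp => BuildQualityClient(sp)); // e.g., a larger hosted model

// Resolve a specific client by key via constructor injection
public class SummaryService([FromKeyedServices("quality")] IChatClient chat)
{
    private readonly IChatClient _chat = chat;
}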

Middleware and Pipeline Customization

Composable Middleware: The UseXxx Pattern

One of the major architectural advances with Microsoft.Extensions.AI is the introduction of middleware pipelines for AI operations, strongly inspired by ASP.NET Core’s UseXxx() model.

Common Middleware Extensions

Middleware Extension     | Purpose
.UseFunctionInvocation() | Automatic tool/function invocation
.UseOpenTelemetry(…)     | Attach tracing and metrics (OpenTelemetry)
.UseDistributedCache(…)  | Response caching via IDistributedCache (in-memory or distributed)
.Use(…)                  | Plug in custom pre/post-invocation logic

This is accomplished via the ChatClientBuilder, which wraps the base client and builds a decorated client by chaining middleware:

C#
IChatClient client = new ChatClientBuilder(baseClient)
    .UseFunctionInvocation()
    .UseDistributedCache(cache) // any IDistributedCache implementation
    .UseOpenTelemetry(loggerFactory, sourceName, configure: c => c.EnableSensitiveData = false)
    .Build();

ℹ️ Middleware can be used to inject cross-cutting concerns (logging, caching, security, fallback handling) without polluting business logic or model provider code. Keep your pipeline explicit and readable for reliable, maintainable AI deployments.
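
Custom middleware follows the same decorator idea. Here is a minimal sketch (names are illustrative) of a timing wrapper built on the DelegatingChatClient base class and plugged in through the builder’s generic Use(...) hook:

C#
public sealed class TimingChatClient(IChatClient innerClient, ILogger logger)
    : DelegatingChatClient(innerClient)
{
    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        var stopwatch = Stopwatch.StartNew();
        var response = await base.GetResponseAsync(messages, options, cancellationToken);
        logger.LogInformation("Chat completed in {Elapsed} ms", stopwatch.ElapsedMilliseconds);
        return response;
    }
}

// Registration in the pipeline:
IChatClient client = new ChatClientBuilder(baseClient)
    .Use(inner => new TimingChatClient(inner, logger))
    .Build();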

Telemetry Integration with OpenTelemetry

Telemetry is indispensable for modern AI workloads. Without it, you’re left guessing about performance, cost, and failure patterns.

Microsoft.Extensions.AI supports first-class integration with OpenTelemetry for tracing, metrics, and logging. You instrument your pipeline with .UseOpenTelemetry(), and the telemetry will track spans related to chat completions, function calls, and pipeline steps:

C#
var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource(sourceName)
    .AddConsoleExporter()
    .Build();

IChatClient client = new ChatClientBuilder(openaiClient)
    .UseFunctionInvocation()
    .UseOpenTelemetry(loggerFactory, sourceName, configure: c => c.EnableSensitiveData = false)
    .Build();

OpenTelemetry for Embeddings

Similarly, embedding generators can be instrumented:

C#
embeddingGeneratorBuilder.UseOpenTelemetry(loggerFactory, sourceName);

ℹ️ OpenTelemetry integration follows the OpenTelemetry Semantic Conventions for Generative AI systems, ensuring consistent, actionable metrics across observability platforms. One caveat: currently, trace correlation for function calling may require adjustments and traces from nested calls may not always share the same trace ID (but the ecosystem is actively evolving).

Caching Strategies for AI Responses

AI requests can be expensive and slow, so caching is essential for efficiency, cost control, and predictable performance.

Built-In Caching

Microsoft.Extensions.AI provides pluggable response caching at multiple layers:

  • In-Memory Caching: Fastest, transient—best for single-instance deployments.
  • Distributed Caching: Use Redis, Azure Storage, etc., for multi-instance or cloud scenarios.

Attach caching via middleware:

C#
// An in-memory IDistributedCache suits single-instance deployments;
// swap in Redis or another distributed implementation for production farms
IDistributedCache cache = new MemoryDistributedCache(
    Options.Create(new MemoryDistributedCacheOptions()));

var client = new ChatClientBuilder(baseClient)
    .UseDistributedCache(cache)
    .Build();

For production-grade systems, consider distributed caching to guarantee cache consistency and avoid cold starts across a farm of computation nodes.
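
As a sketch of that setup, assuming the Microsoft.Extensions.Caching.StackExchangeRedis package and a reachable Redis instance, the shared IDistributedCache is registered once and picked up by the caching middleware:

C#
// Redis-backed IDistributedCache shared by every instance of the app
services.AddStackExchangeRedisCache(options =>
    options.Configuration = "localhost:6379");

// With DI-based registration, UseDistributedCache() resolves the IDistributedCache from the container
services.AddChatClient(sp => baseClient)
    .UseDistributedCache();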

ℹ️ Cache AI responses wherever possible, but always weigh up cache staleness (especially with dynamic or identity-sensitive prompts). Cache invalidation strategies are as essential as cache insertion.

Automatic Function Tool Invocation

Tool calling (or function calling) lets AI models invoke structured application code, unlocking advanced automation and orchestration scenarios. With Azure OpenAI, OpenAI, Ollama, and others now supporting it natively, the line between LLM chat and application logic is blurring.

How It Works

  • Define functions in .NET that you want the AI to call.
  • Use AIFunctionFactory to expose those as tool definitions with JSON schemas for parameter validation and documentation.
  • Register those functions in your chat pipeline.
  • The model can then issue a FunctionCallContent in the chat; the function-invocation middleware marshals the arguments, invokes your method, and returns the result to the chat context.

Example: Registering .NET Functions for Tool Calling

C#
// Function to expose to the model (WeatherService is an application service)
int GetWeatherForCity(string cityName) => WeatherService.Get(cityName);

// Expose it as an AIFunction with a generated JSON schema
var tool = AIFunctionFactory.Create((Func<string, int>)GetWeatherForCity);

// Add it to the chat options (pair with .UseFunctionInvocation() in the client pipeline)
var chatOptions = new ChatOptions
{
    Tools = [tool]
};

When an LLM (with tool calling capability) decides to call GetWeatherForCity with { "cityName": "Berlin" }, your pipeline executes the method and streams the result back to the model for further chat reasoning.
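
Putting the pieces together, here is a sketch of the end-to-end flow; baseClient stands for any tool-capable IChatClient, and chatOptions is the instance built above:

C#
// The function-invocation middleware executes FunctionCallContent requests automatically
IChatClient client = new ChatClientBuilder(baseClient)
    .UseFunctionInvocation()
    .Build();

var answer = await client.GetResponseAsync(
    "What is the temperature in Berlin right now?", chatOptions);

Console.WriteLine(answer.Text);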

Aspect               | Details
Registration         | AIFunctionFactory.Create lets you wrap delegates or MethodInfo as tool-callable methods.
Argument Marshalling | Automatically serializes parameters and deserializes results via JSON Schema.
Safety               | Type-safe binding and parameter validation via JSON Schema ensure no malformed calls or payloads reach your code.
Tracing              | OpenTelemetry spans can capture the full function call graph via the middleware chain.

ℹ️ Always use JSON Schema for parameters and implement validation logic. Validate model requests, and never execute untrusted methods with arbitrary payloads guided solely by the LLM.

Stateless vs. Stateful Client Design

The Core Distinction

  • Stateless Client: No user or session-specific context is retained. Each request is independent.
    • Pros: Simplicity, high-throughput, easy scaling.
    • Cons: No personalization/context, limited conversational continuity.
  • Stateful Client: Retains context across interactions, for instance, conversation history, per-user session logic.
    • Pros: Personalized, contextually relevant responses; better UX for chatbots/agents.
    • Cons: Requires robust context management and potentially more complex scaling strategies.

Feature           | Stateless                              | Stateful
Context Retention | None                                   | Maintains per-session/user context
Scaling           | Simple (horizontal scaling is trivial) | Requires context storage or affinity
Best Use Cases    | APIs, search, batch tasks              | Chatbots, personal assistants, copilot UX
Risks             | Inflexible UX                          | Stale/incorrect context if not managed well

ℹ️ Microsoft.Extensions.AI supports both client types. Choose based on application complexity and user experience goals.
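
A minimal sketch of the stateful approach keeps the conversation history in the caller and appends each response before the next turn (chatClient is any IChatClient):

C#
List<ChatMessage> history = [];

history.Add(new ChatMessage(ChatRole.User, "My name is Ada."));
var reply = await chatClient.GetResponseAsync(history);
history.AddRange(reply.Messages);   // retain the assistant's output as context

history.Add(new ChatMessage(ChatRole.User, "What is my name?"));
reply = await chatClient.GetResponseAsync(history);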

Streaming Chat Response Handling

For rich interactive applications, streaming chat responses transforms user perception of speed and engagement. Instead of waiting for the full answer, yield tokens/messages as soon as available.

Example: Streaming With GetStreamingResponseAsync

C#
await foreach (var update in chatClient.GetStreamingResponseAsync(messages, chatOptions))
{
    Console.Write(update.Text);
}

  • Each ChatResponseUpdate provides a fragment of the full model output.
  • Perfect for real-time UIs, CLI tools, and responsive agent experiences.

ℹ️ Always stream user-facing or high-latency AI responses unless a blocking, full-batch answer is business critical. This allows for smoother user interactions and perceived responsiveness.
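
When the complete response is also needed after streaming (for logging, caching, or evaluation), the updates can be collected and combined; this sketch assumes the ToChatResponse() helper from Microsoft.Extensions.AI:

C#
List<ChatResponseUpdate> updates = [];

await foreach (var update in chatClient.GetStreamingResponseAsync(messages, chatOptions))
{
    Console.Write(update.Text);   // render incrementally
    updates.Add(update);          // keep for later
}

ChatResponse full = updates.ToChatResponse();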

Telemetry, Evaluation, and Reporting

Evaluation of AI-generated responses is a critical aspect of modern AI applications. The Microsoft.Extensions.AI.Evaluation libraries provide a set of abstractions and tools for automating the measurement of relevance, accuracy, completeness, and safety of outputs.

  • Test integration: Run evaluation as part of your CI/CD or test pipelines with MSTest, xUnit, etc.
  • Comprehensive metrics: Assess not just if an answer was generated, but if it meets quality, security, and user expectations.
  • Caching & reporting: Evaluate, store, and report on cached AI responses for auditing and continuous improvement.
  • Extensibility: Add custom evaluators specific to your domain.

ℹ️ For production environments, integrate automatic evaluation to assess both model and pipeline regression, and monitor metrics over time.
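
As a rough, version-sensitive sketch, assuming the CoherenceEvaluator from Microsoft.Extensions.AI.Evaluation.Quality and an IChatClient (evaluationChatClient) acting as the scoring model; exact signatures differ across preview releases:

C#
// Score an existing conversation turn with an LLM-based evaluator
IEvaluator evaluator = new CoherenceEvaluator();
var chatConfiguration = new ChatConfiguration(evaluationChatClient);

EvaluationResult result = await evaluator.EvaluateAsync(
    messages,            // the conversation that produced the answer
    response,            // the response under evaluation
    chatConfiguration);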

Patterns vs. Anti-Patterns: Table of Practices

Pattern / Anti-Pattern | Description & Recommendations
Pattern: Always use the IChatClient/IEmbeddingGenerator contracts | Ensures future-proofing, testability, and provider swapping.
Anti-Pattern: Hard-coding provider-specific clients | Locks you in, eliminates composability, breaks DI and pipeline facilities.
Pattern: Compose middleware via UseXxx() | Decouples cross-cutting concerns and enables pipeline customization.
Anti-Pattern: Embedding caching/logging/telemetry logic inside business logic | Reduces modularity, bloats code, hampers observability.
Pattern: Streaming responses for real-time UIs | Maximizes UX responsiveness and engagement.
Anti-Pattern: Full-batch blocking calls for conversational UIs | Degrades perceived speed and limits interactivity.
Pattern: Parameterize all tool (function) calls with JSON Schema | Enforces security, robustness, and discoverability.
Anti-Pattern: Accepting unvalidated model/tool input parameters | Security vulnerability: risk of arbitrary code execution or data leaks.
Pattern: Opt in to OpenTelemetry span correlation for all layers | Enables comprehensive monitoring and incident debugging.
Anti-Pattern: No telemetry, or ad-hoc logging only | Leaves production a black box and hinders troubleshooting.

Conclusion

Microsoft.Extensions.AI provides a composable, consistent, and future-ready abstraction for how .NET developers harness AI. Its contracts enable rapid experimentation with new models and services, and its pipeline patterns provide the guardrails for scaling features reliably.

By adopting the best practices and patterns detailed in this guide, you can confidently build AI features that are as maintainable, testable, and observable as the rest of your .NET solutions. As the generative AI landscape continues to accelerate, investing in these abstractions now will pay massive dividends for your engineering team and your users.

Go build something extraordinary now, but be sure to keep the following in mind:

  1. Target the Abstractions: Write all cross-component logic and orchestration against IChatClient and IEmbeddingGenerator. This secures your codebase from vendor or model lock-in.
  2. Compose Middleware Pipelines: Use UseXxx() middleware for telemetry, caching, and function/tool invocation. This promotes modularity and reusability.
  3. Instrument Everything: Activate OpenTelemetry spans for AI operations. Invest some time mapping span structure to your incident response dashboards.
  4. Go Stateless First, Stateful When Needed: Most API-centric use cases work best with stateless clients. Opt into stateful design only for chatbots, copilots, or when conversation context is essential.
  5. Prefer Streaming APIs for User-Facing Scenarios: Yield incremental updates wherever supported. Today’s users expect real-time interactions.
  6. Integrate Tool Calling Carefully: Harness function invocation, but always rigorously lock down JSON schemas and parameter validation.
  7. Cache Strategically: Use caching to offload model cost and latency. But, test cache key design and eviction paths thoroughly.
  8. Test and Evaluate Continuously: Adopt the evaluation libraries for quality, safety, and truthfulness checks — especially important if release cycles are tight or regulated.
