In the previous post, we focused on the foundations: what an LLM is (tokens, embeddings, transformer layers), and why provider abstraction matters.
In this post, we shift from understanding to building. Here, I show you how to turn Microsoft.Extensions.AI into a production-ready layer, including clear contracts, composable middleware pipelines, caching, OpenTelemetry, function (tool) invocation, streaming, evaluation, and practical patterns versus anti-patterns. By the end, you can move from “I can call a model” to “I can ship a resilient, observable, future-proof AI feature set” without tying yourself to a single SDK.
Microsoft.Extensions.AI: Architectural Overview
Before diving into specific interfaces or patterns, it’s critical to grasp where Microsoft.Extensions.AI sits within the modern .NET AI stack.

The stack is layered by responsibility:
- Your .NET application orchestrates business logic and user interaction.
- Semantic Kernel optionally provides prompt orchestration, planning, and vector memory, but delegates actual language model operations to the layers below.
- Microsoft.Extensions.AI supplies the key contracts (IChatClient, IEmbeddingGenerator) and the composable pipeline (middleware, DI, telemetry).
- Provider SDKs implement those contracts, wrapping specific model APIs (OpenAI, Azure OpenAI, Ollama, etc.).
- Models are the actual LLMs or embedding models running in the cloud or on-prem.
This architecture means all AI feature code in your app can target a single set of interfaces and patterns, fostering interoperability, testability, and maintainability across the rapidly evolving AI landscape.
IChatClient Interface
The cornerstone of Microsoft.Extensions.AI is the IChatClient interface, an abstraction for any service that exposes chat completion capabilities ranging from cloud LLMs to self-hosted model endpoints. By implementing this contract, a provider enables seamless, composable, and type-safe chat AI integration within .NET apps.
Key Methods
- Task<ChatResponse> GetResponseAsync(IEnumerable<ChatMessage> messages, ...) - Asynchronously obtains a complete model response for a set of chat messages (a minimal usage sketch follows after this list).
- IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(IEnumerable<ChatMessage> messages, ...) - Yields incremental output as it streams from the model—critical for conversational, real-time UIs.
- object? GetService(Type serviceType, object? serviceKey) / T GetService<T>(object? serviceKey) - Allows retrieval of provider-specific “side-car” services—e.g., metrics or extension APIs.
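For orientation, here is a minimal usage sketch against any IChatClient instance (assuming ChatResponse.Text aggregates the assistant's text content):

```csharp
// Send a single user message and print the aggregated response text.
static async Task AskAsync(IChatClient chatClient)
{
    var response = await chatClient.GetResponseAsync(
        new[] { new ChatMessage(ChatRole.User, "Summarize what an embedding is in one sentence.") });

    Console.WriteLine(response.Text);
}
```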
Anatomy of a Chat
A chat interaction is multi-modal and structured. Each message is a ChatMessage with:
- Role: User, Assistant, Tool, etc.
- Contents: a collection of polymorphic AIContent items (text, images, audio, function calls, etc.)
Example (multi-message, multi-modal chat):
```csharp
var history = new List<ChatMessage>
{
    new(ChatRole.User, "Describe this image, please.")
    {
        Contents = { new ImageContent(new Uri("https://example.com/myImage.jpg")) }
    },
    new(ChatRole.Assistant, "That looks like a mountain landscape.")
};
```
Supported Content Types (AIContent Hierarchy)
| Content Type | Description | Example Construction |
|---|---|---|
| TextContent | Simple text | new TextContent("Hello world.") |
| ImageContent | Images via URL or byte[] | new ImageContent(uri, "image/png") |
| AudioContent | Audio payloads | new AudioContent(uri, "audio/wav") |
| UsageContent | Token usage and cost reporting | new UsageContent(new UsageDetails { ... }) |
| FunctionCallContent | Function calls invoked by models | new FunctionCallContent("fx12", ... ) |
| FunctionResultContent | Results returned to models from function calls | new FunctionResultContent("fx12", ... ) |
ℹ️ Use Contents to enable multi-modal and structured conversations, including tool invocation and chaining. This unlocks far more potential than plain text-based chat.
Implementation Example
Below is a simplified custom chat client implementation:
```csharp
public sealed class SampleChatClient(Uri endpoint, string modelId) : IChatClient
{
    // Basic metadata (provider name, endpoint, default model id)
    public ChatClientMetadata Metadata { get; } = new("sample", endpoint, modelId);

    public async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // Implement the call to your model endpoint (CallYourModelAsync is a placeholder)
        var result = await CallYourModelAsync(messages);
        return new ChatResponse(new ChatMessage(ChatRole.Assistant, result));
    }

    public async IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // Stream tokens from your provider (StreamFromModelAPI is a placeholder)
        foreach (var token in StreamFromModelAPI(messages))
            yield return new ChatResponseUpdate(ChatRole.Assistant, token);
    }

    // IServiceProvider-style feature discovery for advanced cases
    public object? GetService(Type serviceType, object? serviceKey = null) => this;

    public void Dispose() { }
}
```
Implementing IChatClient provides immediate compatibility with all dependent libraries and orchestration systems that consume this contract, such as Semantic Kernel.
Thread Safety and Best Practices
- Thread-Safe: All IChatClient implementations should be thread-safe for concurrent operations.
- Options Mutation: Since arguments (like ChatOptions) may be mutated, never share option instances across concurrent method calls (see the sketch after this list).
- Disposal: IChatClient inherits IDisposable; dispose clients when you are finished with them, but never dispose one that is still in use.
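A simple way to honor the options-mutation rule is to build a fresh ChatOptions per call instead of sharing one instance. A minimal sketch (Temperature and MaxOutputTokens are standard ChatOptions properties; chatClient and messages are assumed from the earlier examples):

```csharp
// Construct a new ChatOptions per request rather than sharing a mutable instance
// across concurrent calls.
static ChatOptions CreateOptions() => new()
{
    Temperature = 0.2f,
    MaxOutputTokens = 400
};

var response = await chatClient.GetResponseAsync(messages, CreateOptions());
```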
IEmbeddingGenerator Interface
Embeddings are vector representations of text (or other content), which are crucial for semantic search, RAG, clustering, and other applications. In Microsoft.Extensions.AI, the IEmbeddingGenerator<TInput, TEmbedding> interface provides a robust abstraction for embedding services.
- TInput: The kind of input accepted (e.g., string).
- TEmbedding: The type of embedding produced (must inherit from Embedding).
```csharp
public interface IEmbeddingGenerator<in TInput, TEmbedding> : IDisposable
    where TEmbedding : Embedding
{
    Task<GeneratedEmbeddings<TEmbedding>> GenerateAsync(
        IEnumerable<TInput> values,
        EmbeddingGenerationOptions? options = null,
        CancellationToken cancellationToken = default);
}
```
Integrating a Provider: Sample with OllamaSharp
Suppose you have a local LLM endpoint running via Ollama (or LM Studio). Here’s how you might consume it via Microsoft.Extensions.AI:
```csharp
// OllamaApiClient (from the OllamaSharp package) implements
// Microsoft.Extensions.AI's IEmbeddingGenerator<string, Embedding<float>>.
var httpClient = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };

IEmbeddingGenerator<string, Embedding<float>> generator =
    new OllamaApiClient(httpClient) { SelectedModel = "nomic-embed-text" };

var embeddingOptions = new EmbeddingGenerationOptions { Dimensions = 384 };
var texts = new[] { "hello world", "semantic search in .NET" };
var embeddings = await generator.GenerateAsync(texts, embeddingOptions);
```
ℹ️ All IEmbeddingGenerator implementations must be thread-safe and must not mutate shared options unless guaranteed by construction. Consider this when using option pooling or per-request customization.
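To make the vectors tangible, here is a sketch that ranks candidate texts against a query by cosine similarity, reusing the generator above; it assumes the System.Numerics.Tensors package and that GenerateAsync returns an indexable GeneratedEmbeddings collection whose Embedding&lt;float&gt; items expose a Vector:

```csharp
using System.Numerics.Tensors; // TensorPrimitives.CosineSimilarity

var queryEmbedding = (await generator.GenerateAsync(new[] { "vector search" }))[0];

var candidates = new[] { "semantic search in .NET", "a recipe for pancakes" };
var candidateEmbeddings = await generator.GenerateAsync(candidates);

for (int i = 0; i < candidates.Length; i++)
{
    float score = TensorPrimitives.CosineSimilarity(
        queryEmbedding.Vector.Span,
        candidateEmbeddings[i].Vector.Span);

    Console.WriteLine($"{candidates[i]} -> {score:F3}");
}
```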
Dependency Injection Patterns in Microsoft.Extensions.AI
Like other libraries in the Microsoft.Extensions family, Microsoft.Extensions.AI is built for robust DI integration, mirroring patterns from ASP.NET Core and Entity Framework Core.
Standard Registration
```csharp
// Illustrative registrations: the exact adapter/extension names depend on the provider
// package version; the core Microsoft.Extensions.AI pattern is AddChatClient / AddEmbeddingGenerator.
services.AddChatClient(sp =>
    new OpenAIClient(openAiApiKey)
        .GetChatClient("gpt-4o")
        .AsIChatClient()); // from the Microsoft.Extensions.AI.OpenAI package

services.AddEmbeddingGenerator(sp =>
    new AzureOpenAIClient(endpoint, new AzureKeyCredential(azureKey))
        .GetEmbeddingClient("text-embed")
        .AsIEmbeddingGenerator());
```
With registration in IServiceCollection, you can request typed instances anywhere via constructor injection:
```csharp
public class MyService
{
    private readonly IChatClient _chat;

    public MyService(IChatClient chat)
    {
        _chat = chat;
    }
}
```
Key Best Practices
- Use DI everywhere: Avoid static factories or singletons outside the context that DI provides.
- Testability: By targeting abstractions, swap in mocks or alternate provider implementations for integration testing.
- Multi-client management: Register multiple clients for different providers or models under different keys for modular scaling (see the keyed-registration sketch below).
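For the multi-client scenario, standard keyed services work well. A sketch assuming two already-constructed clients (fastClient, reasoningClient) and the keyed-service support in Microsoft.Extensions.DependencyInjection 8+:

```csharp
// Register two chat clients under different keys...
services.AddKeyedSingleton<IChatClient>("fast", fastClient);
services.AddKeyedSingleton<IChatClient>("reasoning", reasoningClient);

// ...and resolve the appropriate one per consumer.
public class RoutingService(
    [FromKeyedServices("fast")] IChatClient fastChat,
    [FromKeyedServices("reasoning")] IChatClient reasoningChat)
{
    public IChatClient Pick(bool needsDeepReasoning) => needsDeepReasoning ? reasoningChat : fastChat;
}
```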
Middleware and Pipeline Customization
Composable Middleware: The UseXxx Pattern
One of the major architectural advances with Microsoft.Extensions.AI is the introduction of middleware pipelines for AI operations, strongly inspired by ASP.NET Core’s UseXxx() model.
Common Middleware Extensions
| Middleware Extension | Purpose |
|---|---|
| .UseFunctionInvocation() | Automatic tool/function invocation |
| .UseOpenTelemetry(…) | Attach tracing, metrics (OpenTelemetry) |
| .UseCaching(…) | Response caching (in-memory or distributed) |
| .UseCustomMiddleware(handler) | Plug in custom pre/post-invocation logic |
This is accomplished via the ChatClientBuilder, which wraps the base client and builds a decorated client by chaining middleware:
```csharp
IChatClient client = new ChatClientBuilder(baseClient)
    .UseFunctionInvocation()
    .UseCaching()
    .UseOpenTelemetry(loggerFactory, sourceName, configure: c => ...)
    .Build();
```
ℹ️ Middleware can be used to inject cross-cutting concerns (logging, caching, security, fallback handling) without polluting business logic or model provider code. Keep your pipeline explicit and readable for reliable, maintainable AI deployments.
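The UseCustomMiddleware(handler) entry in the table above stands in for the general extensibility point: wrap the inner client in your own DelegatingChatClient and splice it into the pipeline with the builder's Use overload. A minimal sketch (the timing logic and class name are illustrative):

```csharp
// A minimal custom middleware: times every non-streaming chat call.
public sealed class TimingChatClient(IChatClient innerClient) : DelegatingChatClient(innerClient)
{
    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        var stopwatch = System.Diagnostics.Stopwatch.StartNew();
        try
        {
            return await base.GetResponseAsync(messages, options, cancellationToken);
        }
        finally
        {
            Console.WriteLine($"Chat call took {stopwatch.ElapsedMilliseconds} ms");
        }
    }
}

// Spliced into the pipeline:
IChatClient client = new ChatClientBuilder(baseClient)
    .Use(inner => new TimingChatClient(inner))
    .UseFunctionInvocation()
    .Build();
```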
Telemetry Integration with OpenTelemetry
Telemetry is indispensable for modern AI workloads. Without it, you’re left guessing about performance, cost, and failure patterns.
Microsoft.Extensions.AI supports first-class integration with OpenTelemetry for tracing, metrics, and logging. You instrument your pipeline with .UseOpenTelemetry(), and the telemetry will track spans related to chat completions, function calls, and pipeline steps:
```csharp
var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource(sourceName)
    .AddConsoleExporter()
    .Build();

IChatClient client = new ChatClientBuilder(openaiClient)
    .UseFunctionInvocation()
    .UseOpenTelemetry(loggerFactory, sourceName, configure: c => c.EnableSensitiveData = false)
    .Build();
```
OpenTelemetry for Embeddings
Similarly, embedding generators can be instrumented:
```csharp
embeddingGeneratorBuilder.UseOpenTelemetry(loggerFactory, sourceName);
```
ℹ️ OpenTelemetry integration follows the OpenTelemetry Semantic Conventions for Generative AI systems, ensuring consistent, actionable metrics across observability platforms. One caveat: currently, trace correlation for function calling may require adjustments, and traces from nested calls may not always share the same trace ID (but the ecosystem is actively evolving).
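Traces cover request flow; the middleware also emits GenAI metrics (for example, token usage and operation duration) through a Meter. A sketch of exporting them alongside the traces, assuming the meter is registered under the same sourceName:

```csharp
// Export the metrics emitted by the OpenTelemetry chat/embedding middleware.
var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter(sourceName)
    .AddConsoleExporter()
    .Build();
```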
Caching Strategies for AI Responses
AI requests can be expensive and slow, so caching is essential for efficiency, cost control, and predictable performance.
Built-In Caching
Microsoft.Extensions.AI provides pluggable response caching at multiple layers:
- In-Memory Caching: Fastest, transient—best for single-instance deployments.
- Distributed Caching: Use Redis, Azure Storage, etc., for multi-instance or cloud scenarios.
Attach caching via middleware:
```csharp
var client = new ChatClientBuilder(baseClient)
    .UseCaching()
    .Build();
```
For production-grade systems, consider distributed caching to guarantee cache consistency and avoid cold starts across a farm of compute nodes.
ℹ️ Cache AI responses wherever possible, but always weigh up cache staleness (especially with dynamic or identity-sensitive prompts). Cache invalidation strategies are as essential as cache insertion.
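For the distributed flavor, the library provides a caching middleware backed by IDistributedCache. Below is a sketch using the in-process MemoryDistributedCache as a stand-in for Redis, and assuming the UseDistributedCache extension:

```csharp
// Any IDistributedCache implementation works: Redis, SQL Server, or (for local testing)
// the in-memory MemoryDistributedCache shown here.
IDistributedCache cache = new MemoryDistributedCache(
    Options.Create(new MemoryDistributedCacheOptions()));

IChatClient cachedClient = new ChatClientBuilder(baseClient)
    .UseDistributedCache(cache)
    .Build();
```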
Automatic Function Tool Invocation
Tool calling (or function calling) lets AI models invoke structured application code, unlocking advanced automation and orchestration scenarios. With Azure OpenAI, OpenAI, Ollama, and others now supporting this natively, the line between LLM chat and application logic is blurring.
How It Works
- Define the .NET functions that you want the AI to call.
- Use AIFunctionFactory to expose them as tool definitions with JSON schemas for parameter validation and documentation.
- Register those functions in your chat pipeline.
- The model can then issue a FunctionCallContent in a chat; the pipeline marshals the arguments, invokes your method, and returns the result to the chat context.
Example: Registering .NET Functions for Tool Calling
```csharp
// Function to expose
int GetWeatherForCity(string cityName) => WeatherService.Get(cityName);

// Expose as an AIFunction
var tool = AIFunctionFactory.Create((Func<string, int>)GetWeatherForCity);

// Add to the chat options
var chatOptions = new ChatOptions
{
    Tools = [tool]
};
```
When an LLM (with tool-calling capability) decides to call GetWeatherForCity with { "cityName": "Berlin" }, your pipeline executes the method and streams the result back to the model for further chat reasoning.
| Aspect | Value |
|---|---|
| Registration | AIFunctionFactory.Create lets you wrap delegates or MethodInfo as tool-callable methods. |
| Argument Marshalling | Automatically serializes parameters and deserializes results via JSON Schema. |
| Safety | Type-safe binding and parameter validation via JSON Schema ensure no malformed calls/payloads reach your code. |
| Tracing | OpenTelemetry spans can capture the full function call graph via middleware chain. |
ℹ️ Always use JSON Schema for parameters and implement validation logic. Validate model requests and never execute untrusted methods with arbitrary payloads guided solely by the LLM.
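Putting the pieces together, here is an end-to-end sketch: the function-invocation middleware intercepts the model's FunctionCallContent, runs GetWeatherForCity, and feeds the result back before the final answer is produced (baseClient is any tool-capable IChatClient from the earlier examples):

```csharp
IChatClient client = new ChatClientBuilder(baseClient)
    .UseFunctionInvocation() // executes tool calls automatically
    .Build();

var options = new ChatOptions
{
    Tools = [AIFunctionFactory.Create((Func<string, int>)GetWeatherForCity)]
};

var response = await client.GetResponseAsync(
    new[] { new ChatMessage(ChatRole.User, "How warm is it in Berlin right now?") },
    options);

Console.WriteLine(response.Text);
```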
Stateless vs. Stateful Client Design
The Core Distinction
- Stateless Client: No user- or session-specific context is retained. Each request is independent.
  - Pros: Simplicity, high throughput, easy scaling.
  - Cons: No personalization/context, limited conversational continuity.
- Stateful Client: Retains context across interactions, for instance conversation history or per-user session logic.
  - Pros: Personalized, contextually relevant responses; better UX for chatbots/agents.
  - Cons: Requires robust context management and potentially more complex scaling strategies.
| Feature | Stateless | Stateful |
|---|---|---|
| Context Retention | None | Maintains per-session/user context |
| Scaling | Simple (horizontal scaling is trivial) | Requires context storage or affinity |
| Best Use Cases | APIs, search, batch tasks | Chatbots, personal assistants, copilot UX |
| Risks | Inflexible UX | Stale/incorrect context if not managed well |
ℹ️ Microsoft.Extensions.AI supports both client types. Choose based on application complexity and user experience goals.
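A minimal stateful pattern keeps the conversation history on the caller's side and replays it on every turn. A sketch, assuming ChatResponse exposes the newly produced messages via its Messages property:

```csharp
// The caller owns the history; every turn replays it and appends the new messages.
var history = new List<ChatMessage>();

async Task<string> SendAsync(IChatClient chatClient, string userInput)
{
    history.Add(new ChatMessage(ChatRole.User, userInput));

    var response = await chatClient.GetResponseAsync(history);
    history.AddRange(response.Messages); // keep assistant (and tool) messages for the next turn

    return response.Text;
}
```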
Streaming Chat Response Handling
For rich interactive applications, streaming chat responses transforms user perception of speed and engagement. Instead of waiting for the full answer, yield tokens/messages as soon as available.
Example: Streaming With GetStreamingResponseAsync
```csharp
await foreach (var update in chatClient.GetStreamingResponseAsync(messages, chatOptions))
{
    Console.Write(update.Text);
}
```
- Each ChatResponseUpdate provides a fragment of the full model output.
- Perfect for real-time UIs, CLI tools, and responsive agent experiences.
ℹ️ Always stream user-facing or high-latency AI responses unless a blocking, full-batch answer is business critical. This allows for smoother user interactions and perceived responsiveness.
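If you stream to the UI but still need the complete answer afterwards (for history, caching, or evaluation), collect the updates while rendering and combine them. A sketch, assuming the ToChatResponse helper that merges ChatResponseUpdate fragments:

```csharp
var updates = new List<ChatResponseUpdate>();

await foreach (var update in chatClient.GetStreamingResponseAsync(messages, chatOptions))
{
    Console.Write(update.Text); // render incrementally
    updates.Add(update);        // keep the fragments
}

// Combine the fragments into a single ChatResponse for downstream use.
ChatResponse fullResponse = updates.ToChatResponse();
```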
Telemetry, Evaluation, and Reporting
Evaluation of AI-generated responses is a critical aspect of modern AI applications. The Microsoft.Extensions.AI.Evaluation libraries provide a set of abstractions and tools for automating the measurement of relevance, accuracy, completeness, and safety of outputs.
- Test integration: Run evaluation as part of your CI/CD or test pipelines with MSTest, xUnit, etc.
- Comprehensive metrics: Assess not just if an answer was generated, but if it meets quality, security, and user expectations.
- Caching & reporting: Evaluate, store, and report on cached AI responses for auditing and continuous improvement.
- Extensibility: Add custom evaluators specific to your domain.
ℹ️ For production environments, integrate automatic evaluation to assess both model and pipeline regression, and monitor metrics over time.
Patterns vs. Anti-Patterns: Table of Practices
| Pattern / Anti-Pattern | Description & Recommendations |
|---|---|
| Pattern: Always use IChatClient/IEmbeddingGenerator contracts | Ensures future-proofing, testability, and provider swapping. |
| Anti-Pattern: Hard-coding provider-specific clients | Locks you in, eliminates composability, breaks DI and pipeline facilities. |
| Pattern: Compose middleware via UseXxx() | Decouples cross-cutting concerns, enables pipeline customization. |
| Anti-Pattern: Embedding logic for caching/logging/telemetry inside business logic | Reduces modularity, bloats code, hampers observability. |
| Pattern: Streaming responses for real-time UIs | Maximizes UX responsiveness and engagement. |
| Anti-Pattern: Full-batch blocking calls for conversational UIs | Degrades perceived speed and limits interactivity. |
| Pattern: Parameterize all tool (function) calls with JSON Schema | Enforces security, robustness, and discoverability. |
| Anti-Pattern: Accepting unvalidated model/tool input parameters | Security vulnerability—risk of arbitrary code execution or data leaks. |
| Pattern: Opt-in to OpenTelemetry span correlation for all layers | Enables comprehensive monitoring and incident debugging. |
| Anti-Pattern: No telemetry, or using ad-hoc logging only | Leaves production black boxes and hinders troubleshooting. |
Conclusion
Microsoft.Extensions.AI provides a composable, consistent, and future-ready abstraction for how .NET developers harness AI. Its abstractions enable rapid experimentation with new models and services, and its pipeline patterns provide the guardrails for scaling features reliably.
By adopting the best practices and patterns detailed in this guide, you can confidently build AI features that are as maintainable, testable, and observable as the rest of your .NET solutions. As the generative AI landscape continues to accelerate, investing in these abstractions now will pay massive dividends for your engineering team and your users.
Go build something extraordinary now, but be sure to keep the following in mind:
- Target the Abstractions: Write all cross-component logic and orchestration against IChatClient and IEmbeddingGenerator. This secures your codebase from vendor or model lock-in.
- Compose Middleware Pipelines: Use UseXxx() middleware for telemetry, caching, and function/tool invocation. This promotes modularity and reusability.
- Instrument Everything: Activate OpenTelemetry spans for AI operations. Invest some time mapping span structure to your incident response dashboards.
- Go Stateless First, Stateful When Needed: Most API-centric use cases work best with stateless clients. Opt into stateful design only for chatbots, copilots, or when conversation context is essential.
- Prefer Streaming APIs for User-Facing Scenarios: Yield incremental updates wherever supported. Today’s users expect real-time interactions.
- Integrate Tool Calling Carefully: Harness function invocation, but always rigorously lock down JSON schemas and parameter validation.
- Cache Strategically: Use caching to offload model cost and latency. But, test cache key design and eviction paths thoroughly.
- Test and Evaluate Continuously: Adopt the evaluation libraries for quality, safety, and truthfulness checks — especially important if release cycles are tight or regulated.

