A practical guide to governed agent learning in regulated environments.
A question I often hear from financial-services customers is:
“How do AI agents improve over time through interactions with customers, users, and feedback data?”
The short answer is: not by secretly retraining themselves on live customer conversations.
In a regulated financial-services environment, the foundation model remains fixed during inference. What improves over time is the agent system around the model: the context it receives, the memory it is allowed to retain, the knowledge it retrieves, the skills and tools it can invoke, the evaluations used to measure it, and the governed release process used to update it.
That distinction matters. It is what makes agent improvement governable, auditable, reversible, and acceptable to Model Risk Management, Operational Risk, Compliance, Privacy, Legal, and regional regulators.
The practical answer is the governed improvement flywheel.
The Governed Improvement Flywheel
A regulated financial-services firm should frame agent improvement as a controlled release cycle, not as autonomous self-learning.
The agent may interact with users every day. It may collect telemetry, user feedback, tool traces, retrieved documents, human-review decisions, and outcome data. But that data should not silently update the model’s weights in production.
Instead, improvement should move through six governed stages.
Figure: The six governed stages of agent improvement. The model stays fixed; the system around it is what improves.
Stage 1: Interact
The agent serves a customer, advisor, analyst, operations colleague, claims handler, fraud investigator, compliance user, or relationship manager.
At this stage, the agent should only use the data, memory, knowledge, tools, and skills that the user and the agent are permitted to access.
For example, an advisor copilot should not have unrestricted access to all client records. It should operate within the advisor’s entitlement, the customer’s consent status, the relevant jurisdiction, and the firm’s policies for advice, suitability, privacy, and record keeping.
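To make that entitlement boundary concrete, here is a minimal sketch of a deny-by-default scope check that would run before any retrieval or tool call. Every name in it (the entitlement store, the scope record) is illustrative, not an AWS API:

```python
from dataclasses import dataclass

# Stand-in for the firm's entitlement system; a real implementation would
# query the CRM or an entitlement service, never an in-memory dict.
entitlement_store = {("adv-001", "client-42"): {"jurisdiction": "CH", "consent": True}}

@dataclass
class RequestScope:
    advisor_id: str
    client_id: str

def resolve_scope(scope: RequestScope) -> dict:
    """Deny by default: the agent proceeds only inside an explicit entitlement."""
    entry = entitlement_store.get((scope.advisor_id, scope.client_id))
    if entry is None:
        raise PermissionError("Advisor is not entitled to this client record")
    if not entry["consent"]:
        raise PermissionError("Customer consent not in place for this purpose")
    return entry  # jurisdiction and consent status flow into retrieval filters and prompts
```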
Stage 2: Instrument
Every interaction should generate traceable telemetry.
A useful trace captures (a sample record follows this list):
- user intent;
- retrieved sources;
- memory reads and writes;
- tool calls;
- guardrail events;
- model used;
- prompts and configuration versions;
- citations shown to the user;
- latency and cost;
- human overrides;
- escalation outcomes;
- user feedback.
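As a concrete illustration, a single trace record might look like the sketch below. The schema is an assumption, not an AgentCore or CloudWatch format; the point is that every field in the list above becomes a queryable attribute:

```python
import json
import time
import uuid

# Illustrative trace record; field names are the firm's own schema, not an AWS one.
trace = {
    "trace_id": str(uuid.uuid4()),
    "timestamp": time.time(),
    "user_intent": "summarise complaint case C-1832",
    "retrieved_sources": ["policy/complaints-v7.pdf#p12"],
    "memory_ops": [{"op": "read", "namespace": "advisor/adv-001"}],
    "tool_calls": [{"name": "get_case", "status": "ok", "latency_ms": 240}],
    "guardrail_events": [],
    "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
    "prompt_version": "complaints-prompt-v12",
    "citations_shown": True,
    "human_override": None,
    "user_feedback": "thumbs_up",
}
print(json.dumps(trace, indent=2))  # ship to the firm's telemetry pipeline
```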
This telemetry is not the same as learning. It is the raw material for learning.
A bank-grade agent does not improve simply because logs exist. It improves when those logs are converted into evaluation evidence, root-cause analysis, and approved change.
One important nuance: the data handling that Amazon Bedrock applies to foundation-model inference should not be confused with the telemetry and state that an agent platform intentionally captures. An agentic system may store memory records, evaluation artefacts, traces, tool-call logs, CloudWatch Logs, and human-feedback records according to the firm’s configuration. Those artefacts are part of the governed improvement flywheel and must be treated as regulated data, with retention, residency, access control, deletion, and audit policies applied.
Stage 3: Evaluate
The interaction is scored using automated and human evaluation methods.
Low-risk workflows may use more automated evaluation. Higher-risk workflows — such as regulated advice, credit decisions, complaints, fraud, AML, or vulnerable-customer handling — need stronger deterministic checks and human review.
A mature evaluation framework should include (a minimal harness sketch follows the list):
- golden test sets;
- regression tests;
- grounding checks;
- policy-compliance checks;
- deterministic business-rule checks;
- LLM-as-judge checks where appropriate;
- human review for high-risk cases;
- production trace review;
- A/B or champion/challenger testing;
- statistical release gates;
- post-release monitoring.
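A minimal harness for the deterministic parts of that framework might look like this sketch, where `run_agent` and the golden-case fields are placeholders for the firm's own agent entry point and schema:

```python
# Deterministic assertions over an agent transcript; purely illustrative.
GOLDEN_CASES = [
    {
        "prompt": "Client asks to invest their full pension in a single crypto fund.",
        "must_cite": ["suitability-policy"],
        "must_not_contain": ["guaranteed returns"],
        "must_escalate": True,
    },
]

def evaluate(run_agent):
    """run_agent(prompt) is assumed to return {"text", "citations", "escalated"}."""
    failures = []
    for case in GOLDEN_CASES:
        result = run_agent(case["prompt"])
        if any(c not in result["citations"] for c in case["must_cite"]):
            failures.append((case["prompt"], "missing required citation"))
        if any(p in result["text"].lower() for p in case["must_not_contain"]):
            failures.append((case["prompt"], "prohibited phrase in response"))
        if case["must_escalate"] and not result["escalated"]:
            failures.append((case["prompt"], "failed to escalate"))
    return failures
```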
Agent quality must be measured at the workflow level, not just the final-answer level. The important question is not only “Was the final response fluent?” It is also: Did the agent retrieve the right evidence, use the right skill, call the right tool, respect entitlements, cite the right sources, and avoid unsafe action?
Amazon Bedrock AgentCore Evaluations became generally available in March 2026 and supports built-in evaluators across response quality, safety, task completion, and tool usage, plus Ground Truth and custom evaluators.
Stage 4: Diagnose
When an agent fails, the failure should not disappear into logs.
It should become a structured improvement item.
Typical root causes include:
- missing or stale knowledge;
- poor retrieval;
- weak skill instructions;
- incorrect tool output;
- insufficient customer context;
- ambiguous policy;
- model reasoning error, such as incomplete task decomposition, poor tool sequencing, missed constraint, or unsupported inference;
- guardrail failure;
- poor user experience;
- entitlement or access-control issue.
For example, if an advisor copilot gives an unsuitable product recommendation, the right question is not simply “Did the model hallucinate?” The real question is: where did the system fail?
Was the suitability skill incomplete? Was the customer memory stale? Was the wrong policy retrieved? Was the product rule ambiguous? Did the model ignore a constraint? Did the evaluation set miss a key scenario?
This diagnosis determines where the fix should be applied.
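One way to enforce that discipline is to make each diagnosis a typed work item linked back to the failing trace. The sketch below assumes hypothetical field names that would map to the firm's own issue tracker:

```python
from dataclasses import dataclass, field

@dataclass
class ImprovementItem:
    trace_id: str                  # links back to the failing interaction
    root_cause: str                # e.g. "stale_knowledge", "weak_skill", "tool_error"
    failing_layer: str             # memory | knowledge | skill | tool | guardrail | model
    proposed_fix: str
    eval_case_added: bool = False  # every diagnosis should add a regression case
    approvals: list[str] = field(default_factory=list)

item = ImprovementItem(
    trace_id="7f3a-example",
    root_cause="stale_knowledge",
    failing_layer="knowledge",
    proposed_fix="Replace 2023 suitability circular with current version; add expiry date",
)
```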
Stage 5: Improve
The fix should be applied at the smallest safe layer.
That may mean:
- updating memory policy;
- curating the knowledge base;
- improving retrieval or reranking;
- revising a skill;
- updating a tool schema;
- strengthening a guardrail;
- adding an evaluation case;
- adjusting a system prompt;
- changing the model;
- fine-tuning or distilling a model.
This is a critical design principle. In a regulated environment, not every improvement requires a new model. Often, the best fix is a better skill, a cleaner tool contract, a stronger retrieval filter, a new deterministic check, or an additional test case.
Stage 6: Release
Finally, the improved component is packaged as a versioned release.
It should be tested against golden datasets and regression suites, reviewed by the right control functions, deployed to a limited audience, monitored, and either rolled forward or rolled back based on evidence.
This is the difference between a bank-grade agent and a demo.
A demo improves when someone tweaks a prompt. A bank-grade agent improves when a controlled component is changed, tested, approved, monitored, and made reversible.
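As an illustration, a statistical release gate can be as simple as a non-inferiority check across golden-set metrics plus a required improvement on the target metric. The thresholds below are placeholders that a real firm would set with Model Risk Management:

```python
def release_gate(champion_scores: dict, challenger_scores: dict) -> str:
    """Challenger ships only if non-inferior everywhere and better on the target."""
    non_inferiority_margin = 0.02  # illustrative; set via MRM-approved policy
    for metric, champ in champion_scores.items():
        if challenger_scores[metric] < champ - non_inferiority_margin:
            return f"ROLLBACK: regression on {metric}"
    if challenger_scores["grounding"] <= champion_scores["grounding"]:
        return "HOLD: no measured improvement on target metric"
    return "PROMOTE: challenger passes all gates"

print(release_gate(
    {"grounding": 0.91, "policy_compliance": 0.99},
    {"grounding": 0.94, "policy_compliance": 0.99},
))
```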
The Core Principle: Context Engineering, Not Silent Model Training
The most important idea behind the flywheel is context engineering.
The model’s weights do not change every time a customer speaks to the agent. The agent improves because better information is placed into the model’s context at the right time:
- customer intent;
- permissioned memory;
- retrieved policy;
- product rules;
- market-specific regulation;
- tool outputs;
- risk controls;
- previous decisions;
- human feedback.
In a regulated environment, “the agent learns” should not mean uncontrolled self-training. It should mean:
- the agent has better memory;
- the agent retrieves better knowledge;
- the agent invokes better skills and tools;
- the agent is measured by better evaluations;
- the agent is improved through controlled release management;
- the underlying model is updated only through approved offline fine-tuning, distillation, or model replacement.
The safest formulation is:
The agent system improves continuously. The model changes only through approved offline processes.
The Five Improvement Levers
The governed flywheel is powered by five improvement levers: memory, knowledge, skills and tools, evaluation, and offline model optimisation.
Figure: The five improvement levers. Levers 1–4 change live through governed releases; only lever 5 changes the model itself, and only offline.
1. Context and Memory: Consented Personalisation
Memory is the first improvement lever.
Within a session, the agent remembers what has been said, what has been decided, what questions remain open, and what evidence has already been checked. Across sessions — where consent, policy, retention, and access rights allow — the agent may retain selected facts, preferences, summaries, and prior decisions.
Language matters. A safer formulation than “the agent learns the customer as a person” is:
The agent builds a consented, scoped, purpose-limited customer context.
This is what produces the user and customer perception that “the agent is getting better” without implying that the underlying foundation model is changing.
Amazon Bedrock AgentCore Memory can be organised by actor, session, strategy, and namespace. AWS documentation also describes IAM policy patterns to restrict memory access by scopes such as actor, session, and namespace.
AgentCore Memory provides memory primitives and access-control integration points, but the firm remains responsible for defining what may be remembered, when memory may be written, who may retrieve it, how long it is retained, and how deletion or suppression requests are handled.
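A minimal sketch of such a policy gate is shown below. The `write_memory_record` callable and the purpose taxonomy are assumptions standing in for the firm's own AgentCore Memory integration and consent model; nothing here is an AWS API:

```python
ALLOWED_MEMORY_PURPOSES = {"advice_context", "service_preferences"}

def governed_memory_write(record: dict, consent: dict, write_memory_record) -> None:
    """Enforce purpose limitation, consent, and minimisation before any write."""
    if record["purpose"] not in ALLOWED_MEMORY_PURPOSES:
        raise ValueError("Purpose not approved for long-term memory")
    if not consent.get(record["purpose"], False):
        raise PermissionError("No customer consent for this purpose")
    # Minimisation: persist only the approved fields, never the raw transcript.
    minimal = {k: record[k] for k in ("purpose", "fact", "client_id", "expiry")}
    write_memory_record(namespace=f"client/{record['client_id']}", record=minimal)
```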
For example, an advisor copilot may remember, within permitted scope, that a client has cross-border needs, prefers concise summaries, has a household structure relevant to advice, and previously rejected a product due to liquidity concerns. That memory should be scoped to the advisor’s book, linked to the correct client identity, retained only under approved policy, and deleted or suppressed on customer request, privacy event, or regulatory obligation.
The governance point is simple:
Memory is customer data, not a productivity feature.
It requires consent, purpose limitation, minimisation, retention, residency controls, entitlement, explainability, and deletion workflows.
2. Knowledge: The Governed RAG and GraphRAG Flywheel
The second lever is knowledge.
The agent improves because its evidence base improves. New policies, product rules, regulatory circulars, suitability guidance, playbooks, approved case summaries, complaint themes, and operational procedures can be curated into the retrieval layer.
Better curated evidence leads to better grounded answers. Better outcomes create new approved artefacts. Those artefacts improve future retrieval.
But the wording must be controlled. A financial-services firm should not say:
Every resolved case becomes fuel for the next conversation.
A safer formulation is:
Approved and permissioned artefacts — for example, anonymised resolved cases, validated rationales, product guidance, and regulatory circulars — can be curated into the knowledge base with lineage, access control, retention, and residency rules applied.
Amazon Bedrock Knowledge Bases supports managed RAG. GraphRAG for Amazon Bedrock Knowledge Bases became generally available in March 2025 and uses Amazon Neptune Analytics to combine graph relationships with retrieval.
This matters in financial services because many questions are relationship-heavy. A complaints agent, for example, may need to retrieve the current complaints policy, related product rules, jurisdiction-specific timelines, previous approved complaint outcomes, and required evidence templates. The answer may depend on relationships between customer, account, product, region, complaint type, and regulatory obligation.
Semantic RAG retrieves the right documents. GraphRAG retrieves the right relationships. Together they give the agent both evidence and entity context.
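With Amazon Bedrock Knowledge Bases, access-control mapping can also be expressed as metadata filters at retrieval time. The sketch below uses the boto3 `retrieve` API; the knowledge base ID and the `jurisdiction` metadata attribute are assumptions that depend on how the firm ingested and tagged its documents:

```python
import boto3

client = boto3.client("bedrock-agent-runtime")

# Retrieve only UK-scoped sources for a UK complaints question.
response = client.retrieve(
    knowledgeBaseId="KB_ID_PLACEHOLDER",
    retrievalQuery={"text": "complaint handling deadline for mortgage complaints"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "filter": {"equals": {"key": "jurisdiction", "value": "UK"}},
        }
    },
)
for hit in response["retrievalResults"]:
    print(hit["content"]["text"][:120], hit.get("location"))
```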
The governance point:
A knowledge base is not a dumping ground.
RAG quality depends on curation. It needs source approval, lineage, expiry dates, document ownership, access-control mapping, stale-content detection, and evaluation against golden test cases.
3. Skills and Tools: Reusable Capabilities, Not One-off Agents
The third lever is the most strategically important: skills and tools.
A global financial-services firm should not build a separate bespoke agent for every business problem. It should build reusable, governed capabilities that many agents can use.
A useful distinction is:
- Skills are procedural knowledge: instructions, workflows, examples, templates, scripts, and standards that teach the agent how to perform a task reliably.
- Tools are executable actions: APIs, Lambda functions, workflow calls, database queries, case creation, document extraction, or transaction checks.
- Agents are orchestration layers: they decide what context, skill, model, and tool to use for a given task.
This is where improvement compounds.
If a firm improves a reusable suitability skill, every approved agent that uses that skill benefits. If a firm improves a KYC document-extraction tool, every onboarding workflow can benefit. If a firm improves a complaint-classification skill, multiple channels can reuse the same governed capability.
Figure: One governed skill, many dependent agents. A single MRM-approved release of v2.4 propagates to every approved agent that consumes it.
Examples of reusable financial-services skills include:
- suitability checks across jurisdictions;
- source-of-wealth reasoning;
- KYC and KYB document extraction;
- sanctions and PEP screening;
- AML alert triage;
- complaint classification and handling;
- cross-border tax-awareness guidance;
- product matching and comparison;
- meeting-note generation;
- regulatory evidence-pack assembly;
- vulnerable-customer handling;
- fraud escalation;
- credit and lending pre-checks;
- customer onboarding checklists;
- insurance claims triage;
- underwriting pre-assessment.
Amazon Bedrock AgentCore works with open-source frameworks. AgentCore Gateway can convert APIs, Lambda functions, and existing services into MCP-compatible tools and make them available through Gateway endpoints. AgentCore Runtime supports multiple open-source agent frameworks for agent-to-agent orchestration.
The governance point:
Skill-level governance reduces the Model Risk Management review surface.
Instead of approving a whole agent every time, firms can approve changes to discrete skills, tools, prompts, retrieval policies, and model configurations. That creates a smaller blast radius, faster release cadence, and clearer audit trail.
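In practice, this means each skill carries its own versioned manifest that release tooling and reviewers operate on. The sketch below is purely illustrative; the field names, S3 path, and approval IDs are invented for the example:

```python
# An illustrative skill manifest: the unit that gets versioned, reviewed,
# and released independently of the agents that consume it.
suitability_skill = {
    "skill_id": "suitability-check",
    "version": "2.4.0",
    "owner": "wealth-advice-platform",
    "instructions_ref": "s3://skills/suitability/v2.4.0/instructions.md",  # hypothetical
    "tools_required": ["client_profile_lookup", "product_catalogue"],
    "jurisdictions": ["UK", "CH", "SG"],
    "eval_suite": "golden/suitability-v2.4",
    "approvals": {"mrm": "MRM-2025-0173", "compliance": "CMP-2025-0411"},
    "rollback_to": "2.3.2",
}
```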
4. Evaluation and Experimentation: Measurable Improvement, Not Vibes
The fourth lever is evaluation.
Every interaction can produce telemetry: what the user asked, what was retrieved, what tools were called, what answer was given, whether sources were cited, whether a guardrail fired, whether the user accepted the answer, whether the case was escalated, and whether human review confirmed or corrected the output.
But telemetry alone is not enough. The agent improves only when telemetry is converted into structured evaluation and governed change.
A mature evaluation framework combines the elements already listed under Stage 3: golden test sets, regression tests, grounding and policy-compliance checks, deterministic business-rule checks, LLM-as-judge where appropriate, human review for high-risk cases, production trace review, A/B or champion/challenger testing, statistical release gates, and post-release monitoring.
Runtime safety and evaluation should also be kept conceptually separate.
Runtime safety includes guardrails, denied-topic controls, PII detection, contextual grounding, prompt-attack protection, tool permissioning, and human approval for high-risk actions.
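At the API level, runtime safety can be attached per inference call. The sketch below uses the boto3 `converse` API with a guardrail configuration; the model ID, guardrail ID, and version are placeholders:

```python
import boto3

client = boto3.client("bedrock-runtime")
response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model choice
    messages=[{"role": "user", "content": [{"text": "Summarise this complaint case."}]}],
    guardrailConfig={
        "guardrailIdentifier": "GUARDRAIL_ID_PLACEHOLDER",
        "guardrailVersion": "1",
    },
)
# If the guardrail blocks the exchange, stopReason is "guardrail_intervened".
print(response["stopReason"], response["output"]["message"]["content"][0]["text"])
```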
Evaluation measures whether the agent is improving, regressing, or behaving differently across versions.
Amazon Bedrock Guardrails includes Automated Reasoning checks, which became generally available in August 2025. AWS describes Automated Reasoning checks as using formal reasoning techniques to validate model outputs against encoded policies, and AWS reports up to 99% verification accuracy. That is useful for well-defined encoded policies, but it is not a blanket guarantee that all financial advice or operational decisions are correct.
This distinction is important:
Guardrails and Automated Reasoning are runtime safety controls. Evaluations and optimisation are improvement controls. Both are needed.
The governance point:
LLM-as-judge is useful, but it should not be the only evaluation mechanism for regulated decisions.
It must be combined with deterministic checks, policy tests, human review, and trace-based investigation.
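A simple way to encode that rule is that the judge can veto but never approve alone. A sketch, assuming scores normalised to the range 0 to 1:

```python
def combined_verdict(deterministic_results: list[bool], judge_score: float,
                     judge_bar: float = 0.8) -> bool:
    """Pass only if every deterministic check passes AND the judge clears its bar."""
    return all(deterministic_results) and judge_score >= judge_bar
```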
5. Offline Model Optimisation: Fine-tuning, Distillation, and Model Selection
The fifth lever is offline model optimisation.
This is the only lever where the model itself genuinely changes. In a regulated firm, model change should happen offline, on a scheduled and approved cadence, using curated data and controlled experiments.
There are three main patterns:
- Model selection — choosing the best available model for a task.
- Fine-tuning — adapting a model to a firm-specific task or style.
- Distillation — transferring behaviour from a larger teacher model into a smaller, faster, cheaper student model.
A growing production pattern is to use smaller, specialist, fine-tuned or distilled models for well-scoped, high-volume tasks, while reserving larger frontier models for complex reasoning, open-ended tasks, or the teacher role in distillation.
This is a model-portfolio approach: multiple models, each matched to a task and risk tier.
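A model portfolio can be as simple as an approved routing table keyed by task type and risk tier. The mapping below is purely illustrative; the aliases are not real model IDs, and the actual routing policy is a firm-level decision:

```python
MODEL_PORTFOLIO = {
    ("classification", "low"): "distilled-complaints-classifier",  # small student model
    ("extraction", "low"): "fine-tuned-kyc-extractor",
    ("reasoning", "high"): "frontier-model-alias",                 # via approved alias
}

def select_model(task_type: str, risk_tier: str) -> str:
    """Route each task to its approved model; fail closed if no mapping exists."""
    model = MODEL_PORTFOLIO.get((task_type, risk_tier))
    if model is None:
        raise ValueError("No approved model for this task/risk combination")
    return model
```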
Amazon Bedrock Model Distillation became generally available in May 2025. AWS reports that distilled models can be up to 500% faster and up to 75% less expensive than original models, with less than 2% accuracy loss for RAG-style use cases. Those are AWS-reported benchmarks and should be validated on each firm’s own workloads and golden datasets before being used in business cases or regulatory submissions.
Bedrock native fine-tuning supports specific model variants rather than entire model families. The current documentation lists supported models including selected Amazon Nova models, Amazon Nova Canvas, Amazon Titan Image Generator, Amazon Titan Multimodal Embeddings, Anthropic Claude 3 Haiku, and Meta Llama 3.1–3.3 variants, with Region support varying by model. Firms should always verify the current supported-model list before committing to a design.
The Amazon Bedrock documentation for reinforcement fine-tuning currently lists support for Amazon Nova 2 Lite, with specific dataset and prompt limits.
The governance point:
Fine-tuned or distilled models need model cards, training-data lineage, benchmark results, bias and fairness review, red-team results, privacy review, regional deployment approval, rollback plans, and periodic revalidation.
This is governed model change, not live learning from customers.
Why This Pattern Works for Global Financial Services
This pattern is approvable because it avoids uncontrolled self-learning.
Under the EU AI Act, high-risk AI systems — including credit scoring and regulated advice — must meet governance obligations from August 2026, covering risk assessment, dataset quality, activity logging, human oversight, and technical documentation. The flywheel aligns with those obligations by design: structured telemetry supports activity logging, evaluation supports risk assessment, controlled release supports human oversight, and lineage on skills, knowledge, and model artefacts supports technical documentation.
1. Change happens at skill and configuration level
Improving one skill does not require rebuilding the entire agent. This reduces review scope and blast radius.
2. The model is not silently retrained
The foundation model does not update during inference. Model changes happen through scheduled model selection, fine-tuning, distillation, or replacement.
3. Data use is explicit
Customer data may be used for memory, retrieval, evaluation, or approved customisation only under defined policy. It should not silently train shared foundation models.
4. Tools are permissioned
Agent access to systems should be mediated through identity, entitlement, least privilege, approval flows, network controls, and audit.
5. Runtime safety is layered
Guardrails, Automated Reasoning checks, tool controls, grounding checks, and human approval patterns provide runtime protection.
6. Release is evidence-based
A/B testing and champion/challenger deployment can support improvement, but release gates must include auditability, approval evidence, rollback, and post-release monitoring.
7. Segregation is explicit
Feedback from one market, business line, or client segment should not automatically cross into another. Data should be separated by jurisdiction, business purpose, client type, and access entitlement.
8. Rollback is routine
Every release should be reversible: prompt version, skill version, tool version, retrieval configuration, guardrail policy, model alias, and deployment configuration.
9. Decision-level traceability is essential
Every agent response should be linkable to the prompts used, retrieval hits, tool outputs, guardrail decisions, model version, and relevant approval evidence. This supports second-line review, incident investigation, and regulator-facing explanation.
Misconceptions to Avoid
| Misleading phrase | Safer phrase |
|---|---|
| “The model learns from every customer interaction.” | “The agent system improves through governed updates to memory, knowledge, skills, tools, evaluations, and approved offline model optimisation.” |
| “The agent gets smarter automatically.” | “The agent improves deliberately through measured, approved releases.” |
| “Customer data trains the model.” | “Customer data may personalise the governed agent experience through scoped memory and retrieval; it does not silently train shared foundation models.” |
| “Continuous learning means continuous model updates.” | “Continuous improvement means continuous monitoring and evaluation, with controlled releases and scheduled model updates.” |
| “The agent knows the customer personally.” | “The agent uses consented, purpose-limited customer context.” |
| “A/B testing lets us optimise automatically.” | “A/B testing provides evidence for release decisions, subject to governance and audit requirements.” |
| “Automated Reasoning proves the answer is always right.” | “Automated Reasoning can validate outputs against encoded policies for defined cases; it does not replace end-to-end risk controls.” |
Practical Architecture Guidance
For global financial-services environments, I would summarise the architecture pattern as follows:
- Start with the flywheel, not the model. Define how interactions become telemetry, telemetry becomes evaluation, evaluation becomes diagnosis, diagnosis becomes controlled improvement, and improvement becomes approved release.
- Invest in context engineering. The quality of the agent depends on what context it receives: memory, policy, documents, tool outputs, risk rules, and user intent.
- Build skills, not agent sprawl. Package repeatable capabilities as governed skills that can be reused across multiple agents.
- Treat memory as regulated data. Memory needs consent, purpose limitation, minimisation, retention, deletion, residency, and entitlement controls.
- Curate RAG like a regulated knowledge product. Retrieval quality depends on source quality, lineage, ownership, freshness, and access control.
- Separate runtime safety from improvement controls. Guardrails protect runtime behaviour. Evaluations measure quality. Optimisation changes the system. Each has a different governance role.
- Use model portfolios. Smaller specialist or distilled models can serve well-scoped tasks; larger models can handle complex reasoning or serve as teacher models.
- Validate service-level controls before production. Data residency, KMS customer-managed keys, private connectivity, CloudTrail coverage, logging, and regional availability should be validated per service, Region, and feature before regulated production use.
- Do not use Preview capabilities blindly in regulated production. Preview services can be valuable for experimentation, but auditability, CloudTrail coverage, regional availability, encryption support, and operational controls must be validated before regulated production use.
Conclusion
The right way to explain agent improvement in financial services is not to say that agents “learn from every customer.”
They do not need to.
A well-designed agent improves because the system around the model improves: memory becomes more useful, knowledge becomes better curated, skills become more reliable, tools become safer, evaluations become more precise, and model changes happen through controlled offline processes.
For global financial-services firms, that is the right pattern:
- continuous improvement without uncontrolled model drift;
- personalisation without silent training;
- automation without losing auditability;
- innovation without bypassing governance.
The future of agentic AI in financial services is not a single autonomous super-agent. It is a governed platform of reusable skills, controlled tools, engineered context, measurable evaluations, and approved model optimisation.
That is how AI agents improve over time in a way that a regulated financial-services firm can actually trust.
Source Notes
Public AWS documentation and announcements consulted. Verify the latest versions before regulatory or external use.
Amazon Bedrock and data protection
Amazon Bedrock AgentCore
- Amazon Bedrock AgentCore generally available
- What is Amazon Bedrock AgentCore?
- AgentCore Memory
- AgentCore Memory organisation
- AgentCore Runtime frameworks
- AgentCore Gateway
- AgentCore Gateway CloudTrail
- AgentCore Harness Preview
- AgentCore Optimization Preview
- AgentCore Batch Evaluation Preview
- AgentCore Evaluations generally available
- AgentCore Evaluations documentation
Bedrock Knowledge Bases, Guardrails, and model customisation
- Amazon Bedrock Knowledge Bases GraphRAG generally available
- Automated Reasoning Checks for Amazon Bedrock Guardrails generally available
- Amazon Bedrock Model Distillation generally available
- Amazon Bedrock Model Distillation
- Amazon Bedrock fine-tuning supported models
- Amazon Bedrock reinforcement fine-tuning for Nova models
Questions or feedback? Reach out on LinkedIn