The framework
Eight domains of accountability
HAL evaluates a system across eight domains. Each is scored 0–5; together they form a HAL Score out of 40, scaled by how much autonomy the system holds.
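As a rough illustration of the arithmetic only, the raw total can be sketched as below. The domain keys are placeholders, whole-number scores are an assumption, and the autonomy scaling is deliberately left out because its formula is not given here.

```python
def raw_hal_score(domain_scores: dict[str, int]) -> int:
    """Sum eight per-domain scores (each 0-5) into a raw total out of 40.

    Assumption: scores are whole numbers. The autonomy scaling mentioned
    in the text is NOT applied, because its formula is not specified.
    """
    if len(domain_scores) != 8:
        raise ValueError(f"expected 8 domain scores, got {len(domain_scores)}")
    for name, score in domain_scores.items():
        if not isinstance(score, int) or not 0 <= score <= 5:
            raise ValueError(f"{name}: score must be an integer from 0 to 5")
    return sum(domain_scores.values())


# Placeholder domain keys; substitute the framework's own domain names.
example_scores = {
    f"domain_{i}": s for i, s in enumerate([4, 3, 5, 2, 4, 3, 3, 1], start=1)
}
example_total = raw_hal_score(example_scores)  # 25 out of a possible 40
```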
Definition
A single, named individual is accountable for the agentic system: one person, not a diffuse committee or "the platform". Ownership is the anchor on which every other domain depends.
Why it matters
Accountability cannot be delegated to software. When a system acts, someone must be answerable for what it did and did not do. Diffuse ownership is the most common root cause of governance failure: when everyone is responsible, no one is.
Failure
An automated client-communication agent sends an incorrect legal deadline to 400 clients. The incident review finds the agent was "owned by the AI working group". No individual can explain its authority, and no one has the standing to switch it off.
Good
Each deployed agent has a named accountable owner recorded in a register, with a deputy. The owner has signed off on the authority granted, reviews incidents, and holds the documented power to suspend the system immediately.
Implementation guidance
- → Record a single named owner (a person, with a role and a deputy) for every agent in production.
- → Make ownership explicit at sign-off: the owner accepts the authority being delegated, in writing.
- → Give the owner real power: the ability to pause or revoke the system without a change-board cycle.
- → Re-confirm ownership on a schedule and whenever the owner changes role.
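One way to make the guidance above checkable is a small register entry with a gap check. This is a minimal sketch: the field names, the `ownership_gaps` helper, and the 180-day re-confirmation interval are all illustrative assumptions, not part of HAL.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class OwnershipRecord:
    """Hypothetical register entry for one production agent."""
    agent_id: str
    owner_name: str                 # a person, not a committee
    owner_role: str
    deputy_name: str
    signed_off_on: Optional[date]   # None = authority not yet accepted in writing
    last_confirmed_on: date


def ownership_gaps(
    record: OwnershipRecord,
    today: date,
    reconfirm_every: timedelta = timedelta(days=180),  # assumed interval
) -> list[str]:
    """Return the ownership problems found for one register entry."""
    gaps = []
    if not record.owner_name.strip():
        gaps.append("no named owner")
    if not record.deputy_name.strip():
        gaps.append("no deputy")
    if record.signed_off_on is None:
        gaps.append("authority not signed off")
    if today - record.last_confirmed_on > reconfirm_every:
        gaps.append("ownership re-confirmation overdue")
    return gaps
```

An empty list means the entry passes these checks. It does not show that the owner can actually pause the system, which still has to be verified in practice.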
Questions to ask
- ? Who, by name, is accountable if this system causes harm?
- ? Does that person have the authority to switch it off today?
- ? Is ownership recorded somewhere auditable, with a deputy?
Definition
Beyond authority, limits are the non-negotiable boundaries: the actions the system must never take, and the conditions under which it must stop, regardless of confidence.
Why it matters
Authority says what is allowed; limits say what is forbidden and where the floor is. Limits are what protect you when the model is confidently wrong. They convert "the model decided not to" into "the system could not".
Failure
A collections agent is permitted to contact customers. With no limit on frequency, a logic loop causes it to email one vulnerable customer 71 times in a day. There was authority to contact; there was no limit on contact.
Good
Hard limits (maximum spend, prohibited actions, rate caps, protected data classes, and "never act on this customer segment") are enforced as guard rails that halt the system and escalate, independent of the model.
Implementation guidance
- → Define prohibited actions explicitly: the things the system must never do.
- → Add rate and volume caps that trigger a stop, not just a warning.
- → Protect sensitive segments and data classes with hard blocks.
- → Test limits adversarially: try to make the system breach them.
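The caps and hard blocks above can be sketched as a guard that sits outside the model and stops the system when breached. The class name, the per-day cap, and the segment labels are hypothetical; a real deployment would also persist the counters and notify a human when the guard halts.

```python
from collections import defaultdict


class LimitBreached(Exception):
    """Raised when a hard limit stops an action; the caller should escalate."""


class ContactGuard:
    """Hypothetical guard rail checked before every outbound contact."""

    def __init__(self, max_per_day: int, blocked_segments: set[str]):
        self.max_per_day = max_per_day
        self.blocked_segments = blocked_segments
        self._sent = defaultdict(int)   # (customer_id, day) -> contacts so far
        self.halted = False

    def authorize(self, customer_id: str, segment: str, day: str) -> None:
        """Record one contact, or halt the system and raise if a limit is hit."""
        if self.halted:
            raise LimitBreached("halted pending human review")
        if segment in self.blocked_segments:
            self._halt(f"protected segment: {segment}")
        if self._sent[(customer_id, day)] >= self.max_per_day:
            self._halt(f"daily contact cap reached for {customer_id}")
        self._sent[(customer_id, day)] += 1

    def _halt(self, reason: str) -> None:
        # A breach stops everything, not just the offending action.
        self.halted = True
        raise LimitBreached(reason)
```

With a cap of 3, the fourth contact to the same customer on the same day raises `LimitBreached`, and every later call is refused until a human resets the guard. That is the difference between a stop and a warning.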
Questions to ask
- ? What must this system never do, under any circumstances?
- ? What happens automatically when a limit is reached?
- ? Have the limits been tested by trying to break them?
Definition
Defined paths by which the system hands control to a human when it is uncertain, when it hits a limit, or when a case falls outside its competence.
Why it matters
A system that never escalates is a system that has been told to guess. The quality of governance is often the quality of its escalation: clear triggers, a named recipient, and a defined response time.
Failure
A triage agent encounters a matter type it has never seen. With no escalation path, it forces the case into the nearest category and routes it incorrectly. The deadline is missed; no human was ever asked.
Good
Escalation triggers are explicit (low confidence, novel case, limit reached, high stakes). Each routes to a named human or role with a defined SLA, and the system pauses the relevant action until a human responds.
Implementation guidance
- → Define the triggers that must escalate: errors, uncertainty, and high-stakes cases.
- → Route each trigger to a named role with a response-time expectation.
- → Pause the affected action while awaiting a human, rather than proceeding on a guess.
- → Monitor escalation rates. Too few can be as worrying as too many.
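The trigger-to-recipient mapping described above might look like the following sketch. The trigger names follow the text; the roles, response times, and the 0.8 confidence floor are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Route:
    recipient_role: str
    respond_within: timedelta


# Hypothetical routing table: every trigger has a named role and an SLA.
ROUTES = {
    "limit_reached": Route("system owner", timedelta(hours=1)),
    "high_stakes": Route("system owner", timedelta(hours=1)),
    "novel_case": Route("duty reviewer", timedelta(hours=4)),
    "low_confidence": Route("duty reviewer", timedelta(hours=4)),
}


def escalation_trigger(confidence: float, known_case_type: bool,
                       limit_hit: bool, high_stakes: bool,
                       min_confidence: float = 0.8) -> Optional[str]:
    """Return the first trigger that applies, or None if the agent may act."""
    if limit_hit:
        return "limit_reached"
    if high_stakes:
        return "high_stakes"
    if not known_case_type:
        return "novel_case"
    if confidence < min_confidence:
        return "low_confidence"
    return None


def decide(confidence: float, known_case_type: bool,
           limit_hit: bool = False, high_stakes: bool = False) -> dict:
    """Proceed, or pause the action and name who must respond and by when."""
    trigger = escalation_trigger(confidence, known_case_type, limit_hit, high_stakes)
    if trigger is None:
        return {"action": "proceed"}
    route = ROUTES[trigger]
    return {"action": "pause", "trigger": trigger,
            "escalate_to": route.recipient_role,
            "respond_within": route.respond_within}
```

In the triage failure described earlier, `decide(0.95, known_case_type=False)` would pause and route to the duty reviewer instead of forcing the case into the nearest category.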
Questions to ask
- ? Under what conditions does this system ask a human?
- ? Who receives the escalation, and how fast must they respond?
- ? While awaiting an answer, does the system pause the action or proceed?
Definition
The system produces a durable, tamper-evident record of what it did, why, on what inputs, and under whose authority, sufficient to reconstruct any decision after the fact.
Why it matters
Accountability is retrospective. When something goes wrong, or a regulator asks, you must be able to show what happened. A decision you cannot evidence is a decision you cannot defend.
Failure
A loan-decisioning agent declines an application. Six months later the customer complains of bias. The team can see the outcome but not the inputs, the model version, or the reasoning. There is no defence because there is no record.
Good
Each action writes an immutable log entry: inputs, model and prompt version, the decision, the authority invoked, confidence, and any human touchpoints. Entries are retained for the relevant legal period and are queryable.
Implementation guidance
- → Log inputs, outputs, reasoning, versions, and authority for every action.
- → Make the record immutable and time-stamped. Tamper-evidence matters.
- → Retain records for the legally relevant period and make them queryable by case.
- → Capture human touchpoints: who reviewed, overrode, or approved.
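One common way to get tamper-evidence is a hash chain, where each entry commits to the one before it. The sketch below assumes JSON-serialisable inputs and uses hypothetical field names. A hash chain detects edits but does not prevent them, so real deployments would typically pair it with write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # prev_hash used for the first entry


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log: list, *, timestamp: str, inputs: dict, model_version: str,
                 prompt_version: str, decision: str, authority: str,
                 confidence: float, human_touchpoints=()) -> dict:
    """Append one decision record, chained to the previous entry's hash."""
    body = {
        "timestamp": timestamp,
        "inputs": inputs,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "decision": decision,
        "authority": authority,
        "confidence": confidence,
        "human_touchpoints": list(human_touchpoints),
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Return False if any entry was altered, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True
```

Quietly changing a past `decision` makes `verify_chain` return `False`, because the stored hash no longer matches the entry body.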
Questions to ask
- ? Could you reconstruct any single decision six months later?
- ? Does the record include inputs, versions, and the authority invoked?
- ? Is the log immutable, or can it be quietly edited?
Definition
Continuous observation of the live system (its behaviour, error rates, escalation patterns, and drift) with alerting that brings a human in before small problems compound.
Why it matters
Models drift, inputs change, and edge cases accumulate. A system that was safe at deployment is not necessarily safe a quarter later. Monitoring is the difference between catching a problem and reading about it in an incident report.
Failure
A classification agent slowly degrades as input formats change upstream. No one is watching its accuracy. By the time the drop is noticed in a quarterly review, three months of misclassified cases must be remediated.
Good
Live dashboards track volume, error and escalation rates, confidence distribution, and drift against a baseline. Thresholds trigger alerts to the owner, and anomalies can auto-pause the system pending review.
Implementation guidance
- → Define the metrics that signal health: error rate, escalation rate, confidence, drift.
- → Set alert thresholds that page the owner, not just fill a dashboard.
- → Baseline behaviour at deployment and watch for divergence.
- → Allow monitoring to auto-pause the system on severe anomalies.
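A baseline comparison with two thresholds, one that alerts the owner and one that pauses the system, could be sketched as follows. The metric set follows the text; the relative-change measure and the 25% and 100% thresholds are assumptions to be tuned per system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Health metrics for one observation window."""
    error_rate: float
    escalation_rate: float
    mean_confidence: float


METRICS = ("error_rate", "escalation_rate", "mean_confidence")


def relative_change(baseline: float, current: float) -> float:
    """Absolute change as a fraction of the baseline value."""
    if baseline == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / baseline


def assess(baseline: Snapshot, current: Snapshot,
           alert_at: float = 0.25, pause_at: float = 1.0) -> str:
    """Return 'ok', 'alert_owner', or 'pause' from the worst-drifting metric."""
    worst = max(
        relative_change(getattr(baseline, m), getattr(current, m)) for m in METRICS
    )
    if worst >= pause_at:
        return "pause"
    if worst >= alert_at:
        return "alert_owner"
    return "ok"
```

Because the change is measured in both directions, an escalation rate that collapses toward zero is flagged just as a spike would be.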
Questions to ask
- ? How would you know, today, if this system started behaving differently?
- ? Who gets alerted, and what threshold trips the alert?
- ? Can monitoring pause the system, or only report on it?
Definition
A scheduled, structured re-examination of the system against its original justification: its authority, performance, incidents, and continued fitness for purpose.
Why it matters
Deployment is a decision made with the information available at the time. Review is how that decision is kept honest as circumstances change. Without it, authority granted once persists unquestioned long after its justification has lapsed.
Failure
An agent granted broad authority during a backlog crisis keeps that authority for two years after the backlog clears. No review ever revisited whether the original justification still held. The risk was never reassessed.
Good
Each system has a review date and owner. Reviews examine performance, incidents, drift, and whether the granted authority is still warranted, and can recommend re-scoping, re-approval, or retirement.
Implementation guidance
- → Set a review date at deployment; never deploy without one.
- → Review against the original justification, not just current performance.
- → Include incidents, escalations, and drift in the review pack.
- → Empower the review to re-scope, re-approve, or retire the system.
Questions to ask
- ? When is this system next due for a formal review?
- ? Does the original justification for its authority still hold?
- ? Who decides whether it continues, changes, or is retired?
Definition
A clear, documented understanding of where legal and financial liability sits for the system's actions, internally and across vendors, customers, and regulators.
Why it matters
When an agentic system causes loss, liability does not disappear because "the AI did it". It lands somewhere. Knowing where, and having allocated it deliberately through contracts, insurance, and disclosure, is the final test of accountability readiness.
Failure
A vendor-supplied agent makes an erroneous regulatory filing. The contract is silent on AI-driven actions. Months are lost arguing whether the vendor, the integrator, or the deploying firm bears the cost, while the regulator holds the firm responsible regardless.
Good
Liability is mapped: which actions create legal obligations, who bears the cost of error, what the vendor contract says about autonomous actions, what insurance covers, and what must be disclosed to affected parties.
Implementation guidance
- → Identify which actions can create legal or financial obligations.
- → Allocate liability deliberately in vendor and customer contracts.
- → Confirm insurance covers autonomous-system actions.
- → Define disclosure obligations to customers and regulators.
Questions to ask
- ? If this system causes loss tomorrow, who pays, and on what basis?
- ? What does the vendor contract say about autonomous actions?
- ? What must be disclosed to customers or regulators about its use?