AI behaviour detection accuracy: what the numbers actually mean
Behaviour detection accuracy figures get thrown around without context. Here's what 94% accuracy actually means, why precision matters more than recall, and how to evaluate vendors honestly.
The number on the slide is not the number
Almost every AI behaviour detection vendor has a slide with a single big number on it. 94% accuracy. 97% accuracy. Sometimes 99%. The number means something. It also frequently means less than the slide implies.
Three things almost always go missing. What environment the number was measured in. Whether it is precision or recall. And how it was validated. Without those three pieces of context, the headline is marketing, not engineering.
This is what the numbers actually mean, how to read them, and what to ask vendors so you can compare honestly.
Precision vs recall, in plain English
Two different things get called "accuracy" in this market. They measure opposite failure modes.
Precision. Of the alerts the system fires, what percentage are correct? A high-precision system rarely cries wolf. Low precision means your team gets alerts that turn out to be nothing.
Recall. Of the things the system should have caught, what percentage did it catch? A high-recall system rarely misses incidents. Low recall means your team finds out about incidents after the fact, from somewhere other than the system.
You can have 99% precision and 60% recall. You can have 60% precision and 99% recall. They are not interchangeable. They are both important. They trade off against each other.
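To make the two definitions concrete, here is a minimal sketch that computes both from alert counts. The counts are hypothetical, chosen only to reproduce the 99%/60% split described above; they are not measurements from any real deployment.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Of the alerts the system fired, the share that were correct."""
    fired = true_positives + false_positives
    return true_positives / fired if fired else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Of the incidents that really happened, the share the system caught."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


# Hypothetical counts for two differently tuned systems.
# "cautious": fires rarely, almost never wrong, misses a lot.
cautious = {"tp": 99, "fp": 1, "fn": 66}
# "eager": fires on anything, catches almost everything, often wrong.
eager = {"tp": 99, "fp": 66, "fn": 1}

for name, c in (("cautious", cautious), ("eager", eager)):
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Both systems produced the same number of correct alerts. What differs is how many wrong alerts came with them and how many incidents went unseen.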
The trade-off most vendors do not explain
Tuning a behaviour detection system is largely about deciding where you sit on the precision-recall curve. Tune it to fire on anything that might be relevant, and recall goes up while precision goes down. Tune it to only fire when the system is highly confident, and precision goes up while recall goes down.
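The trade-off can be seen by sweeping a confidence threshold over a set of scored events. The sketch below assumes the system attaches a confidence score between 0 and 1 to each candidate event, which is a common design but not something every product exposes. The scores and labels are made up for illustration.

```python
from typing import Optional

# Hypothetical scored events: (model confidence, was it a real incident?)
events = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.70, True), (0.60, False), (0.55, True), (0.40, False),
    (0.30, False), (0.20, True),
]


def precision_recall_at(
    threshold: float, scored_events: list[tuple[float, bool]]
) -> tuple[Optional[float], float]:
    """Fire an alert when confidence >= threshold, then score the result."""
    tp = sum(1 for score, real in scored_events if score >= threshold and real)
    fp = sum(1 for score, real in scored_events if score >= threshold and not real)
    fn = sum(1 for score, real in scored_events if score < threshold and real)
    # Precision is undefined when no alerts fire at all.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.85, 0.50, 0.20):
    p, r = precision_recall_at(threshold, events)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold from 0.85 to 0.20 moves this toy system from perfect precision with half the incidents missed to full recall with four in ten alerts wrong.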
For most venue environments, the right setting is high precision. The reason is human, not technical. A security team that gets ten false alerts in a shift stops trusting the system. A team that gets two real alerts in a shift treats every alert as real.
The right tuning question to ask a vendor is not "what is your accuracy?". It is "what is your precision in environments comparable to mine, and what is the false-positive rate per camera per day?".
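The reason to ask for both figures is that precision alone says nothing about alert volume. A rough sketch of the per-camera rate follows; the function name and the two vendor scenarios are hypothetical.

```python
def false_positives_per_camera_per_day(
    false_alerts: int, cameras: int, days: float
) -> float:
    """Average number of wrong alerts one camera produces in one day."""
    if cameras <= 0 or days <= 0:
        raise ValueError("cameras and days must be positive")
    return false_alerts / (cameras * days)


# Two hypothetical trials, both 20 cameras over 10 days, both 90% precision.
# Vendor A fired 200 alerts, of which 20 were wrong.
# Vendor B fired 4,000 alerts, of which 400 were wrong.
vendor_a = false_positives_per_camera_per_day(false_alerts=20, cameras=20, days=10)
vendor_b = false_positives_per_camera_per_day(false_alerts=400, cameras=20, days=10)

print(f"Vendor A: {vendor_a:.1f} false alerts per camera per day")
print(f"Vendor B: {vendor_b:.1f} false alerts per camera per day")
```

Identical precision on the slide, a twentyfold difference in what the security team has to triage.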
Why the environment matters
A behaviour detection model trained and tested on a dataset of bar fights will perform very differently in a hospital corridor. The environment changes the underlying behavioural baseline.
The same is true within a single venue. A hotel lobby at peak check-in has a different behavioural baseline from the same lobby at 3am. A construction site at shift change has a different baseline from the same site at lunchtime. A model that does not learn these contextual baselines will fire constantly during peaks and miss subtle behaviours during quiet periods.
The right systems use a learning period to establish baselines per camera, per time-of-day band, per day-of-week pattern. After that learning period (typically 14 to 30 days), accuracy in the live environment usually exceeds the dataset benchmarks. The honest vendor will quote you both numbers: dataset accuracy and post-learning accuracy.
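As an illustration of what a contextual baseline can look like, the sketch below keys activity samples by camera, day of week and time-of-day band, then scores a new reading by how far it sits from that context's norm. This is an assumption about one simple way to do it, not a description of any vendor's method: the 4-hour band width, the `activity` measure and the z-score are all illustrative choices.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev
from typing import Optional

BAND_HOURS = 4  # assumed width of a time-of-day band


def baseline_key(camera_id: str, when: datetime) -> tuple[str, int, int]:
    """Group samples by camera, day of week, and time-of-day band."""
    return (camera_id, when.weekday(), when.hour // BAND_HOURS)


class BaselineStore:
    def __init__(self) -> None:
        self._samples: dict[tuple[str, int, int], list[float]] = defaultdict(list)

    def observe(self, camera_id: str, when: datetime, activity: float) -> None:
        """Record one activity reading during the learning period."""
        self._samples[baseline_key(camera_id, when)].append(activity)

    def deviation(
        self, camera_id: str, when: datetime, activity: float
    ) -> Optional[float]:
        """Standard deviations from this context's norm, or None if unknown."""
        samples = self._samples.get(baseline_key(camera_id, when), [])
        if len(samples) < 2:
            return None  # not enough history for this context yet
        spread = pstdev(samples)
        if spread == 0:
            return None  # no variation observed, so no meaningful scale
        return (activity - mean(samples)) / spread


store = BaselineStore()
mondays = [datetime(2025, 1, day) for day in (6, 13, 20, 27)]
# Hypothetical lobby readings: busy at 16:00, nearly empty at 03:00.
for day, busy, quiet in zip(mondays, (40, 44, 36, 40), (2, 4, 2, 4)):
    store.observe("lobby", day.replace(hour=16), busy)
    store.observe("lobby", day.replace(hour=3), quiet)

next_monday = datetime(2025, 2, 3)
peak = store.deviation("lobby", next_monday.replace(hour=16), 42)
night = store.deviation("lobby", next_monday.replace(hour=3), 12)
print(f"42 at 16:00 -> {peak:.1f} deviations from baseline")
print(f"12 at 03:00 -> {night:.1f} deviations from baseline")
```

A reading of 42 is unremarkable at check-in time, while a much smaller reading of 12 stands out sharply at 3am. A single global threshold could not separate the two.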
The validation question
How was the accuracy measured? There are three common methods, and each tells you something different.
Held-out test set. The vendor trained the model on one part of a dataset and tested it on a separate part the model never saw during training. This is the standard academic approach. It is honest, but it tells you about model performance on that dataset, not in your environment.
Live shadow deployment. The vendor ran the model alongside a human observer for a period in a real environment. The observer's judgements were treated as ground truth. This is closer to the real world but depends on the observer being accurate.
Customer-reported. The vendor reports accuracy based on what customers self-report. This is the weakest because customers do not always know about incidents the system missed. You are getting the visible accuracy, not the actual accuracy.
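The gap between visible and actual accuracy is easiest to see as arithmetic on recall. In the sketch below all counts are hypothetical, and the "missed and never noticed" figure is by definition something a customer cannot report.

```python
def recall(caught: int, total_incidents: int) -> float:
    """Share of incidents the system caught."""
    if total_incidents <= 0:
        raise ValueError("total_incidents must be positive")
    return caught / total_incidents


caught_by_system = 45
missed_and_known = 5     # found out later, e.g. from a guest complaint
missed_and_unknown = 10  # nobody ever connected these to the system

# What the customer can report: only incidents they know about.
visible = recall(caught_by_system, caught_by_system + missed_and_known)
# What actually happened, including misses nobody noticed.
actual = recall(
    caught_by_system,
    caught_by_system + missed_and_known + missed_and_unknown,
)

print(f"visible recall: {visible:.0%}")
print(f"actual recall:  {actual:.0%}")
```

The customer-reported figure can only ever be equal to or higher than the true one, which is why it is the weakest of the three methods.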
The strongest vendors will run a shadow deployment in your environment before you commit. If a vendor will not do that, treat the headline number with caution.
Different behaviours have different accuracy ranges
Not every behaviour is equally detectable. Some categories are well-defined and well-trained. Others are inherently fuzzier.
From our published platform metrics:
- Drink spiking detection: 98.7% precision. The behaviour is well-defined (hand over drink, specific movement patterns) and the training data is now extensive.
- Aggression detection: 94.2% precision. Subtle in the pre-conflict window but unambiguous once the fight starts.
- Exclusion zone breach: Above 99% in defined environments. The behaviour is geometric: was the person inside the zone or not?
- Unattended item detection: Variable. Depends heavily on environment. Better in lobbies than busy retail floors.
- Crowd density estimation: Typically 5 to 8% error margin on absolute counts. Better as a relative-change signal than an absolute headcount.
The honest answer is that a single headline accuracy number masks huge variation by behaviour. The buyer-friendly approach is to look at performance on the three or four behaviours you actually care about for your venue type.
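A short sketch of how pooling hides that variation. The alert counts below are hypothetical and are not the platform metrics listed above; they only show the mechanism, where a high-volume, easy behaviour dominates the pooled figure.

```python
# Hypothetical alert counts per behaviour: (correct alerts, total alerts fired)
alerts = {
    "exclusion zone breach": (990, 1000),
    "unattended item": (30, 50),
}

per_behaviour = {
    name: correct / total for name, (correct, total) in alerts.items()
}

# Pooled precision: every alert counted together, regardless of behaviour.
pooled_correct = sum(correct for correct, _ in alerts.values())
pooled_total = sum(total for _, total in alerts.values())
pooled = pooled_correct / pooled_total

for name, value in per_behaviour.items():
    print(f"{name}: {value:.0%} precision")
print(f"pooled headline: {pooled:.1%} precision")
```

A venue that mainly cares about unattended items would be badly served by the pooled number, even though nothing about it is false.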
What good looks like in practice
For a UK hospitality or construction deployment after 90 days of in-environment tuning, what we see consistently:
- Precision above 92% across the main detection categories
- False positives below 1 per camera per 24-hour period
- Recall (where measurable against known incidents) above 90%
- End-user trust scores above 8 out of 10 (the system is taken seriously)
If those numbers are not where a vendor will agree to be measured after 90 days, you should ask why.
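For a trial review, the four thresholds listed above can be written down as an explicit check. This is a minimal sketch: the `TrialResult` type and its field names are hypothetical, and how each value is measured would need to be agreed with the vendor first.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    precision: float                    # 0.0 to 1.0
    false_positives_per_camera_day: float
    recall: float                       # 0.0 to 1.0, where measurable
    trust_score: float                  # end-user score out of 10


def missed_targets(result: TrialResult) -> list[str]:
    """Return the targets a 90-day trial result failed to meet."""
    missed = []
    if not result.precision > 0.92:
        missed.append("precision above 92%")
    if not result.false_positives_per_camera_day < 1:
        missed.append("false positives below 1 per camera per day")
    if not result.recall > 0.90:
        missed.append("recall above 90%")
    if not result.trust_score > 8:
        missed.append("trust score above 8 out of 10")
    return missed


# Hypothetical trial outcome: strong precision, but too many false alerts.
trial = TrialResult(
    precision=0.95,
    false_positives_per_camera_day=1.4,
    recall=0.93,
    trust_score=8.5,
)
print(missed_targets(trial))
```

An empty list means every target was met. Anything else is the agenda for the conversation with the vendor.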
The single best question to ask any vendor
"Would you be willing to put a precision and false-positive commitment in the contract, with a defined exit if you do not hit it after 90 days?"
Vendors who say yes are confident in their numbers. Vendors who hedge are usually quoting marketing accuracy, not engineering accuracy. The answer to this question filters the field fast.
How we report it at Archangel
Our published platform numbers (98.7% drink spiking, 94.2% aggression, sub-2-second detection) are measured post-learning in real venue environments, not on held-out test sets. The two percentages are precision figures. We will commit to those numbers in deployment.
If you want to see the system run on your venue before committing, the two-month free trial includes a tuning period where we agree precision and false-positive targets in writing.