Seven questions to ask any AI vendor — and what a good answer sounds like
Every vendor says the model is accurate. These seven questions separate the ones who can prove it from the ones who are just loud — with the weak answer and the checkable one, side by side.
Buying AI is hard because everyone sounds the same. The demos dazzle, the benchmarks are quoted selectively, and every vendor is “state of the art.” Here are seven questions that cut through it. For each, you get the answer that should make you nervous and the answer you can actually check.
1. How do you measure whether the model is right?
Weak: “We score 9X% on [benchmark].” Checkable: “We score every probabilistic claim against ground truth with a proper scoring rule, and we publish the misses.” A single accuracy number on a curated benchmark tells you almost nothing about behaviour in your domain.
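To see why a lone accuracy figure hides so much, here is a minimal Python sketch. It uses the logarithmic score, one standard proper scoring rule. The two forecasters, their numbers, and the function names are invented for illustration and do not describe any vendor's actual method.

```python
import math

def accuracy(probs, outcomes, threshold=0.5):
    """Share of cases where the thresholded call matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)

def mean_log_loss(probs, outcomes, eps=1e-12):
    """Average logarithmic score (lower is better); a proper scoring rule."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -math.log(p if y else 1 - p)
    return total / len(outcomes)

# Hypothetical data: both forecasters make the same ten yes/no calls
# and are wrong on the same two, but one states far more confidence.
outcomes = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
modest = [0.7, 0.7, 0.7, 0.3, 0.7, 0.7, 0.7, 0.7, 0.3, 0.3]
brash = [0.99, 0.99, 0.99, 0.01, 0.99, 0.99, 0.99, 0.99, 0.01, 0.01]

for name, probs in (("modest", modest), ("brash", brash)):
    print(name, accuracy(probs, outcomes), round(mean_log_loss(probs, outcomes), 3))
```

Both forecasters are right on 8 of 10 calls, so accuracy cannot tell them apart. The log score can: the brash forecaster's two confident misses cost far more (roughly 0.93 against 0.53), which is the difference a proper scoring rule is designed to expose.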
2. When it’s confident, how often is it actually right?
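This is the calibration question, and it’s the one most vendors can’t answer. A model that says “95% sure” and is right 70% of the time will quietly talk your users into bad decisions. Checkable: a calibration curve (stated confidence plotted against the observed hit rate) or a Brier score (the mean squared gap between the stated probability and what happened, where lower is better), not a vibe.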
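The “95% sure, right 70% of the time” case can be checked in a few lines. The sketch below is one simple way to do it, assuming binary outcomes and equal-width confidence bins; the data and function names are made up for illustration.

```python
from collections import defaultdict

def brier_score(probs, outcomes):
    """Mean squared gap between stated probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def calibration_table(probs, outcomes, n_bins=10):
    """Per confidence bin: (mean stated confidence, observed hit rate, count)."""
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_conf = sum(p for p, _ in pairs) / len(pairs)
        hit_rate = sum(y for _, y in pairs) / len(pairs)
        rows.append((mean_conf, hit_rate, len(pairs)))
    return rows

# Hypothetical model: claims 95% on 20 answers, is right on 14 of them.
outcomes = [1] * 14 + [0] * 6
overconfident = [0.95] * 20
honest = [0.70] * 20  # same answers, confidence matching the hit rate

for mean_conf, hit_rate, n in calibration_table(overconfident, outcomes):
    print(f"stated {mean_conf:.2f}  observed {hit_rate:.2f}  n={n}")
print(brier_score(overconfident, outcomes), brier_score(honest, outcomes))
```

The table row shows the gap directly (stated 0.95, observed 0.70), and the Brier score penalises it: the overconfident version scores worse than the same answers reported at an honest 70%. A real evaluation would need many more cases per bin before the observed rates mean much.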
3. Can I reproduce your headline number myself?
Weak: “Here’s the slide.” Checkable: “Here’s a hash-locked record you can re-derive in your own browser.” If a result only exists inside their deck, it isn’t a result — it’s a claim.
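What “re-derive it yourself” could look like in practice is sketched below in Python. Everything about the record is an assumption made for illustration: the JSON layout, the field names, the canonicalisation (sorted keys, compact separators), and the choice of SHA-256 are hypothetical, not a description of any vendor's actual format.

```python
import hashlib
import json

def canonical_bytes(record):
    """Serialise deterministically so the same data always hashes the same."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")

def record_digest(record):
    """SHA-256 hex digest of the canonical form of the record."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()

def verify(record, published_digest):
    """True only if the record in hand matches the digest that was published."""
    return record_digest(record) == published_digest

def recompute_brier(record):
    """Re-derive the headline score from the raw forecasts and outcomes."""
    pairs = list(zip(record["forecasts"], record["outcomes"]))
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

# Hypothetical evidence record and the digest a vendor would have published.
record = {"forecasts": [0.9, 0.2, 0.7], "outcomes": [1, 0, 1]}
published = record_digest(record)  # stands in for the published value

tampered = dict(record, outcomes=[1, 1, 1])
print(verify(record, published), verify(tampered, published))
print(round(recompute_brier(record), 4))
```

The two steps are separate on purpose: the digest check shows the data has not changed since it was published, and the recomputation shows the headline number really follows from that data. Neither step requires trusting the vendor's slide.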
4. Who pays to be rated?
Ask whether the benchmark is issuer-pay, meaning the vendors being ranked are the ones funding the ranking. If a vendor can pay for its model to appear, the ranking is an ad. Checkable: an explicit, published policy against issuer-pay.
5. Show me a miss.
A vendor who can’t produce a public example of being wrong has either never measured its errors or is hiding them. Checkable: errors published next to the hits, on the record.
6. What happens when the model is unsure — does it tell me?
The most useful systems flag their own uncertainty instead of bluffing. Checkable: confidence you can take roughly at face value, because it’s been measured against outcomes rather than asserted.
7. What can you hand my diligence or risk team?
Weak: a reference call and a logo wall. Checkable: an evidence file your own people can re-compute — ideally mapped to whatever standard governs you.
We built AEQUARA to answer all seven the checkable way. The contrast itself — assert versus attest — is laid out dimension by dimension; the engine that scores any forecaster is the platform; and the artifact your risk team can re-derive is a Calibration Attestation. If a vendor can’t answer these, the safest assumption is that the answers aren’t flattering.