What Automated Accessibility Scanners Miss

Last updated May 24, 2026

Every accessibility vendor will sell you a scan. Almost none of them will tell you what their scanner cannot see. SweepHound runs two engines, axe-core and IBM Equal Access Checker, and we still cannot honestly promise to find every WCAG violation on your site. Nobody's scanner can. Deque's Automated Accessibility Coverage Report found that automated tests identified 57.38% of issue instances by volume in its dataset of 2,000+ audits across 13,000+ pages and ~300K issues. That is a share of issue *volume*, not a share of WCAG criteria that automation can fully verify, and it is the most generous published number from the engine vendor itself.

Even the W3C's official guidance says it plainly: automated tools can find some but not all accessibility issues. The judgment-dependent layer — meaningful alt text, keyboard flow, screen-reader announcements, focus visibility, error recovery, reading order — is the slice that decides whether a real person can actually use your site. This page is about exactly that layer: what gets missed, why, and what to do instead of pretending it does not exist. If you want the general overview first, our automated-vs-manual testing guide is the gentler starting point; this doc is the deep complement.

SweepHound's response is unglamorous: run two engines instead of one, group findings so the manual review is finite, and ship a remediation playbook that flags which fixes still need a human eye. We will never tell you a green scan means a compliant site. We will tell you what the scan covered, what it did not, and where to look next.

What scanners can verify deterministically

Automated rules excel at deterministic, machine-decidable checks — questions with a single correct answer the engine can compute from the DOM, computed styles, or the accessibility tree. These are the checks where humans tend to get bored and inconsistent, which is exactly where scanners shine.

Color-contrast math — text against background, computed from CSS values. Scanners pass or fail 4.5:1 (normal text) and 3:1 (large text or UI components) with zero ambiguity once the colors are resolvable.
Alt-text presence — every <img> either has an alt attribute or it does not. Same for SVG <title> and aria-label on icon buttons.
Form-label association — an <input> either has a matching <label for="…">, an aria-labelledby, or it does not. Machine-decidable.
Heading rank issues — skipping from h1 to h4, multiple h1 elements outside the intended landmark, empty headings.
ARIA attribute validity — unknown role names, attribute values that are not in the allowed enum, attributes used on the wrong role, conflicting aria-hidden on focusable elements.
Structural plumbing — duplicate IDs in the DOM, missing <html lang>, buttons and links with no accessible name, interactive elements missing required roles, tab order broken by tabindex values greater than zero.
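
To make the contrast check concrete, here is a minimal TypeScript sketch of the WCAG relative-luminance and contrast-ratio formulas. It assumes opaque six-digit hex colors that have already been resolved; the function names (parseHex, contrastRatio, passesTextContrast) are illustrative and are not taken from axe-core, IBM Equal Access, or SweepHound.

```typescript
// Contrast ratio as defined by WCAG 2.x. Illustrative helper names.
type Rgb = { r: number; g: number; b: number }; // 0-255 per channel

// Parse "#rrggbb". Assumption: no alpha channel and no 3-digit shorthand.
function parseHex(hex: string): Rgb {
  const m = /^#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(hex);
  if (!m) throw new Error(`Unsupported color: ${hex}`);
  return { r: parseInt(m[1], 16), g: parseInt(m[2], 16), b: parseInt(m[3], 16) };
}

// Convert one sRGB channel to linear light.
// WCAG 2.2 uses 0.04045 as the threshold; older text says 0.03928,
// which gives the same result for 8-bit channel values.
function channelToLinear(value: number): number {
  const c = value / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(color: Rgb): number {
  return (
    0.2126 * channelToLinear(color.r) +
    0.7152 * channelToLinear(color.g) +
    0.0722 * channelToLinear(color.b)
  );
}

// (lighter + 0.05) / (darker + 0.05), so the result ranges from 1 to 21.
function contrastRatio(foreground: string, background: string): number {
  const l1 = relativeLuminance(parseHex(foreground));
  const l2 = relativeLuminance(parseHex(background));
  const lighter = Math.max(l1, l2);
  const darker = Math.min(l1, l2);
  return (lighter + 0.05) / (darker + 0.05);
}

// 4.5:1 for normal text, 3:1 for large text.
function passesTextContrast(
  foreground: string,
  background: string,
  isLargeText: boolean
): boolean {
  return contrastRatio(foreground, background) >= (isLargeText ? 3 : 4.5);
}
```

With these definitions, #767676 on white comes out just above 4.5:1 and passes for normal text, while #777777 lands just below and only passes at the large-text threshold. The hard part on real pages is the step this sketch skips: resolving the effective colors when gradients, opacity, or images are involved.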

For context on how widespread these mechanical issues still are: per WebAIM Million 2025, the average number of detected errors per home page fell from 56.8 (2024) to 50.96 (2025). Next.js pages averaged 38.6 errors, 24.2% below average; Shopify averaged 69.6; WooCommerce 75.6; Magento 85.4. The 2026 report is published and shows the average ticking back up toward 56.1 per page, so the deterministic-check problem is not getting solved on its own. These are the issues a good scanner will catch. They are also, broadly, the easier half.

What scanners cannot evaluate

The judgment-dependent layer is everything that requires meaning, context, sequence, recovery, and perceived quality. A scanner that claims to evaluate these is either over-promising or quietly shipping false confidence. Here is the honest breakdown by category.

Meaning and context

Is the alt text actually meaningful? alt="image" and alt="IMG_2387.jpg" both pass the presence check and both fail the user.
Is link text meaningful out of context? Screen-reader users often navigate by pulling up a flat list of links — six instances of “Read more” and “Click here” defeat the point.
Is the page heading structure logically organized? The rank order can be valid while the outline tells a confusing story.
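
Automation cannot judge meaning, but it can shortlist the obvious suspects for a human. The sketch below is a heuristic under stated assumptions: the word lists and the names flagAltText and flagRepeatedGenericLinks are hypothetical, and a flag means "review this", not "this fails".

```typescript
// Heuristic pre-filter for human review. Word lists are illustrative assumptions.
type ReviewFlag = { text: string; reason: string };

const GENERIC_ALT = new Set(["image", "photo", "picture", "graphic", "img", "icon"]);
const GENERIC_LINK = new Set(["read more", "click here", "learn more", "more", "here"]);
const FILENAME = /\.(jpe?g|png|gif|webp|svg|avif)$/i;

function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Returns a flag when alt text is present but probably meaningless.
function flagAltText(alt: string): ReviewFlag | null {
  const value = normalize(alt);
  if (value === "") return null; // empty alt marks a decorative image; not judged here
  if (FILENAME.test(value)) return { text: alt, reason: "looks like a file name" };
  if (GENERIC_ALT.has(value)) return { text: alt, reason: "generic placeholder word" };
  return null;
}

// Counts generic link texts so repeated "Read more" links surface as one item.
function flagRepeatedGenericLinks(linkTexts: string[]): ReviewFlag[] {
  const counts = new Map<string, number>();
  for (const text of linkTexts) {
    const key = normalize(text);
    if (GENERIC_LINK.has(key)) counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries()).map(([text, n]) => ({
    text,
    reason: `generic link text used ${n} time(s)`,
  }));
}
```

Anything this filter does not flag still needs the human question: does the text describe what the image or link is for on this page?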

Keyboard usability

Most scanners check that interactive elements have a role and a tab stop. They do not Tab through your page in order and ask whether the experience is sane.

Does Tab order match visual order, or does it jump from the header to a hidden sidebar to a third-party widget and back?
Can you reach every interactive element with the keyboard alone — including custom dropdowns, date pickers, and modal close buttons?
Is there a keyboard trap inside a modal, mega-menu, or embedded iframe that prevents the user from leaving?
Do custom widgets implement the expected keyboard pattern (Esc to dismiss, arrow keys inside menus, Home / End in listboxes)? See our keyboard navigation guide for the manual checklist.
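
When Tab order jumps around, the cause is often the way browsers build the sequential focus order: elements with a positive tabindex come first in ascending order, then everything with tabindex 0 (or naturally focusable) in DOM order, and negative values are skipped. The sketch below models that rule on plain data; the Focusable type and function names are assumptions made for illustration.

```typescript
// Models sequential focus order from tabindex values. Input is in DOM order.
type Focusable = { id: string; tabindex: number };

function sequentialFocusOrder(elements: Focusable[]): string[] {
  const reachable = elements
    .map((el, domIndex) => ({ id: el.id, tabindex: el.tabindex, domIndex }))
    .filter((el) => el.tabindex >= 0); // negative tabindex is not reachable with Tab

  // Positive values first, ascending; ties keep DOM order.
  const positive = reachable
    .filter((el) => el.tabindex > 0)
    .sort((a, b) => a.tabindex - b.tabindex || a.domIndex - b.domIndex);

  // Then tabindex 0 in DOM order.
  const natural = reachable.filter((el) => el.tabindex === 0);

  return positive.concat(natural).map((el) => el.id);
}

// True when Tab visits reachable elements in the same order as the DOM.
function tabOrderMatchesDom(elements: Focusable[]): boolean {
  const domOrder = elements.filter((el) => el.tabindex >= 0).map((el) => el.id);
  const focusOrder = sequentialFocusOrder(elements);
  return domOrder.every((id, i) => id === focusOrder[i]);
}
```

This only explains the mechanical part. Whether the resulting order matches what a sighted user expects from the layout, and whether every widget responds to the right keys, is still a hands-on test.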

Screen-reader experience

A scanner can confirm a landmark exists. It cannot tell you whether the page sounds right.

How does the page actually announce on NVDA, VoiceOver, and JAWS? The same DOM can sound clear on one and garbled on another.
Are dynamic updates — cart totals, form errors, toast notifications, search-result counts — announced via the right aria-live politeness?
Are heading and landmark labels useful? “Region” and “Navigation” three times in a row tells the user nothing.
Does the reading order announced by the screen reader match the visual order? See our screen-reader testing guide for a 15-minute smoke test.
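
For dynamic updates, the two standard building blocks are role="status" (implicitly polite) and role="alert" (implicitly assertive). The sketch below picks attributes from an urgency level. The Urgency type, the liveRegionAttrs helper, and especially the UPDATE_URGENCY mapping are assumptions for illustration; which updates deserve to interrupt the user is a judgment call to confirm with a real screen reader.

```typescript
// Chooses live-region attributes from an urgency level.
type Urgency = "routine" | "urgent";

type LiveRegionAttrs = {
  role: "status" | "alert";
  "aria-live": "polite" | "assertive";
  "aria-atomic": "true";
};

function liveRegionAttrs(urgency: Urgency): LiveRegionAttrs {
  // role="alert" implies assertive; role="status" implies polite.
  // The explicit aria-live and aria-atomic values restate those implicit defaults.
  return urgency === "urgent"
    ? { role: "alert", "aria-live": "assertive", "aria-atomic": "true" }
    : { role: "status", "aria-live": "polite", "aria-atomic": "true" };
}

// Hypothetical mapping of update kinds to urgency. Adjust per product.
const UPDATE_URGENCY: Record<string, Urgency> = {
  "cart-total": "routine",
  "search-result-count": "routine",
  toast: "routine",
  "form-error": "urgent",
};
```

A scanner can confirm these attributes exist. It cannot confirm the announcement is timely, audible over other speech, or worded so that it makes sense without the visual context.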

Visual focus

Are focus indicators actually visible against the content behind them? A 2px focus ring that disappears against your brand-colored hero passes the “outline is not none” check and fails real users.
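Do focus styles meet WCAG 2.2 Focus Appearance (SC 2.4.13, Level AAA), meaning an indicator area at least as large as a 2 CSS pixel thick perimeter of the component, with at least 3:1 contrast between the same pixels in the focused and unfocused states? Is the focused element left unobscured by sticky headers and overlays, as Focus Not Obscured (SC 2.4.11) requires? Both need layout-aware evaluation that general-purpose scanners do not perform reliably.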
Does focus move predictably after asynchronous events — opening a modal, deleting a row, submitting a form?

Error recovery and flows

When a form errors, can a screen-reader user actually recover? Is the first invalid field focused? Is the error message programmatically associated with the input via aria-describedby?
Are checkout, payment, and authentication failures made obvious and unambiguous — not just visually red, but announced?
Can the user undo a destructive action, or is the “Are you sure?” dialog itself inaccessible?
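
The first two questions have a mechanical core that is easy to get right in code. This sketch, assuming a hypothetical FieldState shape and an error-element naming scheme of "<field id>-error", derives the attributes an invalid input should carry and picks the field to focus after a failed submit.

```typescript
// Fields are listed in DOM order. `error` is set only when validation failed.
type FieldState = { id: string; error?: string };

type FieldAttrs = { "aria-invalid"?: "true"; "aria-describedby"?: string };

// Hypothetical convention: the error message element uses this id.
function errorIdFor(fieldId: string): string {
  return `${fieldId}-error`;
}

// Attributes that tie the visible error message to its input.
function fieldAttrs(field: FieldState): FieldAttrs {
  if (!field.error) return {};
  return { "aria-invalid": "true", "aria-describedby": errorIdFor(field.id) };
}

// The field that should receive focus after a failed submit, or null if all valid.
function firstInvalidFieldId(fields: FieldState[]): string | null {
  const first = fields.find((field) => Boolean(field.error));
  return first ? first.id : null;
}
```

Wiring the attributes is the part a scanner can partly verify. Whether the message is understandable and tells the user how to fix the problem remains a manual check.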

Reading order vs. visual order

CSS Grid and Flexbox routinely produce layouts where visual order and DOM order diverge. The order property, flex-direction: row-reverse, and absolutely positioned content all create traps where what the eye sees and what the screen reader announces are different stories. No scanner can reliably score whether the difference is acceptable. The same issue surfaces in responsive layouts that re-order columns at narrow breakpoints: a sidebar that visually moves above the main content on mobile is still announced after main if the DOM was not reordered, which is jarring for sighted keyboard users and for screen-reader users who rely on visual cues from sighted collaborators.
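
A tool can at least surface where the two orders differ, leaving the "is this acceptable?" call to a person. The sketch below is a simplified heuristic on plain data: it assumes each region's top and left offsets have already been measured (for example from bounding boxes at one breakpoint), and the Box type and function names are hypothetical.

```typescript
// Regions are listed in DOM order with their measured position at one viewport width.
type Box = { id: string; top: number; left: number };

type OrderMismatch = { id: string; domIndex: number; visualIndex: number };

// Simplified visual order: top to bottom, then left to right.
// Assumption: left-to-right script and items in one row share the same `top`.
function visualOrder(boxes: Box[]): string[] {
  return boxes
    .slice()
    .sort((a, b) => a.top - b.top || a.left - b.left)
    .map((box) => box.id);
}

// Lists every region whose DOM position differs from its visual position.
function orderMismatches(boxes: Box[]): OrderMismatch[] {
  const visual = visualOrder(boxes);
  return boxes
    .map((box, domIndex) => ({ id: box.id, domIndex, visualIndex: visual.indexOf(box.id) }))
    .filter((entry) => entry.domIndex !== entry.visualIndex);
}
```

For the mobile sidebar case described above, the output would list both main and sidebar as out of position. That is a prompt for review, not a verdict, because some divergences are harmless.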

Caption and transcript quality

A scanner can detect a missing <track kind="captions">. It cannot tell you the captions are auto-generated nonsense, out of sync, or missing speaker labels.
Audio descriptions for video content are a manual review item — their presence, accuracy, and synchronization cannot be measured programmatically.
Transcripts for podcasts and audio-only content are a content audit, not a code audit.

Why two engines beat one

axe-core and IBM Equal Access Checker are both reputable, both actively maintained, and both incomplete. They have meaningfully different rule sets, different default severities, and different coverage of ARIA design patterns. A rule that ships in one engine may not exist — or may exist with different thresholds — in the other. Red Hat's engineering team, who use both internally, summarize it as: running Equal Access Checker in conjunction with a tool like axe DevTools can be useful for detecting a wider array of violations than either tool alone.

That is the entire argument for SweepHound's dual-engine setup. Two engines do not get you to 100% — combining them does not approach total coverage, and we will not pretend otherwise — but they widen the deterministic floor. In our own dataset, the second engine consistently surfaces a meaningful set of issues the first engine alone would not have flagged: ARIA pattern misuse, content structure checks, and a different cut of contrast edge cases. We dedupe the overlap so you do not see the same issue twice, and we group by element so a single button does not generate four findings from four rules.
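
As an illustration of the dedupe-and-group idea (not SweepHound's actual implementation), the sketch below merges findings from two engines when they describe the same issue on the same element, then groups the result by element. The Finding shape and the normalized issue field are assumptions; mapping each engine's rule IDs to a shared issue category is the real work and is not shown.

```typescript
type Engine = "axe-core" | "equal-access";

// `issue` is a hypothetical normalized category shared across engines.
type Finding = { engine: Engine; ruleId: string; selector: string; issue: string };

type MergedFinding = { selector: string; issue: string; engines: Engine[]; ruleIds: string[] };

// One merged finding per (element, issue) pair, remembering which engines reported it.
function dedupeFindings(findings: Finding[]): MergedFinding[] {
  const merged = new Map<string, MergedFinding>();
  for (const f of findings) {
    const key = JSON.stringify([f.selector, f.issue]);
    const existing = merged.get(key);
    if (!existing) {
      merged.set(key, {
        selector: f.selector,
        issue: f.issue,
        engines: [f.engine],
        ruleIds: [f.ruleId],
      });
      continue;
    }
    if (existing.engines.indexOf(f.engine) === -1) existing.engines.push(f.engine);
    if (existing.ruleIds.indexOf(f.ruleId) === -1) existing.ruleIds.push(f.ruleId);
  }
  return Array.from(merged.values());
}

// Groups merged findings so each element appears once with all of its issues.
function groupByElement(findings: MergedFinding[]): Map<string, MergedFinding[]> {
  const groups = new Map<string, MergedFinding[]>();
  for (const f of findings) {
    const group = groups.get(f.selector) ?? [];
    group.push(f);
    groups.set(f.selector, group);
  }
  return groups;
}
```

Keeping the engines list on each merged finding preserves a useful signal: an issue reported by both engines is less likely to be noise than one reported by only one.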

Some concrete examples of where the two engines diverge in practice: axe-core is the stronger default on color contrast and on heading and landmark rule logic, with the broadest community ruleset and the most active maintenance cadence. IBM Equal Access tends to be stricter on table semantics — header associations, summary attributes, layout-table misuse — and on the older ARIA design patterns where authoring practice has drifted from the spec. Equal Access also evaluates a handful of content checks (sensory characteristics, keyboard-only timing assumptions, consistent navigation) that axe-core deliberately leaves to manual review. Neither engine is “better” in the abstract; they are complementary, which is why we run both and why our findings list reflects the union rather than the intersection.

The honest framing: dual-engine raises the floor of automated detection. It does not lift the ceiling, because the ceiling is a human with a screen reader. Anything sold as “AI-powered 95%+ coverage” is almost certainly counting issues a real WCAG auditor would not count, or measuring against a different denominator. Ask vendors which study they are citing and what their definition of “coverage” is.

How to build a testing workflow that doesn't lie to you

A workflow that combines deterministic scanning with finite manual review is what gets you to defensible WCAG conformance. We recommend this six-step loop for every team shipping a public site:

1. Scan with two engines. Run a dual-engine automated scan on every push, or at minimum weekly, against your full sitemap — not just the homepage. Most production issues live in older templates and content-team pages, not the routes engineering is paying attention to.
2. Review grouped findings. Triage at the rule level, not the instance level. One missing label rule fixed in your form component closes hundreds of instances at once. SweepHound groups automatically so this is the default view.
3. Keyboard smoke test the top 5 pages. Unplug your mouse. Tab through your homepage, signup, key product page, checkout (or equivalent conversion path), and account settings. If any step feels confusing, that is a finding.
4. Screen-reader pass on critical flows. VoiceOver on macOS and NVDA on Windows are both free. Twenty minutes per quarter on your conversion path will surface more real issues than another month of CI scanning.
5. Publish an honest accessibility statement. Say what you tested, when, with which tools, and which known issues remain. Tell users how to report problems. WCAG conformance is a process; the statement is the receipt.
6. Re-scan on a schedule. Accessibility regresses every time a content editor pastes a new image without alt text or a developer ships a new component. Make re-scanning a calendar event, not a heroic effort.
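
Step 2's rule-level triage can be sketched as a simple aggregation. The Instance and RuleSummary types below are hypothetical and do not match any particular scanner's report format; the point is only that sorting by instance count shows which single fix closes the most findings.

```typescript
// One entry per flagged element on a page.
type Instance = { ruleId: string; pageUrl: string; selector: string };

type RuleSummary = { ruleId: string; instances: number; pages: number };

// Collapses instances into one row per rule, largest first.
function summarizeByRule(instances: Instance[]): RuleSummary[] {
  const byRule = new Map<string, { count: number; pages: Set<string> }>();
  for (const instance of instances) {
    const entry = byRule.get(instance.ruleId) ?? { count: 0, pages: new Set<string>() };
    entry.count += 1;
    entry.pages.add(instance.pageUrl);
    byRule.set(instance.ruleId, entry);
  }
  return Array.from(byRule.entries())
    .map(([ruleId, entry]) => ({ ruleId, instances: entry.count, pages: entry.pages.size }))
    .sort((a, b) => b.instances - a.instances || a.ruleId.localeCompare(b.ruleId));
}
```

A rule that fires many times across many pages usually points at a shared component or template, which is where the fix belongs.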

Our manual accessibility checklist will live alongside this page once published — it covers the step-3 and step-4 checks in concrete script form.

Common false positives in automated scans

The other half of honesty is admitting that scanners, including ours, flag things that are not actually broken. A good remediation workflow distinguishes real defects from noise. The most common patterns we see, along with two inverse cases where the rule passes and the user still loses:

Decorative images flagged as missing alt text: when an <img> is intentionally decorative and uses alt="", some rule configurations still surface it for review. The fix is to confirm the intent, not always to add alt text.
aria-label eaten by translation tools: automated translators sometimes mangle aria-label values, producing scanner warnings that are actually a translation pipeline bug rather than a markup bug.
Landmark duplication on intentional regions: multiple navigation landmarks on a page are valid if labeled distinctly. Rules that flag duplication without checking labels generate noise.
Placeholder used as visible label (inverse case): the input has a programmatic label, so the rule passes. The user still cannot see the label once they start typing. It passes the machine and fails the user.
Link text weak in context (inverse case): a link reading “Learn more” passes the “non-empty accessible name” check. It still fails the real-user test of being meaningful when announced out of context.
Contrast under transformed text: text with mix-blend-mode, gradient backgrounds, or overlaid video can be flagged at the base color but visually pass at the rendered pixel. SweepHound resolves many of these via solver-based contrast checking and downgrades the rest to explanation-only findings.
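
The decorative-image case turns on the difference between a missing alt attribute and an intentionally empty one. A minimal sketch of that distinction, using hypothetical ImgAttrs and AltStatus types, might look like this:

```typescript
// Attribute values as read from the element; undefined means the attribute is absent.
type ImgAttrs = { alt?: string; role?: string; "aria-hidden"?: string };

type AltStatus = "missing" | "decorative" | "has-text";

function classifyAlt(img: ImgAttrs): AltStatus {
  // Explicitly removed from the accessibility tree: treat as decorative.
  if (img["aria-hidden"] === "true" || img.role === "presentation" || img.role === "none") {
    return "decorative";
  }
  if (img.alt === undefined) return "missing"; // no alt attribute at all
  if (img.alt === "") return "decorative"; // alt="" is the standard decorative marker
  // Assumption: whitespace-only alt is treated as missing, the conservative choice.
  return img.alt.trim() === "" ? "missing" : "has-text";
}
```

A "decorative" result still deserves a quick look, since an image marked decorative may in fact carry information. That confirmation is the human step the rule cannot perform.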

False positives are not a reason to ignore scanner output. They are a reason to triage with judgment rather than mechanical ticket creation. The right question is never “does this rule fire?” but “does this issue degrade the experience for an assistive-tech user?”

SweepHound's remediation engine handles the most common false positives automatically. Contrast rules that fire on a base color but pass at the rendered pixel resolve to explanation-only rather than a code fix. Control-name family issues are deduped so a single DOM element does not generate multiple overlapping fixes from different rules. And every finding carries enough surrounding context — preceding heading level, ancestor data attributes, computed styles, nearby text — that the human reviewer can make the meaning call quickly instead of paging through the live site. The goal is not zero false positives; the goal is a triage queue that respects your time.

Frequently asked questions

Why does my site fail an automated scan but look fine?

Scanners evaluate the DOM and computed styles, not the rendered visual outcome a sighted user perceives. Decorative styling, sufficient contrast at the rendered pixel, or visually obvious labels can all pass human inspection while failing rule checks that only see the underlying HTML. Some of these are genuine defects (an assistive-tech user gets a worse experience than you do); others are false positives worth dismissing with a note. Either way, “looks fine” is not the WCAG conformance bar.

Can a scanner certify my site as WCAG compliant?

No. No scanner, SweepHound included, can certify WCAG conformance. Per W3C’s test-evaluate guidance, automated tools can find some but not all accessibility issues. Conformance requires evaluating success criteria that depend on meaning, sequence, and assistive-technology behavior, which is manual work. Anyone selling automated certification is either redefining the word or overpromising.

Should I run both axe and IBM Equal Access?

Yes, if you can. The two engines have meaningfully different rule sets and surface different violations. Running both widens the deterministic detection floor: a second engine can add detections but cannot hide anything the first one found. SweepHound runs them in parallel and dedupes the overlap so you see one finding per actual issue rather than two. If you can only run one, axe-core is the broader default; IBM Equal Access is stronger on ARIA pattern checks.

What about AI-powered scanners that claim 95% coverage?

Be skeptical. Coverage percentages depend entirely on the denominator. Deque reported automated tests identified 57.38% of issue instances by volume in its dataset (2,000+ audits, 13,000+ pages, ~300K issues). That is the most generous published figure from the engine vendor itself. Numbers above ~60% typically count issues that a manual WCAG auditor would not count, or measure against a self-selected subset of success criteria. Ask any vendor citing 90%+ for their methodology, denominator, and external audit data before believing it.

How often should I re-scan?

For active sites, weekly is a reasonable cadence; for sites with daily content publishing (commerce, news, marketing), daily scans on the changed sitemap catch regressions before they compound. The cheapest mistake is scanning once at launch and never again, because content-team edits and third-party widget updates introduce new violations all the time. Manual smoke testing should run quarterly, with a full WCAG audit annually.

How SweepHound's dual-engine plus manual checklist works

SweepHound's product is built around the honest framing on this page. Concretely, here is what you get:

Dual-engine scanning — axe-core and IBM Equal Access run in parallel on every scan. We dedupe the overlap and group by element so one button never produces four tickets.
Full-site crawling — we scan the entire sitemap, not just the home page. Most violations live in the long tail of templates and content pages, which is exactly where single-page scanners and CI checks have blind spots.
Deterministic remediation first — for mechanical fixes (alt-text presence, label association, contrast adjustments) we generate concrete code edits. We only fall back to an LLM when the fix is genuinely ambiguous, and we never let the model invent unsafe markup.
Manual review prompts on every scan — we tell you explicitly which findings need a human eye (meaning, focus order, screen-reader behavior) and link to the relevant manual-check guides. The scan report ends with the manual smoke-test list, not a green checkmark.
Authenticated scanning — pages behind a login wall are where commerce and SaaS accessibility actually lives. We support credentials and stored session state so the scanner sees what your real users see.
Honest accessibility statements — we generate a per-scan public statement listing what was tested, what issues remain, and how to report problems. No green-badge theater.

If you want to see what a dual-engine scan turns up on your own site, start a free scan — you get a full crawl, grouped findings, and the manual-review checklist on the free tier. Paid tiers add full-site authenticated scanning, scheduled re-scans, and the statement generator; see pricing for details.

Already convinced? Sign up and run your first scan in under a minute. We will tell you what we found and — just as importantly — what we could not.

Sources

Deque, Automated Accessibility Coverage Report — Deque reported automated tests identified 57.38% of issue instances by volume in its dataset (2,000+ audits across 13,000+ pages, ~300K issues).
WebAIM Million 2025 — Average errors per home page: 56.8 (2024) → 50.96 (2025). Next.js 38.6, Shopify 69.6, WooCommerce 75.6, Magento 85.4.
W3C WAI, Evaluating Web Accessibility Overview — Official W3C framing that automated tools can find some but not all accessibility issues.
IBM Equal Access Toolkit — Tools — The IBM Equal Access engine SweepHound runs alongside axe-core. Combining engines widens coverage but does not approach 100%.