How Testly Ensures Test Accuracy and Integrity
A commitment to fair, reliable AI literacy assessment
At Testly, we design our assessments to measure real AI literacy, not familiarity with specific tools, prompts, or memorized answers. Because AI usage evolves rapidly, traditional static tests are no longer sufficient. This document explains, at a high level, how we ensure accuracy, fairness, and resistance to manipulation, without exposing internal mechanisms that could compromise test integrity.
Research Foundation
Testly's assessment framework is grounded in recent research (2022-2025) from leading institutions and organizations, including McKinsey, BCG, and MIT Sloan, and is validated through real-world implementations at organizations such as JPMorgan Chase, Unilever, and Amazon.
Our approach builds on documented findings:
- Organizations with comprehensive AI literacy programs achieve 2-4x ROI within 18-24 months
- Productivity gains range from 20-60% depending on employee competency level
- AI leaders experience 1.5x higher revenue growth compared to peers (BCG research)
- Skill progression follows clear, measurable patterns across roles and industries
Foundation principles:
- Competency assessment must measure judgment and decision-making, not theoretical knowledge
- Different levels require qualitatively different capabilities, not just "more of the same"
- Real-world application matters more than tool-specific expertise
- Long-term validity requires resistance to pattern learning and memorization
Four-Level Competency Framework
Testly assesses AI literacy across four distinct levels. Each represents a qualitative shift in capability, not simply increased knowledge.
Level 1: Competent (Foundation)
Basic awareness and supervised execution. Users can complete simple, well-defined tasks with guidance.
Characteristics: Recognizes AI capabilities and limitations, follows established procedures, requires regular support
10-15% productivity gain through task automation
Level 2: Proficient (Intermediate)
Operational independence and systematic integration. Users optimize workflows and work without constant supervision.
Characteristics: Creates structured prompts, applies critical evaluation, integrates AI into complex processes, shares knowledge with peers
20-30% productivity gain through workflow optimization
Level 3: Adaptive (Advanced)
Innovation and process transformation. Users redesign work fundamentally and mentor others.
Characteristics: Develops custom solutions, leads implementation projects, creates organizational frameworks, drives cultural change
30-50% efficiency gains through process transformation
Level 4: Strategic (Expert/Leader)
Strategic influence and organizational transformation. Leaders shape AI strategy and culture at scale.
Characteristics: Develops organizational AI strategy, establishes governance frameworks, influences executive decisions, demonstrates thought leadership
40-60%+ organizational efficiency and competitive advantage
Qualitative differentiation:
Progression between levels represents fundamental shifts in thinking and impact, not incremental improvements. A Level 2 user doesn't just know "more" than a Level 1 user; they approach problems differently, make different types of decisions, and create different kinds of value.
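To make the structure concrete, the framework can be modeled as a simple data structure. The sketch below is purely illustrative; the class and field names are assumptions, not Testly's internal representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompetencyLevel:
    """One level in the four-level AI literacy framework (illustrative model)."""
    number: int
    name: str
    focus: str               # the qualitative capability shift at this level
    productivity_gain: str   # documented range from the research cited above

FRAMEWORK = [
    CompetencyLevel(1, "Competent", "supervised execution of well-defined tasks", "10-15%"),
    CompetencyLevel(2, "Proficient", "independent workflow optimization", "20-30%"),
    CompetencyLevel(3, "Adaptive", "process transformation and mentoring", "30-50%"),
    CompetencyLevel(4, "Strategic", "organization-wide strategy and governance", "40-60%+"),
]

for level in FRAMEWORK:
    print(f"Level {level.number} ({level.name}): {level.focus} -> {level.productivity_gain}")
```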
1. What "accuracy" means in AI literacy testing
For Testly, accuracy does not mean trivia recall or theoretical knowledge. It means:
- Measuring judgment, not rote answers
- Evaluating how people reason with AI outputs, not how well they know AI terminology
- Distinguishing between levels of practical competence, from basic use to strategic thinking
An accurate test is one where:
- the result reflects real-world behavior,
- the score remains meaningful across time,
- and the assessment cannot be "gamed" by learning patterns.
2. Separation of roles: generation is not evaluation
A core design principle is separation of concerns.
- Content is generated dynamically
- Evaluation follows independent validation rules
- No single component determines outcomes on its own
In simple terms: no scenario is trusted just because it was generated.
This separation prevents arbitrary or biased scoring, single points of failure, and uncontrolled drift in item quality.
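A minimal sketch of this principle, assuming a simple generator/validator split; every name and check below is a hypothetical stand-in rather than Testly's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    prompt: str
    options: list[str]

def generate_scenario(seed: int) -> Scenario:
    """Hypothetical stand-in for the dynamic content generator."""
    return Scenario(
        prompt=f"Scenario {seed}: a colleague pastes unverified AI output into a client report.",
        options=["Publish as-is", "Verify key claims first", "Discard the output entirely"],
    )

def validate_scenario(scenario: Scenario) -> bool:
    """Independent validation: the evaluator never trusts the generator."""
    has_prompt = bool(scenario.prompt.strip())
    has_enough_options = len(scenario.options) >= 3
    return has_prompt and has_enough_options

def admit_item(seed: int) -> Scenario | None:
    """Only scenarios that pass the independent checks reach a live test."""
    scenario = generate_scenario(seed)
    return scenario if validate_scenario(scenario) else None

print(admit_item(1))  # a Scenario object, because it passed validation
```

The point of the pattern is that generation and admission are separate decisions, so a defect in one component cannot silently reach test-takers.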
3. Multiple layers of quality control
Every test item passes through multiple independent checks before being used. These controls verify that:
✓ The scenario is realistic and work-relevant
✓ The question truly requires judgment
✓ Answer options are plausible and balanced
✓ No option is obviously "signaled" as correct
Items that do not meet quality criteria are automatically adjusted or discarded. This process runs continuously, not as a one-time review.
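Conceptually, these controls behave like a chain of independent gates that an item must clear. The sketch below illustrates the pattern with hypothetical checks; the real criteria are deliberately not disclosed.

```python
def is_work_relevant(item: dict) -> bool:
    """Hypothetical gate: does the scenario carry a realistic work context?"""
    return bool(item.get("workplace_context"))

def requires_judgment(item: dict) -> bool:
    """Hypothetical gate: is the answer a judgment call, not a lookup?"""
    return item.get("answer_type") == "judgment"

def options_balanced(item: dict) -> bool:
    """Hypothetical gate: options are similar in length, so none is 'signaled'."""
    lengths = [len(option) for option in item.get("options", [])]
    return bool(lengths) and max(lengths) - min(lengths) <= 40

QUALITY_GATES = [is_work_relevant, requires_judgment, options_balanced]

def passes_all_gates(item: dict) -> bool:
    """An item is used only if every independent gate accepts it."""
    return all(gate(item) for gate in QUALITY_GATES)

item = {
    "workplace_context": "quarterly report",
    "answer_type": "judgment",
    "options": ["Publish as-is", "Verify key claims first", "Escalate to a reviewer"],
}
print(passes_all_gates(item))  # True for this illustrative item
```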
Validity & Reliability
Test validity means the assessment actually measures what it claims to measure. Testly ensures this through:
Construct Validity
Items are designed to test real-world judgment and decision-making, aligned with behaviors observed in successful AI users across industries.
Predictive Validity
Assessment results correlate with on-the-job performance and productivity gains documented in organizational implementations.
Reliability
Consistent results across time and contexts. Dynamic generation ensures items remain fresh while maintaining measurement consistency.
Evidence-based metrics:
- Level progression aligns with documented productivity gains from industry research
- Competency indicators match behaviors validated in organizational case studies
- Assessment outcomes predict success in AI-enabled roles
- Results remain stable and meaningful as AI tools evolve
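As one illustration of how predictive validity can be quantified, the toy example below correlates assessment scores with a later on-the-job performance index. The numbers are invented for demonstration; they are not Testly data.

```python
import statistics

# Invented paired observations: assessment score and a later
# performance index for the same (hypothetical) people.
assessment_scores = [42, 55, 61, 70, 78, 85, 90]
performance_index = [1.1, 1.3, 1.2, 1.6, 1.7, 1.9, 2.1]

# Pearson correlation (statistics.correlation requires Python 3.10+).
# A strong positive value is one signal of predictive validity.
r = statistics.correlation(assessment_scores, performance_index)
print(f"score-performance correlation: {r:.2f}")
```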
4. Protection against memorization and pattern learning
Testly assessments are designed so that:
- Items are not static
- Answer patterns are not repeatable
- Knowing previous questions does not help with future ones
Because scenarios are varied and regenerated within controlled boundaries:
- there is no fixed question bank to memorize,
- no answer key that can be leaked,
- and no shortcut to higher scores without genuine competence.
This ensures long-term validity, even at scale.
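The sketch below illustrates the general idea of parameterized generation: items are assembled from controlled components rather than drawn from a fixed bank. The roles, tasks, and risks are invented; Testly's actual generation space is not public.

```python
import random

# Invented parameter pools; a real system would use far richer,
# validated components.
ROLES = ["marketing analyst", "HR manager", "software engineer"]
TASKS = ["summarizing customer feedback", "drafting a policy memo", "reviewing generated code"]
RISKS = ["a hallucinated statistic", "a confidentiality leak", "a licensing problem"]

def generate_item(rng: random.Random) -> str:
    """Assemble a fresh scenario from controlled parameters, so there is
    no fixed, memorizable question bank."""
    role, task, risk = rng.choice(ROLES), rng.choice(TASKS), rng.choice(RISKS)
    return (f"A {role} uses an AI assistant for {task} and notices {risk}. "
            "What should they do first?")

print(generate_item(random.Random()))  # a different item on every run
```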
5. Balanced difficulty and fair scoring
To avoid distorted results, Testly actively monitors and controls item difficulty and answer balance across the assessment.
The goal is not to "trap" users, but to ensure that success reflects understanding and that failure reflects genuine gaps, not trick questions.
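One concrete example of such a control is checking that correct answers do not cluster in any single option position, which test-takers could otherwise exploit. The sketch below is a hypothetical balance check, not Testly's scoring logic.

```python
from collections import Counter

def key_position_bias(answer_positions: list[int], n_options: int = 4) -> float:
    """Hypothetical fairness metric: largest relative deviation of
    correct-answer positions from a uniform distribution (0.0 = balanced)."""
    counts = Counter(answer_positions)
    expected = len(answer_positions) / n_options
    return max(abs(counts.get(i, 0) - expected) for i in range(n_options)) / expected

# Illustrative data: positions 0-3 of the correct option across 12 items.
positions = [0, 1, 2, 3, 1, 2, 0, 3, 2, 1, 0, 3]
print(f"max deviation from uniform: {key_position_bias(positions):.2f}")  # 0.00
```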
Industry Benchmarking & Standards
Testly's framework aligns with established best practices and observed patterns from leading organizations:
Industry validation:
- JPMorgan Chase: 200,000 employees trained, 20% sales increase in AI-enabled roles
- Unilever: 23,000 employees trained, 70,000 person-hours saved
- Amazon: 250,000+ employees trained through career development programs
- BCG: Generated $2.7B AI revenue (20% of total) from zero in 2 years
Our assessment framework reflects patterns observed across these implementations:
- Clear skill progression from basic execution to strategic leadership
- Measurable productivity gains at each competency level
- Emphasis on judgment and decision-making over tool knowledge
- Long-term skill development requiring 18-36 months for full maturity
Alignment with recognized standards:
- Competency-based assessment methodology
- Multi-level progression frameworks
- Real-world performance correlation
- Continuous validation and improvement cycles
6. Human judgment remains central
Although AI supports scale and item diversity, it does not replace human judgment in test design.
Human expertise defines:
- What is being measured
- Which behaviors indicate competence
- Where the boundaries between levels truly lie
AI supports this process, but does not redefine it autonomously.
7. Continuous monitoring and improvement
Test integrity is not a one-time achievement.
Testly continuously analyzes:
- Acceptance and rejection patterns
- Item performance trends
- Consistency across different roles and contexts
When anomalies appear, they are investigated and corrected. This ensures that the assessment remains stable, fair, and aligned with real-world AI usage.
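A simple way to picture this monitoring is a statistical outlier check over item pass rates, for example flagging an item that suddenly becomes "too easy" (a possible leak) or "too hard" (a possible generation defect). The sketch below uses invented data and a hypothetical threshold.

```python
import statistics

def flag_anomalies(pass_rates: dict[str, float], z_threshold: float = 2.0) -> list[str]:
    """Flag items whose pass rate deviates strongly from the population mean."""
    rates = list(pass_rates.values())
    mean, stdev = statistics.mean(rates), statistics.stdev(rates)
    return [item for item, rate in pass_rates.items()
            if stdev > 0 and abs(rate - mean) / stdev > z_threshold]

# Invented monitoring snapshot: item id -> observed pass rate.
snapshot = {"item-a": 0.60, "item-b": 0.62, "item-c": 0.58, "item-d": 0.61,
            "item-e": 0.59, "item-f": 0.63, "item-g": 0.97}
print(flag_anomalies(snapshot))  # ['item-g'] stands out for investigation
```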
8. What we explicitly do not do
To maintain trust and validity, Testly does not:
✗ Reuse fixed question sets
✗ Rely on single-pass AI generation
✗ Expose scoring logic or answer patterns
✗ Optimize tests for speed at the expense of accuracy
✗ Allow external tools to predict outcomes reliably
In summary
Testly assessments are built around one core principle:
AI literacy cannot be tested by static questions or memorized answers.
It must be assessed through judgment, context, and real-world decision-making.
Our approach combines dynamic generation, layered validation, and continuous oversight to ensure that results are accurate, fair, and resistant to manipulation, without sacrificing transparency or trust.