AI Evaluations

A cross-org effort to define how GenAI is measured for quality, safety, and production-readiness.

Bedrock Evaluations answers the question every GenAI builder faces: does it actually work? Are the responses accurate, safe, and useful? This tool gives teams a consistent way to measure quality through automated, human, and expert reviews, both during development and after applications go live.

Role:
Principal Designer leading end-to-end GenAI evaluation strategy and feature delivery across 4 AWS teams.

Responsibility:

  • Led UX strategy for 30+ features.
  • Mentored 4 junior designers across 3 teams.
  • Unified siloed teams under one eval strategy.
  • Aligned stakeholders from VP to CEO through workshops and design strategy.

Timeline:
Oct 2023 - Jan 2025

Problem statement

Unifying AWS Bedrock's Fragmented AI Evaluation

Evaluation is critical for AI applications, ensuring they perform as intended before deployment. With $110+ billion invested in AI by 2024, 85% of projects were failing to reach production due to inadequate evaluation.

When the AWS Bedrock team launched its evaluation feature, early signals showed it wasn't landing with users. Fewer than 100 evaluation jobs were created in 3 months. Without a shared vision, five service teams started building their own solutions, creating a fragmented experience. The real risk? Unchecked GenAI could cause harm.

This wasn’t just an engineering problem; it was a design problem demanding strategy, empathy, and org-wide trust.

Design thinking under pressure

Delivering model evaluation

With just one week to go, I was brought in to unblock this complex project. I applied design thinking, treating internal teams as customer proxies, to quickly uncover needs and align on delivery.

  • Empathize & Define: Interviewed cross-functional teams to understand user needs, mental models, and pain points. Found that evaluations were slow, inconsistent, and disconnected from real-world use. Teams lacked shared success criteria, relied on siloed tools, and struggled to scale manual reviews.
  • Ideate: Ran 9+ design iterations to map user needs into clear flows combining automation, expert input, and real-world use cases.
  • Prototype: Designed and tested 75+ screens in <2 weeks to explore workflows, validate assumptions, and capture edge cases.
  • Test & Align: Used prototypes to drive fast alignment, unblock delivery, and create a shared product vision.

Action

Forging a Unified Strategy: User Needs at the Core

To transform a failing feature into a cohesive platform solution, I knew we needed both deep user understanding and strong team alignment.

Comprehensive research initiative:

  • Survey with 250+ participants
  • 31+ customer interviews
  • Console telemetry analysis
  • Rapidly prototyped and tested solutions

2-day workshop with 80+ participants:
Research alone wasn't enough to align teams pulling in different directions. I partnered with the HIL designer to run a foundational 2-day workshop, bringing together 80+ participants across Bedrock, Q, HIL, and SageMaker.

Using sprint-style UX activities, we forced teams to revisit the basics: Who are our users? What do they actually need? What are we really solving for?

Breakthroughs

Three Key Insights that Transformed the Product

Through research and collaboration, we unlocked three game-changing insights.

First, we identified our primary user: the application developer, skilled in building apps but with limited AI expertise.

Second, we discovered a breakthrough evaluation method using AI to evaluate AI responses, resulting in 60% lower costs and 10x faster evaluations.
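
For context on what "AI to evaluate AI" means in practice, the sketch below illustrates the LLM-as-a-judge pattern: a judge model scores another model's answer against a rubric and returns a structured result. The `call_judge_model` helper and the rubric wording are hypothetical placeholders for illustration, not the actual Bedrock Evaluations API.

```python
# Minimal sketch of the "AI evaluates AI" (LLM-as-a-judge) pattern.
# call_judge_model is a hypothetical stand-in for any judge-model endpoint;
# it is NOT the Bedrock Evaluations interface.

import json

RUBRIC = """You are grading another model's answer.
Score it from 1 (poor) to 5 (excellent) on accuracy, safety, and helpfulness.
Return JSON: {"score": <int>, "reason": "<short explanation>"}"""


def call_judge_model(prompt: str) -> str:
    """Hypothetical placeholder: send the prompt to a judge model and
    return its raw text completion."""
    raise NotImplementedError("wire this to your model provider")


def judge_response(question: str, candidate_answer: str) -> dict:
    """Ask the judge model to score a candidate answer against the rubric."""
    prompt = (
        f"{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Candidate answer: {candidate_answer}\n"
    )
    raw = call_judge_model(prompt)
    return json.loads(raw)  # e.g. {"score": 4, "reason": "Accurate but verbose"}


# Usage idea: aggregate judge scores over an evaluation dataset
# instead of relying solely on manual expert review.
# results = [judge_response(q, a) for q, a in eval_dataset]
# mean_score = sum(r["score"] for r in results) / len(results)
```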

Third, we established a unified execution strategy across four teams, with 30+ features.


Design execution

From Strategy to Seamless Experience

With the strategy aligned and system foundations laid, I turned to design execution. I mapped essential user flows and created wireframes focused on clarity and task completion.

Through 15+ rounds of iteration, I refined the designs based on technical constraints, user feedback, and team input. This systematic approach ensured each feature delivered a cohesive experience that developers could trust.

Impact

Design-led impact at scale

re:Invent keynote

Featured in AWS re:Invent 2023 & 2024 in CEO and VP keynotes

+2,517% growth

in evaluation job creation

5+ AWS services

Adopted by Bedrock, Q, CloudWatch, HIL, SageMaker & MaxDome products

Design-led culture

120+ research-backed issues identified, 79 fixes shipped, 90% reduction in bugs.

Patent-pending

Model-as-a-Judge (MaaJ) framework filed for scalable model scoring

4 designers mentored

Mentored junior talent across tactical delivery and strategic systems thinking

Design-led culture shift

Turning Feedback into Usability Wins

After launch, I built a usability backlog of 120+ issues sourced from user research, support tickets, and hands-on testing—ranging from minor friction to key blockers. Each was validated with real users and prioritized by impact. Over 79 fixes shipped, reducing usability-related bugs by 66%. A CSAT score was introduced to measure satisfaction, and quarterly research cycles ensured usability remained a core part of the roadmap.

Feature walkthrough