Bedrock Evaluations answers the question every GenAI builder faces: does it actually work? Are the responses accurate, safe, and useful? This tool gives teams a consistent way to measure quality through automated, human, and expert reviews, both during development and after applications go live.
Role:
Principal Designer
Responsibility:
End-to-end GenAI evaluation strategy and feature delivery across 4 AWS teams
Timeline:
Oct 2023 - Jan 2025
Problem statement
Evaluation is critical for AI applications: it ensures they perform as intended before deployment. With more than $110 billion invested in AI by 2024, 85% of projects were still failing to reach production due to inadequate evaluation.
When the AWS Bedrock team launched its evaluation feature, early signals showed it wasn't landing with users: fewer than 100 evaluation jobs were created in 3 months. Without a shared vision, five service teams started building their own solutions, creating a fragmented experience. The real risk? Unchecked GenAI could cause harm.
This wasn't just an engineering problem; it was a design problem demanding strategy, empathy, and org-wide trust.
Design thinking under pressure
With just one week to go, I was brought in to unblock this complex project. I applied design thinking, treating internal teams as customer proxies to quickly uncover needs and align on delivery.
Action
To transform a failing feature into a cohesive platform solution, I knew we needed both deep user understanding and strong team alignment.
Comprehensive research initiative:
2-day workshop with 80+ participants:
Research alone wasn't enough to align teams pulling in different directions. I partnered with the HIL designer to run a foundational 2-day workshop, bringing together 80+ participants across Bedrock, Q, HIL, and SageMaker.
Using sprint-style UX activities, we forced teams to revisit the basics: Who are our users? What do they actually need? What are we really solving for?
Breakthroughs
Through research and collaboration, we unlocked three game-changing insights.
First, we identified our primary user: the application developer skilled in building apps but with limited AI expertise.
Second, we discovered a breakthrough evaluation method, using AI to evaluate AI responses, resulting in 60% lower costs and 10x faster evaluations (a code sketch of the pattern follows this list).
Third, we established a unified execution strategy across four teams, covering 30+ features.
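To make the model-as-a-judge idea concrete, here is a minimal Python sketch of the pattern, assuming the Bedrock Converse API via boto3; the judge model ID, region, rubric wording, and the judge() helper are illustrative assumptions, not the team's actual Maaj framework.

import json
import boto3

# Judge model client; assumes AWS credentials and Bedrock model access are configured.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # illustrative choice
RUBRIC = (
    "Rate the candidate answer for accuracy, safety, and usefulness on a 1-5 scale. "
    'Reply only with JSON: {"score": <int>, "reason": "<short explanation>"}'
)

def judge(prompt: str, candidate_answer: str) -> dict:
    """Ask the judge model to score another model's answer against the rubric."""
    message = f"{RUBRIC}\n\nUser prompt:\n{prompt}\n\nCandidate answer:\n{candidate_answer}"
    response = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": message}]}],
        inferenceConfig={"maxTokens": 256, "temperature": 0.0},  # deterministic scoring
    )
    text = response["output"]["message"]["content"][0]["text"]
    return json.loads(text)  # assumes the judge returns the requested JSON

# Example usage: score a response produced elsewhere in the application.
# result = judge("Summarize our refund policy.", generated_summary)
# print(result["score"], result["reason"])

Because the judge is just another model call, scoring can run in batch at machine speed instead of waiting on human reviewers, which is where the cost and speed gains over manual evaluation come from.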
Design execution
With the strategy aligned and system foundations laid, I turned to design execution. I mapped essential user flows and created wireframes focused on clarity and task completion.
Through 15+ rounds of iteration, I refined the designs based on technical constraints, user feedback, and team input. This systematic approach ensured each feature delivered a cohesive experience that developers could trust.
Impact
Featured at AWS re:Invent 2023 and 2024 in CEO and VP keynotes
Increase in evaluation job creation
Adopted by Bedrock, Q, CloudWatch, HIL, SageMaker, and MaxDome products
120+ research-backed issues identified, 79 fixes shipped, 90% reduction in bugs.
Model as a judge (Maaj) framework filed for scalable model scoring
Mentored junior talent across tactical delivery and strategic systems thinking
Design-led culture shift
After launch, I built a usability backlog of 120+ issues sourced from user research, support tickets, and hands-on testing, ranging from minor friction points to key blockers. Each issue was validated with real users and prioritized by impact. Over 79 fixes shipped, reducing usability-related bugs by 66%. A CSAT score was introduced to measure satisfaction, and quarterly research cycles ensured usability remained a core part of the roadmap.