Ceridwen.ai

Benchmarking & Evaluation Engineer

Full-Time  |  Remote (US)  |  Equity + Salary Upon Revenue

The Role

You are the immune system. You catch failures before they reach users, before they reach the board, and before anyone spends three months inventing plausible percentages. You build the frameworks that prove MABOS works, or prove it doesn't, which is more valuable.

What You'll Do

  • Design and implement comprehensive evaluation frameworks for all MABOS subsystems
  • Build automated testing pipelines that catch hallucination, drift, and degradation (see the drift-check sketch after this list)
  • Develop domain-specific benchmarks for cognitive architecture performance
  • Create monitoring systems that detect failure modes in production
  • Publish evaluation methodology and results to maintain transparency
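
As a flavor of the pipeline work above, here is a minimal drift-check sketch in Python. It uses a two-sample Kolmogorov-Smirnov test to flag when a scalar eval metric has shifted away from its baseline distribution; the metric values, function names, and 0.05 threshold are illustrative assumptions, not MABOS internals.

    # Minimal drift-check sketch. `baseline` and `current` are samples of a
    # scalar eval metric (e.g., per-response quality scores); the names and
    # the 0.05 threshold are illustrative, not MABOS-specific.
    from scipy.stats import ks_2samp

    def detect_drift(baseline, current, alpha=0.05):
        """Flag drift when the samples are unlikely to share a distribution."""
        statistic, p_value = ks_2samp(baseline, current)
        return p_value < alpha  # small p-value: distributions have diverged

    baseline = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.94]
    current = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.66, 0.72]
    if detect_drift(baseline, current):
        print("ALERT: metric distribution drifted from baseline")

In production this check would run per metric, per subsystem, against a rolling baseline window rather than a fixed list.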

Requirements

  • MS or PhD in computer science, statistics, or an adjacent field
  • 5+ years building evaluation and testing frameworks for ML/AI systems
  • Statistical literacy. You know the difference between a metric that matters and a metric that flatters (see the sketch after this list)
  • Failure-mode expertise. You identify and characterize failures in production AI systems
  • You have personally caught a critical failure that no one else noticed, and you can tell us about it
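
To make the "matters versus flatters" distinction concrete, here is a small illustrative sketch with invented numbers: when 95% of responses are fine, raw accuracy flatters a model that misses most failures, while recall on the rare failure class tells the real story.

    # Illustrative only: invented labels showing how accuracy can flatter
    # while recall on the rare failure class reveals the problem.
    from sklearn.metrics import accuracy_score, recall_score

    y_true = [0] * 95 + [1] * 5            # 1 = failure (rare), 0 = fine
    y_pred = [0] * 95 + [1, 0, 0, 0, 0]    # model catches 1 of 5 failures

    print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")       # 0.96, flatters
    print(f"failure recall: {recall_score(y_true, y_pred):.2f}")   # 0.20, matters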

Compensation

All positions include equity in Ceridwen.ai. Salaries will be determined by role scope, experience, and what you bring to the table. We will not insult you with a lowball offer, and we expect you not to waste our time with inflated expectations disconnected from contribution.

The Builder Clause

We don't care where you went to school. We don't care if you went to school. Our founder is self-taught, started coding at 13, and built a 602,000-line cognitive architecture without a CS degree.

Meet the qualifications, or show us what you've built. Either path works. Both paths demand excellence.

Apply

Ready to move? Send a short note with relevant proof of work: shipped projects, metrics you moved, or a draft plan for your first 30 days.