Conkurrence is an MCP server and AI evaluation toolkit that measures agreement and reliability across multiple large language model (LLM) providers, such as OpenAI, Bedrock, and Gemini. It assesses inter-rater reliability statistically, using metrics such as Fleiss' kappa and bootstrap confidence intervals, and also offers trend tracking, self-consistency checks, schema suggestion and validation, and cost estimation. Aimed at AI engineers, researchers, and evaluators, it integrates with AI assistants and clients that support the Model Context Protocol (MCP), such as Claude Desktop, enabling rigorous, repeatable, and transparent LLM evaluation in production or research workflows.
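To make the statistics mentioned above concrete, here is a minimal Python sketch of Fleiss' kappa with a percentile bootstrap confidence interval. It is not Conkurrence's implementation or API: the function names (fleiss_kappa, bootstrap_ci), the item-level resampling scheme, and the sample "pass"/"fail" labels are all assumptions made for illustration, with each model treated as one rater.

```python
import random
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Hypothetical helper for illustration; every item must have the same
    number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_e == 1.0:
        return float("nan")  # only one category used; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


def bootstrap_ci(ratings, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings) for _ in ratings]
        kappa = fleiss_kappa(sample)
        if kappa == kappa:  # drop undefined (NaN) resamples
            stats.append(kappa)
    if not stats:
        raise ValueError("kappa was undefined for every bootstrap resample")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Made-up labels from three hypothetical models on six prompts.
example = [
    ["pass", "pass", "pass"],
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
    ["pass", "fail", "fail"],
    ["pass", "pass", "pass"],
    ["fail", "fail", "pass"],
]
print("kappa:", fleiss_kappa(example))
print("95% CI:", bootstrap_ci(example))
```

Resampling whole items keeps each item's set of ratings together, which is one common way to bootstrap an agreement statistic. With only a handful of items the interval will be wide, so a sketch like this is mainly useful for seeing how the point estimate and its uncertainty relate.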
Visit Conkurrence's official website for product details and to get started.