Tool stack

Built-in utilities plus a curated map of the wider AI tool market.

Use this page for three things: the working tools already built into this site, the official tokenizer and token-counting resources from major labs, and a curated shortlist of important external AI products.

Benchmark trackers and eval resources

External sources worth checking when you want deeper benchmark explorers, repeated agent measurements, and degradation tracking.

Open benchmark desk

Factuality and hallucination

SimpleQA

OpenAI publication + open benchmark

Short-form factuality benchmark for fact-seeking questions with single verifiable answers. Useful for measuring whether a model answers accurately instead of confidently guessing.

Strong fit for the Reliability Floor factuality dimension, but too narrow to stand alone as a general reliability score.

  • Targets hallucination on short factual questions
  • Includes correct, incorrect, and not-attempted grading
  • Useful for calibration and abstention analysis
  • OpenAI reports the benchmark was designed to remain challenging for frontier models
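
The three-way grading above can be turned into simple rates for calibration and abstention analysis. The following is a minimal sketch, not the official SimpleQA scoring code: the label strings, the function name `summarize_grades`, and the choice of reported rates are assumptions made for illustration.

```python
from collections import Counter

# Label strings are assumptions; they mirror the three outcomes named above.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize_grades(grades):
    """Return simple rates from a list of per-question grade labels."""
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to summarize")
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Share of all questions answered correctly.
        "correct_rate": counts["correct"] / total,
        # Share of questions where the model declined to answer.
        "not_attempted_rate": counts["not_attempted"] / total,
        # Accuracy restricted to questions the model actually attempted.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }


example = ["correct", "correct", "incorrect", "not_attempted"]
summary = summarize_grades(example)
```

Comparing `correct_rate` with `correct_given_attempted` separates a model that abstains when unsure from one that guesses confidently.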

Grounded long-form factuality

FACTS Grounding

Google DeepMind benchmark + Kaggle leaderboard

Benchmark for whether long-form answers stay faithful to supplied documents while still addressing the user request.

Good fit for the Reliability Floor groundedness dimension because it tests source-grounded answers rather than broad world knowledge alone.

  • Long-form grounded response evaluation
  • Public and private evaluation sets
  • Documents span finance, technology, retail, medicine, and law
  • Uses separate eligibility and factual-grounding judgment phases
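
One way to picture the two judging phases is as two independent verdicts per response that are combined into a single score. The sketch below assumes a response only counts when it passes both phases; the `JudgedResponse` class and `grounding_score` function are hypothetical names, and the benchmark's actual aggregation may differ.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    eligible: bool  # phase 1: does the answer address the user request?
    grounded: bool  # phase 2: is the answer supported by the supplied document?


def grounding_score(responses):
    """Fraction of responses that pass both phases (assumed combination rule)."""
    if not responses:
        raise ValueError("no responses to score")
    passed = sum(1 for r in responses if r.eligible and r.grounded)
    return passed / len(responses)


judged = [
    JudgedResponse(eligible=True, grounded=True),
    JudgedResponse(eligible=True, grounded=False),   # addresses request, unsupported claims
    JudgedResponse(eligible=False, grounded=True),   # faithful but does not answer
    JudgedResponse(eligible=True, grounded=True),
]
score = grounding_score(judged)
```

Keeping the phases separate stops a model from scoring well by giving a faithful but unhelpful answer.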

Instruction following

IFEval

Paper + Google Research code/data

Verifiable instruction-following benchmark covering constraints such as word counts, required keywords, and formatting rules.

Useful because it avoids purely subjective judging for a core reliability question: did the model do exactly what it was told?

  • Around 500 prompts
  • 25 verifiable instruction types
  • Objective checks for many prompt constraints
  • Good first source for instruction-following sub-scores
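
Verifiable constraints of the kind listed above can be checked with plain code rather than a judge model. The following is an illustrative sketch only: the three checkers and the `evaluate` helper are hypothetical and are not taken from the IFEval codebase.

```python
import re


def check_min_words(response, min_words):
    """True if the response has at least `min_words` whitespace-separated tokens."""
    return len(response.split()) >= min_words


def check_keywords(response, keywords):
    """True if every required keyword appears (case-insensitive substring match)."""
    lowered = response.lower()
    return all(k.lower() in lowered for k in keywords)


def check_bullet_count(response, expected):
    """True if the response has exactly `expected` lines starting with '-' or '*'."""
    bullets = [ln for ln in response.splitlines() if re.match(r"^\s*[-*]\s+", ln)]
    return len(bullets) == expected


def evaluate(response, checks):
    """Run each (checker, argument) pair and report per-instruction and overall results."""
    results = [fn(response, arg) for fn, arg in checks]
    return {"per_instruction": results, "all_followed": all(results)}


response = "- Tokens are counted before billing.\n- Context limits apply to tokens."
report = evaluate(
    response,
    [
        (check_min_words, 8),
        (check_keywords, ["tokens", "context"]),
        (check_bullet_count, 2),
    ],
)
```

Reporting both the per-instruction results and the all-or-nothing flag gives a lenient and a strict view of the same response.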

Long-context retrieval

RULER

Paper + open-source code

Long-context benchmark that extends needle-in-haystack into harder retrieval, multi-hop tracing, aggregation, and question-answering tasks.

Use as a retrieval stress test, not as proof of real-world long-context reliability by itself.

  • 13 representative long-context tasks
  • Tests behavior beyond simple literal retrieval
  • Configurable sequence length and task complexity
  • Exposes failures as context length increases
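
The basic needle-in-haystack setup that RULER builds on can be sketched in a few lines. This is a simplified illustration under stated assumptions, not RULER's generator: the filler text, the needle sentence template, and the function names `build_haystack` and `score_retrieval` are all hypothetical.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow."


def build_haystack(num_filler, needles, seed=0):
    """Return filler text with one key-value 'needle' sentence per entry in `needles`.

    `num_filler` controls context length; more needles raise task complexity.
    """
    rng = random.Random(seed)  # seeded so the same prompt can be regenerated
    sentences = [FILLER] * num_filler
    for key, value in needles.items():
        position = rng.randint(0, len(sentences))
        sentences.insert(position, f"The special magic number for {key} is {value}.")
    return " ".join(sentences)


def score_retrieval(answer, needles):
    """Fraction of needle values that appear verbatim in the model's answer."""
    found = sum(1 for value in needles.values() if str(value) in answer)
    return found / len(needles)


needles = {"alpha": 4821, "beta": 9035}
context = build_haystack(num_filler=50, needles=needles)
full = score_retrieval("The numbers are 4821 and 9035.", needles)
partial = score_retrieval("I only found 4821.", needles)
```

Running the same needles at increasing `num_filler` values is one way to see where retrieval starts to fail as context grows.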

Real long-context tasks

HELMET

Hugging Face benchmark + paper

Holistic long-context evaluation suite with more application-like tasks including retrieval-augmented generation, citations, summarization, and reranking.

Useful counterweight to simple needle tests because the authors explicitly warn that synthetic retrieval tasks are poor predictors of real downstream performance.

  • Evaluates diverse long-context applications
  • Controls input length and task complexity
  • Includes RAG, citation, summarization, and reranking-style tasks
  • Shows that scores across task categories do not always correlate with each other

RAG factual QA

CRAG

NeurIPS paper + GitHub

Comprehensive RAG benchmark for dynamic factual question answering with web and knowledge-graph search simulations.

Best treated as a RAG-system reliability source, because scores depend on retrieval setup as well as the base model.

  • 4,409 factual QA pairs
  • Five domains and eight question categories
  • Covers popularity and temporal dynamism
  • Measures hallucination pressure in retrieval-augmented answers

Long-term memory

LongMemEval

OpenReview paper

Benchmark for long-term assistant memory across sustained interactions, including extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

Useful for the Reliability Floor memory dimension, but memory claims need especially careful evidence labels and clearly stated limitations.

  • 500 curated questions
  • Scalable user-assistant chat histories
  • Tests five core long-term memory abilities
  • Separates memory design choices across indexing, retrieval, and reading stages

Fresh objective tasks

LiveBench

Leaderboard + paper + GitHub

Frequently updated objective benchmark covering math, coding, reasoning, language, instruction following, and data analysis.

Useful for fresh-data discipline and objective grading; only the relevant categories should contribute to the Reliability Floor.

  • Frequently updated questions
  • Objective ground-truth scoring
  • Includes instruction following and data analysis categories
  • Designed to reduce contamination risk

Holistic model evaluation

HELM

Stanford CRFM benchmark + framework

Evaluation framework and benchmark collection designed to make model comparisons more transparent across scenarios, metrics, prompts, and adapters.

Useful as a provenance model for the Reliability Floor because it treats prompt, harness, scenario, and metric choices as part of the result rather than invisible background detail.

  • Explicit scenario and metric structure
  • Records model, adapter, prompt, and metric details
  • Covers multiple capability and risk dimensions
  • Good reference model for transparent eval reporting
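
Treating harness choices as part of the result can be as simple as storing them next to the score. The record below is a hedged sketch of that idea: the `EvalRecord` class, its field names, and the example values are hypothetical and are not HELM's schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One score together with the choices that produced it (illustrative fields)."""
    model: str
    scenario: str
    adapter: str          # how the scenario was turned into a prompt
    prompt_template: str
    metric: str
    value: float


record = EvalRecord(
    model="example-model-v1",             # placeholder name, not a real model
    scenario="short_form_factual_qa",     # placeholder scenario label
    adapter="zero_shot_generation",
    prompt_template="Question: {question}\nAnswer:",
    metric="exact_match",
    value=0.71,                           # made-up number for illustration
)

# Sorted keys keep the serialized form stable, so two records are easy to diff.
serialized = json.dumps(asdict(record), sort_keys=True)
```

Two scores are only comparable when every field other than `model` and `value` matches, and a record like this makes that check mechanical.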

Agent evals and degradation tracking

MarginLab

Public site + docs + GitHub

Open benchmark and eval ecosystem focused on robust, reproducible agent testing, with public degradation trackers and historical views for coding agents.

Useful when you want repeated agent and harness measurement rather than a single static benchmark snapshot.

  • Degradation Tracker for agent performance drift
  • Historical performance views over time
  • Benchmark explorers including SWE-Bench Pro and Terminal-Bench 2.0
  • Open-source eval runtime that tracks accuracy, tokens, duration, and traces
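
Repeated measurement makes drift detectable with very little code. The sketch below compares a recent window of pass rates against an earlier baseline; it is not MarginLab's method, and the function name `detect_degradation`, the window sizes, and the tolerance are assumptions chosen for illustration.

```python
from statistics import mean


def detect_degradation(pass_rates, baseline_runs=5, recent_runs=3, tolerance=0.05):
    """Flag a drop when the recent mean pass rate falls below the baseline mean.

    `pass_rates` is a chronological list of pass rates in [0, 1], one per run.
    """
    if len(pass_rates) < baseline_runs + recent_runs:
        raise ValueError("not enough runs for separate baseline and recent windows")
    baseline = mean(pass_rates[:baseline_runs])
    recent = mean(pass_rates[-recent_runs:])
    drop = baseline - recent
    return {
        "baseline": baseline,
        "recent": recent,
        "drop": drop,
        "degraded": drop > tolerance,
    }


# Made-up history: stable around 0.61, then a fall to about 0.51.
history = [0.62, 0.60, 0.61, 0.63, 0.59, 0.60, 0.52, 0.50, 0.51]
result = detect_degradation(history)
stable = detect_degradation([0.60] * 8)
```

A fixed tolerance is the simplest possible rule; a real tracker would also need to account for run-to-run noise before calling a drop a regression.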

Choosing the right tool

  • Most AI chatbots offer free tiers, so it is worth trying several before paying for one as your default.
  • For coding, Cursor and Claude Code are strong starting points if you want fast repo-aware assistance.
  • For local models, start with LM Studio or Ollama before moving to lower-level runtimes like llama.cpp or MLX.
  • Use official token-counting tools when prompt cost or context-window fit matters.
  • Check our LLM comparison for detailed pricing and benchmark data.