Methodology

How we score, rank, and compare AI models. Full transparency on our data sources, scoring methodology, and update frequency.

Independence Statement

The AI Resource Hub is fully independent. We are not affiliated with, sponsored by, or funded by any AI provider. We do not accept payment for rankings or reviews. Our goal is to provide accurate, unbiased data to help you make informed decisions.

Data Sources

We maintain a public record of every external source used to build and update this site. If you spot a source that should be added or corrected, please open an issue on our GitHub repository.

Model Pricing

Hourly
  • OpenRouter API — primary pricing source; 500+ models, 8 pricing dimensions (prompt, completion, image, request, cache read/write, web search, internal reasoning). Checked hourly via the scheduled refresh pipeline; a fetch sketch follows this list.
  • OpenAI API Pricing — cross-reference for GPT and o-series models
  • Anthropic Pricing — cross-reference for Claude models
  • Google AI Pricing — cross-reference for Gemini models
  • Mistral Pricing — cross-reference for Mistral models
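
To make the hourly pricing check concrete, here is a minimal sketch of the OpenRouter fetch referenced above. It assumes only OpenRouter's public /api/v1/models endpoint; the example model id and the per-1M-token conversion are illustrative, and production retry and error handling are omitted.

```python
import requests

def fetch_openrouter_pricing(timeout: float = 10.0) -> dict[str, dict]:
    """Pull per-model pricing from OpenRouter's public model listing.

    Returns a mapping of model id -> pricing dict (prompt, completion,
    image, request, cache read/write, ... as strings of USD per unit).
    """
    resp = requests.get("https://openrouter.ai/api/v1/models", timeout=timeout)
    resp.raise_for_status()
    return {m["id"]: m.get("pricing", {}) for m in resp.json()["data"]}

# Example usage (the model id is illustrative and may change upstream):
pricing = fetch_openrouter_pricing()
claude = pricing.get("anthropic/claude-3.5-sonnet")
if claude:
    print(f"prompt ${float(claude['prompt']) * 1e6:.2f}/1M tok, "
          f"completion ${float(claude['completion']) * 1e6:.2f}/1M tok")
```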

Benchmark Scores

As published

We track 17 benchmarks across general intelligence, coding, math, reasoning, safety, agent capabilities, domain-specific, multilingual, and multimodal categories. Scores are sourced from:

  • Official model technical reports and blog posts (OpenAI, Anthropic, Google DeepMind, Meta, Mistral, etc.)
  • Papers With Code — State of the Art — aggregated benchmark leaderboards
  • arXiv preprints — research papers reporting model evaluations
  • HuggingFace Open LLM Leaderboard — standardised evals for open-weight models
  • Stanford HELM — holistic evaluation of language models
  • Epoch AI Benchmarks — benchmark results evaluated internally and collected from external sources, with historical trend data
  • LiveBench — contamination-free benchmark with monthly question refresh; covers math, coding, reasoning, data analysis, instruction following
  • ARC Prize / ARC-AGI — abstract reasoning benchmark measuring fluid intelligence; adopted by Anthropic, Google DeepMind, OpenAI, and xAI in model cards
  • SWE-bench Verified — real-world GitHub issue resolution; the standard coding benchmark until scores saturated around 70% in mid-2025. SWE-bench Pro (Scale AI SEAL, released late 2025) is the harder successor: 1,865 long-horizon tasks across 41 repos, including proprietary commercial codebases; current frontier scores sit around 45%
  • BigCodeBench — code generation benchmark with complex instructions and diverse function calls
  • HumanEval (OpenAI), MATH (Hendrycks et al.), MMLU, GPQA — individual benchmark repositories

Community & Consensus Ratings

As published
  • LMSYS Chatbot Arena — crowdsourced Elo ratings from 5M+ human preference votes; informs the community consensus factor in quality scores
  • Artificial Analysis — real-time speed and quality benchmarks; referenced for speed data and Intelligence Index rankings
  • Scale AI SEAL — expert-driven evaluations and safety benchmarks

Research & Trend Data

Ongoing
  • Epoch AI — largest public database of notable ML models (3,200+ from 1950 to present), with training compute estimates, parameter counts, and training costs; its Notable AI Models database tracks 900+ models chosen for historical significance. CC-BY licensed.
  • Stanford HAI AI Index — annual comprehensive report tracking AI across technical, economic, and societal dimensions; referenced for industry-wide statistics and trends
  • Our World in Data — AI — interactive visualisations of AI model counts, compute growth, and country-level trends (primarily sourced from Epoch AI data)

Safety & Frontier Evaluations

As published

Model Discovery & New Releases

Weekly
  • HuggingFace — Presidentlin's Collections — weekly curated "AI Release Week Thread" collections tracking every notable model, dataset, and paper release across the HuggingFace ecosystem
  • HuggingFace Model Hub — model cards for open-weight models; parameter counts, licences, release dates
  • Official provider announcements and release blogs (OpenAI, Anthropic, Google, Meta, Mistral, xAI, etc.)

Provider & Company Information

As announced

Currently tracking 39 AI providers. Company data is sourced from:

  • Official company websites, about pages, and investor relations sections
  • Press releases, funding announcements, and SEC filings
  • Verified reporting from TechCrunch, Reuters, The Information, Bloomberg, and Financial Times
  • Crunchbase — funding rounds and company details

News & Articles

Hourly

News is aggregated from public APIs and feeds, official provider blogs and newsroom pages, and archived digest imports when available. Sources include:

  • TechCrunch AI, Reuters Technology, The Verge, Ars Technica, VentureBeat AI
  • AI-specialist publications: The Decoder, Import AI (Jack Clark), Interconnects (Nathan Lambert)
  • Provider engineering blogs: OpenAI, Anthropic, Google DeepMind, Meta AI Research
  • arXiv paper announcements (cs.AI, cs.LG, cs.CL)

Updated through the scheduled hourly refresh pipeline, with manual provider-status reruns available when needed.
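
As a rough illustration of that aggregation step, the sketch below pulls and deduplicates items with the feedparser library. The two feed URLs, the dedupe-by-link key, and the naive date sort are illustrative stand-ins, not the production configuration.

```python
import feedparser  # pip install feedparser

# Illustrative feeds; the production pipeline tracks many more sources.
FEEDS = [
    "https://techcrunch.com/category/artificial-intelligence/feed/",
    "https://export.arxiv.org/rss/cs.AI",
]

def refresh_news(feeds: list[str]) -> list[dict]:
    """Fetch each feed and return deduplicated items, newest first."""
    seen, items = set(), []
    for url in feeds:
        for entry in feedparser.parse(url).entries:
            key = entry.get("link") or entry.get("id")
            if key and key not in seen:
                seen.add(key)
                items.append({
                    "title": entry.get("title", ""),
                    "link": key,
                    "published": entry.get("published", ""),
                })
    # Naive sort on the raw date string; real code parses dates properly.
    return sorted(items, key=lambda i: i["published"], reverse=True)
```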

People & Key Figures

As announced
  • Official bios and LinkedIn profiles
  • Wikipedia for public figures with verified career histories
  • X/Twitter profiles for roles and affiliations
  • Published interviews, podcasts, and conference talks

AI Tools Directory

Curated
  • Official product websites and documentation
  • Product Hunt — AI — discovery of new AI tools
  • Editorial curation by The AI Resource Hub team

Leaderboard Views

We separate two related but distinct questions: which models are current, and which models are most heavily benchmarked.

| View | What it answers | How to read it |
| --- | --- | --- |
| Frontier now | Which flagship and newly released models are current right now? | A curated watchlist with status and evidence. Not a synthetic score. |
| Evaluated composite | Which models are currently strongest inside the benchmark-backed scored set? | A weighted score designed to compare models with public evidence, not to replace the frontier watchlist. |

| Evaluated composite factor | Weight / effect | Why it exists |
| --- | --- | --- |
| Normalised benchmark performance | 50% | Still the largest input, but no longer strong enough to let old benchmark history dominate the ranking. |
| Existing quality layer | 25% | Lets broader capability signals still count. |
| Freshness | 25% | A stronger recency signal, so current generations beat stale leaders unless the evidence gap is overwhelming. |
| Evidence multiplier | Penalty, not bonus | Thin public evidence is penalised until coverage improves. |
| Provider age penalty | Penalty, not bonus | Older provider generations lose ground when a newer public generation exists. |
| Variant penalty | Penalty, not bonus | Compact and preview variants are stopped from reading like the obvious overall best model. |
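
Read as an algorithm, the table above is a weighted base score followed by multiplicative penalties that can only lower the result. The sketch below uses the published 50/25/25 weights; the specific penalty curves (evidence floor, age and variant discounts) are illustrative assumptions, not our exact coefficients.

```python
def evaluated_composite(
    benchmark: float,       # normalised benchmark performance, 0-100
    quality: float,         # existing quality layer, 0-100
    freshness: float,       # recency signal, 0-100
    evidence_ratio: float,  # share of tracked benchmarks with public scores, 0-1
    is_superseded: bool,    # a newer public generation exists from this provider
    is_variant: bool,       # compact or preview variant of a flagship
) -> float:
    """Weighted base score, then penalty factors capped at 1.0 (never bonuses)."""
    base = 0.50 * benchmark + 0.25 * quality + 0.25 * freshness

    evidence = min(1.0, 0.5 + 0.5 * evidence_ratio)  # thin evidence: up to -50%
    age = 0.85 if is_superseded else 1.0             # illustrative -15%
    variant = 0.90 if is_variant else 1.0            # illustrative -10%

    return base * evidence * age * variant
```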

Quality Score

Each model receives a quality score from 0 to 100, based on a weighted combination of factors:

| Factor | Weight | Source |
| --- | --- | --- |
| Benchmark performance (average across available benchmarks) | 40% | MMLU, GPQA, HumanEval, MATH, etc. |
| Capability breadth (modalities, tools, features) | 20% | Official documentation |
| Context window and output capacity | 15% | API specifications |
| Community and expert consensus | 15% | Arena Elo, community feedback |
| Recency and update status | 10% | Release date, active status |
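
In code, this is a plain weighted average over factors that have already been normalised to 0-100; how each factor is normalised is the editorial part and is not shown. A minimal sketch:

```python
QUALITY_WEIGHTS = {
    "benchmarks": 0.40,  # average across available benchmark scores
    "breadth":    0.20,  # modalities, tools, features
    "context":    0.15,  # context window and output capacity
    "consensus":  0.15,  # Arena Elo, community feedback
    "recency":    0.10,  # release date, active status
}

def quality_score(factors: dict[str, float]) -> float:
    """Weighted 0-100 quality score; each factor is pre-normalised to 0-100."""
    assert abs(sum(QUALITY_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(QUALITY_WEIGHTS[f] * factors[f] for f in QUALITY_WEIGHTS)

# Example: strong benchmarks, modest context window.
print(round(quality_score({
    "benchmarks": 88, "breadth": 75, "context": 60,
    "consensus": 80, "recency": 90,
}), 1))  # -> 80.2
```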

Value Score

The value score measures quality-per-dollar — how much capability you get for your money. Calculated as:

Value Score = (Quality Score / Blended Price) × 10

Where blended price uses a 3:1 output-to-input weighting: (input_price + 3 × output_price) / 4. This reflects typical API usage patterns, where output tokens cost more per token and account for roughly 75% of total spend.
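
The two formulas compose directly. A minimal sketch, with prices expressed in USD per 1M tokens (any consistent unit works):

```python
def blended_price(input_price: float, output_price: float) -> float:
    """3:1 output-to-input blend, e.g. in USD per 1M tokens."""
    return (input_price + 3 * output_price) / 4

def value_score(quality: float, input_price: float, output_price: float) -> float:
    """Quality-per-dollar: (quality score / blended price) x 10."""
    return quality / blended_price(input_price, output_price) * 10

# Example: quality 85 at $3/1M input and $15/1M output.
# blended = (3 + 3 * 15) / 4 = 12.0 -> value = 85 / 12 * 10 ≈ 70.8
print(round(value_score(85, 3.0, 15.0), 1))  # -> 70.8
```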

Update Frequency

Hourly (Automated)

  • API pricing from OpenRouter + provider pages
  • Speed & time-to-first-token (TTFT) data from Artificial Analysis
  • Benchmark score validation
  • Cross-validation: provider vs OpenRouter pricing (see the sketch after this list)
  • Data staleness report
  • Live news refresh from public RSS feeds and official blogs
  • Site rebuild and deployment
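
The cross-validation step reduces to comparing two price feeds for the same model and flagging divergence. In the sketch below, the 1% relative tolerance and the example model ids and prices are illustrative:

```python
def cross_validate_prices(
    openrouter: dict[str, float],
    provider: dict[str, float],
    rel_tolerance: float = 0.01,  # illustrative 1% threshold
) -> list[str]:
    """Return model ids whose OpenRouter and provider-page prices diverge."""
    mismatches = []
    for model_id, or_price in openrouter.items():
        official = provider.get(model_id)
        if official is None:
            continue  # not cross-listed; the staleness report covers gaps
        if abs(or_price - official) > rel_tolerance * max(or_price, official):
            mismatches.append(model_id)
    return mismatches

# Example: one matching and one diverging entry (prices per 1M input tokens).
print(cross_validate_prices(
    {"gpt-4o": 2.50, "claude-sonnet": 3.30},
    {"gpt-4o": 2.50, "claude-sonnet": 3.00},
))  # -> ['claude-sonnet']
```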

As Published

  • New model releases
  • Benchmark scores
  • Provider information changes
  • People, tools, and content

Limitations

  • Benchmarks have known issues — data contamination, narrow evaluation, and saturation effects mean no single benchmark tells the full story. We use multiple benchmarks and flag known issues.
  • Pricing changes frequently — AI model pricing can change without notice. We refresh hourly, but a provider page or upstream cache can still lag briefly before the site catches up.
  • Speed data is limited — we track relative speed ratings rather than real-time latency measurements. For production speed data, we recommend Artificial Analysis.
  • Coverage is curated — we focus on models available via API and major consumer products. Some niche or regional models may not be listed.
  • Quality scores are editorial — while based on benchmark data, final quality scores involve editorial judgement. See our scoring breakdown above for transparency.

Found an Error?

If you spot incorrect data, outdated pricing, or missing models, please open an issue on our GitHub repository. We take accuracy seriously and aim to publish fixes within 24 hours of verifying a correction.

For a complete list of all data sources, academic citations, and compliance details, see our References & Sources page and our Collection Policy.