Methodology

How we score, rank, and compare AI models. Full transparency on our data sources, scoring methodology, and update frequency.

Independence Statement

The AI Resource Hub is fully independent. We are not affiliated with, sponsored by, or funded by any AI provider. We do not accept payment for rankings or reviews. Our goal is to provide accurate, unbiased data to help you make informed decisions.

Data Sources

We maintain a public record of every external source used to build and update this site. If you spot a source that should be added or corrected, please open an issue on our GitHub repository.

Model Pricing

Hourly
  • OpenRouter API — primary pricing source; 500+ models, 8 pricing dimensions (prompt, completion, image, request, cache read/write, web search, internal reasoning). Checked hourly via the scheduled refresh pipeline; a fetch sketch follows this list.
  • OpenAI API Pricing — cross-reference for GPT and o-series models
  • Anthropic Pricing — cross-reference for Claude models
  • Google AI Pricing — cross-reference for Gemini models
  • Mistral Pricing — cross-reference for Mistral models
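
To make the hourly pricing check concrete, here is a minimal sketch of the OpenRouter fetch referenced above. It assumes only OpenRouter's public /api/v1/models endpoint; the example model id and the per-1M-token conversion are illustrative, and production retry and error handling are omitted.

```python
import requests

def fetch_openrouter_pricing(timeout: float = 10.0) -> dict[str, dict]:
    """Pull per-model pricing from OpenRouter's public model listing.

    Returns a mapping of model id -> pricing dict (prompt, completion,
    image, request, cache read/write, ... as strings of USD per unit).
    """
    resp = requests.get("https://openrouter.ai/api/v1/models", timeout=timeout)
    resp.raise_for_status()
    return {m["id"]: m.get("pricing", {}) for m in resp.json()["data"]}

# Example usage (the model id is illustrative and may change upstream):
pricing = fetch_openrouter_pricing()
claude = pricing.get("anthropic/claude-3.5-sonnet")
if claude:
    print(f"prompt ${float(claude['prompt']) * 1e6:.2f}/1M tok, "
          f"completion ${float(claude['completion']) * 1e6:.2f}/1M tok")
```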

Benchmark Scores

As published

We track 17 benchmarks across general intelligence, coding, math, reasoning, safety, agent capabilities, domain-specific, multilingual, and multimodal categories. Scores are sourced from:

  • Official model technical reports and blog posts (OpenAI, Anthropic, Google DeepMind, Meta, Mistral, etc.)
  • Papers With Code — State of the Art — aggregated benchmark leaderboards
  • arXiv preprints — research papers reporting model evaluations
  • HuggingFace Open LLM Leaderboard — standardised evals for open-weight models
  • Stanford HELM — holistic evaluation of language models
  • Epoch AI Benchmarks — benchmark results evaluated internally and collected from external sources, with historical trend data
  • LiveBench — contamination-free benchmark with monthly question refresh; covers math, coding, reasoning, data analysis, instruction following
  • ARC Prize / ARC-AGI — abstract reasoning benchmark measuring fluid intelligence; adopted by Anthropic, Google DeepMind, OpenAI, and xAI in model cards
  • SWE-bench Verified — real-world GitHub issue resolution; the standard coding benchmark until scores saturated around 70% in mid-2025. SWE-bench Pro (Scale AI SEAL, released late 2025) is the harder successor: 1,865 long-horizon tasks across 41 repos, including proprietary commercial codebases; current frontier scores sit around 45%
  • BigCodeBench — code generation benchmark with complex instructions and diverse function calls
  • HumanEval (OpenAI), MATH (Hendrycks et al.), MMLU, GPQA — individual benchmark repositories

Community & Consensus Ratings

As published
  • LMSYS Chatbot Arena — crowdsourced Elo ratings from 5M+ human preference votes; informs the community consensus factor in quality scores
  • Artificial Analysis — real-time speed and quality benchmarks; referenced for speed data and Intelligence Index rankings
  • Scale AI SEAL — expert-driven evaluations and safety benchmarks

Research & Trend Data

Ongoing
  • Epoch AI — largest public database of notable ML models (3,200+ from 1950 to present), with training compute estimates, parameter counts, and training costs; its Notable AI Models database tracks 900+ models chosen for historical significance. CC-BY licensed.
  • Stanford HAI AI Index — annual comprehensive report tracking AI across technical, economic, and societal dimensions; referenced for industry-wide statistics and trends
  • Our World in Data — AI — interactive visualisations of AI model counts, compute growth, and country-level trends (primarily sourced from Epoch AI data)

Safety & Frontier Evaluations

As published

Model Discovery & New Releases

Weekly
  • HuggingFace — Presidentlin's Collections — weekly curated "AI Release Week Thread" collections tracking every notable model, dataset, and paper release across the HuggingFace ecosystem
  • HuggingFace Model Hub — model cards for open-weight models; parameter counts, licences, release dates
  • Official provider announcements and release blogs (OpenAI, Anthropic, Google, Meta, Mistral, xAI, etc.)

Provider & Company Information

As announced

Currently tracking 39 AI providers. Company data is sourced from:

  • Official company websites, about pages, and investor relations sections
  • Press releases, funding announcements, and SEC filings
  • Verified reporting from TechCrunch, Reuters, The Information, Bloomberg, and Financial Times
  • Crunchbase — funding rounds and company details

News & Articles

Hourly

News is aggregated from public APIs and feeds, official provider blogs and newsroom pages, and archived digest imports when available. Sources include:

  • TechCrunch AI, Reuters Technology, The Verge, Ars Technica, VentureBeat AI
  • AI-specialist publications: The Decoder, Import AI (Jack Clark), Interconnects (Nathan Lambert)
  • Provider engineering blogs: OpenAI, Anthropic, Google DeepMind, Meta AI Research
  • arXiv paper announcements (cs.AI, cs.LG, cs.CL)

Updated through the scheduled hourly refresh pipeline, with manual provider-status reruns available when needed.
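
As a rough illustration of that aggregation step, the sketch below pulls and deduplicates items with the feedparser library. The two feed URLs, the dedupe-by-link key, and the naive date sort are illustrative stand-ins, not the production configuration.

```python
import feedparser  # pip install feedparser

# Illustrative feeds; the production pipeline tracks many more sources.
FEEDS = [
    "https://techcrunch.com/category/artificial-intelligence/feed/",
    "https://export.arxiv.org/rss/cs.AI",
]

def refresh_news(feeds: list[str]) -> list[dict]:
    """Fetch each feed and return deduplicated items, newest first."""
    seen, items = set(), []
    for url in feeds:
        for entry in feedparser.parse(url).entries:
            key = entry.get("link") or entry.get("id")
            if key and key not in seen:
                seen.add(key)
                items.append({
                    "title": entry.get("title", ""),
                    "link": key,
                    "published": entry.get("published", ""),
                })
    # Naive sort on the raw date string; real code parses dates properly.
    return sorted(items, key=lambda i: i["published"], reverse=True)
```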

People & Key Figures

As announced
  • Official bios and LinkedIn profiles
  • Wikipedia for public figures with verified career histories
  • X/Twitter profiles for roles and affiliations
  • Published interviews, podcasts, and conference talks

AI Tools Directory

Curated
  • Official product websites and documentation
  • Product Hunt — AI — discovery of new AI tools
  • Editorial curation by The AI Resource Hub team

Leaderboard Views

We separate two related but distinct questions: which models are current, and which models are most heavily benchmarked.

| View | What it answers | How to read it |
| --- | --- | --- |
| Frontier now | Which flagship and newly released models are current right now? | A curated watchlist with status and evidence. Not a synthetic score. |
| Evaluated composite | Which models are currently strongest inside the benchmark-backed scored set? | A weighted score designed to compare models with public evidence, not to replace the frontier watchlist. |

| Evaluated composite factor | Weight / effect | Why it exists |
| --- | --- | --- |
| Normalised benchmark performance | 50% | Still the largest input, but no longer strong enough to let old benchmark history dominate the ranking. |
| Existing quality layer | 25% | Lets broader capability signals still count. |
| Freshness | 25% | A stronger recency signal, so current generations beat stale leaders unless the evidence gap is overwhelming. |
| Evidence multiplier | Penalty, not bonus | Thin public evidence is penalised until coverage improves. |
| Provider age penalty | Penalty, not bonus | Older provider generations lose ground when a newer public generation exists. |
| Variant penalty | Penalty, not bonus | Compact and preview variants are stopped from reading like the obvious overall best model. |
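
Read as an algorithm, the table above is a weighted base score followed by multiplicative penalties that can only lower the result. The sketch below uses the published 50/25/25 weights; the specific penalty curves (evidence floor, age and variant discounts) are illustrative assumptions, not our exact coefficients.

```python
def evaluated_composite(
    benchmark: float,       # normalised benchmark performance, 0-100
    quality: float,         # existing quality layer, 0-100
    freshness: float,       # recency signal, 0-100
    evidence_ratio: float,  # share of tracked benchmarks with public scores, 0-1
    is_superseded: bool,    # a newer public generation exists from this provider
    is_variant: bool,       # compact or preview variant of a flagship
) -> float:
    """Weighted base score, then penalty factors capped at 1.0 (never bonuses)."""
    base = 0.50 * benchmark + 0.25 * quality + 0.25 * freshness

    evidence = min(1.0, 0.5 + 0.5 * evidence_ratio)  # thin evidence: up to -50%
    age = 0.85 if is_superseded else 1.0             # illustrative -15%
    variant = 0.90 if is_variant else 1.0            # illustrative -10%

    return base * evidence * age * variant
```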

Quality Score

Each model receives a quality score from 0 to 100, based on a weighted combination of factors:

| Factor | Weight | Source |
| --- | --- | --- |
| Benchmark performance (average across available benchmarks) | 40% | MMLU, GPQA, HumanEval, MATH, etc. |
| Capability breadth (modalities, tools, features) | 20% | Official documentation |
| Context window and output capacity | 15% | API specifications |
| Community and expert consensus | 15% | Arena Elo, community feedback |
| Recency and update status | 10% | Release date, active status |
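
In code, this is a plain weighted average over factors that have already been normalised to 0-100; how each factor is normalised is the editorial part and is not shown. A minimal sketch:

```python
QUALITY_WEIGHTS = {
    "benchmarks": 0.40,  # average across available benchmark scores
    "breadth":    0.20,  # modalities, tools, features
    "context":    0.15,  # context window and output capacity
    "consensus":  0.15,  # Arena Elo, community feedback
    "recency":    0.10,  # release date, active status
}

def quality_score(factors: dict[str, float]) -> float:
    """Weighted 0-100 quality score; each factor is pre-normalised to 0-100."""
    assert abs(sum(QUALITY_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(QUALITY_WEIGHTS[f] * factors[f] for f in QUALITY_WEIGHTS)

# Example: strong benchmarks, modest context window.
print(round(quality_score({
    "benchmarks": 88, "breadth": 75, "context": 60,
    "consensus": 80, "recency": 90,
}), 1))  # -> 80.2
```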

Value Score

The value score measures quality-per-dollar — how much capability you get for your money. Calculated as:

Value Score = (Quality Score / Blended Price) × 10

Where blended price uses a 3:1 output-to-input weighting: (input_price + 3 × output_price) / 4. This reflects typical API usage patterns, where output tokens cost more per token and account for roughly 75% of total spend.
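
The two formulas compose directly. A minimal sketch, with prices expressed in USD per 1M tokens (any consistent unit works):

```python
def blended_price(input_price: float, output_price: float) -> float:
    """3:1 output-to-input blend, e.g. in USD per 1M tokens."""
    return (input_price + 3 * output_price) / 4

def value_score(quality: float, input_price: float, output_price: float) -> float:
    """Quality-per-dollar: (quality score / blended price) x 10."""
    return quality / blended_price(input_price, output_price) * 10

# Example: quality 85 at $3/1M input and $15/1M output.
# blended = (3 + 3 * 15) / 4 = 12.0 -> value = 85 / 12 * 10 ≈ 70.8
print(round(value_score(85, 3.0, 15.0), 1))  # -> 70.8
```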

Update Frequency

Hourly (Automated)

  • API pricing from OpenRouter + provider pages
  • Speed & time-to-first-token (TTFT) data from Artificial Analysis
  • Benchmark score validation
  • Cross-validation: provider vs OpenRouter pricing (see the sketch after this list)
  • Data staleness report
  • Live news refresh from public RSS feeds and official blogs
  • Site rebuild and deployment
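
The cross-validation step reduces to comparing two price feeds for the same model and flagging divergence. In the sketch below, the 1% relative tolerance and the example model ids and prices are illustrative:

```python
def cross_validate_prices(
    openrouter: dict[str, float],
    provider: dict[str, float],
    rel_tolerance: float = 0.01,  # illustrative 1% threshold
) -> list[str]:
    """Return model ids whose OpenRouter and provider-page prices diverge."""
    mismatches = []
    for model_id, or_price in openrouter.items():
        official = provider.get(model_id)
        if official is None:
            continue  # not cross-listed; the staleness report covers gaps
        if abs(or_price - official) > rel_tolerance * max(or_price, official):
            mismatches.append(model_id)
    return mismatches

# Example: one matching and one diverging entry (prices per 1M input tokens).
print(cross_validate_prices(
    {"gpt-4o": 2.50, "claude-sonnet": 3.30},
    {"gpt-4o": 2.50, "claude-sonnet": 3.00},
))  # -> ['claude-sonnet']
```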

As Published

  • New model releases
  • Benchmark scores
  • Provider information changes
  • People, tools, and content

Limitations

  • Benchmarks have known issues — data contamination, narrow evaluation, and saturation effects mean no single benchmark tells the full story. We use multiple benchmarks and flag known issues.
  • Pricing changes frequently — AI model pricing can change without notice. We refresh hourly, but a provider page or upstream cache can still lag briefly before the site catches up.
  • Speed data is limited — we track relative speed ratings rather than real-time latency measurements. For production speed data, we recommend Artificial Analysis.
  • Coverage is curated — we focus on models available via API and major consumer products. Some niche or regional models may not be listed.
  • Quality scores are editorial — while based on benchmark data, final quality scores involve editorial judgement. See our scoring breakdown above for transparency.

Found an Error?

If you spot incorrect data, outdated pricing, or missing models, please open an issue on our GitHub repository. We take accuracy seriously and aim to publish fixes within 24 hours of verifying a correction.

For a complete list of all data sources, academic citations, and compliance details, see our References & Sources page and our Collection Policy.