References & Sources

Every data point on this site has a source. This page lists all external data sources, academic papers, and benchmark methodologies we reference. See our methodology page for how we score and rank models, and our collection policy for what we will and will not ingest.

Data Sources

We refresh data through a scheduled hourly pipeline, with manual provider-status reruns available when needed. Collection uses identified User-Agent strings and is limited to public APIs, public feeds, official blogs and newsroom pages, documentation pages, and other clearly public endpoints that fit the site brief.
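
Purely as an illustration of that setup, the sketch below shows what such a collection configuration could look like. The dictionary, its keys, and its values are hypothetical examples, not an excerpt from the real pipeline.

```python
# Hypothetical collection settings mirroring the cadence and source rules described above.
COLLECTION_CONFIG = {
    "schedule": "0 * * * *",               # hourly cron expression
    "manual_reruns": ["provider-status"],  # on-demand rerun mentioned above
    "user_agent": "The-AI-Resource-Hub-Bot/1.0",
    "allowed_sources": [
        "public_api",
        "rss_atom_feed",
        "official_blog",
        "newsroom_page",
        "documentation_page",
    ],
}
```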

Model Pricing & Availability

Refresh cadence: hourly

Primary pricing source; 500+ models with 8 pricing dimensions. No authentication required.

Official GPT and o-series model pricing.

Official Claude model pricing.

Official Gemini model pricing.

Official Mistral model pricing.

Official DeepSeek model pricing.

Official Grok model pricing.

Benchmark Scores

Refresh cadence: as published

Crowdsourced Elo ratings from 5M+ human preference votes.

Aggregated benchmark leaderboards across ML tasks.

Standardised evaluations for open-weight models.

Holistic Evaluation of Language Models.

Benchmark results with historical trend data.

Contamination-free benchmark with monthly question refresh.

Abstract reasoning benchmark measuring fluid intelligence.

Real-world GitHub issue resolution benchmark.

1,865 long-horizon tasks across 41 repos; harder successor to SWE-bench.

Code generation benchmark with complex instructions.

Independent speed and quality benchmarks; source for TTFT and output speed data.

Expert-driven evaluations and safety benchmarks.

Speed & Latency

Refresh cadence: hourly

Primary source for TTFT (time to first token) and output speed measurements.

Provider benchmarks

Official performance data published by OpenAI, Anthropic, Google, and others.
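
For readers unfamiliar with these metrics, the sketch below shows one common way to derive TTFT and output speed from a streamed response. The convention used here (tokens after the first divided by generation time) is an illustrative assumption, not the exact methodology of any source listed above.

```python
import time

def measure_stream(stream):
    """Return (ttft_seconds, output_tokens_per_second) for an iterable of token events.

    TTFT is the time from request start to the first token; output speed is the
    number of tokens after the first divided by the time taken to generate them.
    """
    start = time.perf_counter()
    first_token_at = None
    token_count = 0
    for _token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1
    end = time.perf_counter()
    if first_token_at is None:
        return None, None  # stream produced no tokens
    ttft = first_token_at - start
    gen_time = end - first_token_at
    speed = (token_count - 1) / gen_time if token_count > 1 and gen_time > 0 else None
    return ttft, speed

# Demo with a simulated stream of five tokens arriving 50 ms apart.
def fake_stream():
    for _ in range(5):
        time.sleep(0.05)
        yield "tok"

print(measure_stream(fake_stream()))
```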

Research & Trend Data

Refresh cadence: ongoing

Largest public database of notable ML models (3,200+ from 1950–present). Training compute estimates, parameter counts, training costs. CC-BY licensed.

Annual comprehensive report tracking AI across technical, economic, and societal dimensions.

Interactive visualisations of AI model counts, compute growth, and country-level trends.

Safety & Frontier Evaluations

Refresh cadence: as published

Model Evaluation & Threat Research. Pre-deployment evaluator for frontier models; publishes RE-Bench.

Evaluated 30+ frontier models across cyber, biology, and autonomy domains.

AI safety evaluations focused on scheming and deception detection.

Model Discovery & News

Refresh cadence: hourly

Model cards for open-weight models; parameter counts, licences, release dates.

Research papers in cs.AI, cs.LG, and cs.CL.

Industry news and analysis.

AI industry reporting and product coverage.

Consumer and platform coverage from a public AI-specific feed.

Technology reporting filtered for AI-relevant coverage.

Provider blogs

Official blogs and newsroom pages from OpenAI, Anthropic, Google, and other major labs.

Academic Papers & Citations

Research papers referenced in our benchmark scores, blog posts, guides, and model evaluations, grouped by category.

Foundational Research

Attention Is All You Need

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). NeurIPS 2017.

The Transformer architecture paper.

Improving Language Understanding by Generative Pre-Training

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. (2018). OpenAI.

GPT-1.

Language Models are Unsupervised Multitask Learners

Radford, A., Wu, J., Child, R., et al. (2019). OpenAI.

GPT-2.

Language Models are Few-Shot Learners

Brown, T., Mann, B., Ryder, N., et al. (2020). NeurIPS 2020.

GPT-3; introduced in-context learning.

Training language models to follow instructions with human feedback

Ouyang, L., Wu, J., Jiang, X., et al. (2022). NeurIPS 2022.

InstructGPT / RLHF paper.

Benchmark Methodologies

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., et al. (2021). ICLR 2021.

MMLU benchmark.

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, D., Hou, B.L., Stickland, A.C., et al. (2023). arXiv.

GPQA benchmark; expert-level questions.

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., et al. (2021). arXiv.

HumanEval benchmark for code generation.
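
The HumanEval paper also introduced the unbiased pass@k estimator that most code-generation results reported here rely on. The function below is our own transcription of that formula, not code taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from the HumanEval paper: 1 - C(n - c, k) / C(n, k),
    where n samples were drawn for a problem and c of them passed the tests."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any draw of k must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 passing, chance that at least one of 10 passes.
print(round(pass_at_k(200, 37, 10), 3))
```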

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., et al. (2021). NeurIPS 2021.

MATH benchmark.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, C.E., Yang, J., Wettig, A., et al. (2024). ICLR 2024.

SWE-bench benchmark for software engineering.

On the Measure of Intelligence

Chollet, F. (2019). arXiv.

ARC benchmark; abstract reasoning corpus.

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Chiang, W.-L., Zheng, L., Sheng, Y., et al. (2024). ICML 2024.

LMSYS Chatbot Arena methodology.
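
As rough intuition for how pairwise preference votes become ratings, the sketch below applies a textbook online Elo update. The Arena methodology in the paper fits a Bradley-Terry model with confidence intervals, so treat this as a simplification rather than their production pipeline.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update on a 400-point logistic scale.
    score_a is 1.0 if model A wins the vote, 0.0 if model B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1200-rated model wins a vote against a 1250-rated model.
print(elo_update(1200.0, 1250.0, 1.0))
```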

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Zhuo, T.Y., Vu, M.C., Chim, J., et al. (2024). arXiv.

BigCodeBench methodology.

LiveBench: A Challenging, Contamination-Free LLM Benchmark

White, C., Dooley, S., Roberts, M., et al. (2024). arXiv.

LiveBench methodology.

Domain-Specific Benchmarks

What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams

Jin, D., Pan, E., Oufattole, N., et al. (2021). Applied Sciences.

MedQA benchmark used in our Healthcare leaderboard.

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Guha, N., Nyarko, J., Ho, D., et al. (2023). NeurIPS 2023 Datasets & Benchmarks Track.

LegalBench benchmark used in our Legal leaderboard.

FinQA: A Dataset of Numerical Reasoning over Financial Data

Chen, Z., Chen, W., Smiley, C., et al. (2021). EMNLP 2021.

FinQA benchmark used in our Finance leaderboard.

FinanceBench: A New Benchmark for Financial Question Answering

Islam, P., Kannappan, A., Kiela, D., et al. (2023). arXiv.

FinanceBench used in our Finance leaderboard.

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F.F., Zhu, H., et al. (2024). ICLR 2024.

WebArena benchmark used in our AI Agents leaderboard.

GAIA: A Benchmark for General AI Assistants

Mialon, G., Fourrier, C., Swift, C., et al. (2024). ICLR 2024.

GAIA benchmark used in our AI Agents leaderboard.

Model Technical Reports

GPT-4 Technical Report

OpenAI (2023). arXiv.

Foundation for GPT-4o and successors.

The Claude 3 Model Family: Opus, Sonnet, Haiku

Anthropic (2024). Model card.

Claude 3 model card and capabilities.

Gemini: A Family of Highly Capable Multimodal Models

Google DeepMind (2024). arXiv.

Gemini model family.

The Llama 3 Herd of Models

Meta AI (2024). arXiv.

Llama 3 technical report.

DeepSeek-V3 Technical Report

DeepSeek AI (2024). arXiv.

DeepSeek V3 MoE architecture.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek AI (2025). arXiv.

DeepSeek R1 reasoning model.

Mixtral of Experts

Mistral AI (2024). arXiv.

Mixture-of-Experts architecture.

Safety & Alignment

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y., Jones, A., Ndousse, K., et al. (2022). arXiv.

Anthropic RLHF methodology.

Deep reinforcement learning from human preferences

Christiano, P., Leike, J., Brown, T., et al. (2017). NeurIPS 2017.

Foundational RLHF paper.

Frontier AI Trends Report

UK AI Security Institute (2025). AISI.

Evaluations of 30+ frontier models.

Compliance & Terms of Service

API Usage

  • OpenRouter API — public endpoint, no authentication required. We use their /api/v1/models endpoint, which is explicitly designed for programmatic access (see the sketch after this list).
  • OpenAI API — we optionally use the models list endpoint to verify model availability. Requires an API key when configured.
  • HuggingFace Spaces — we access public Gradio API endpoints for Chatbot Arena and Open LLM Leaderboard data.
  • News collection — we prioritise public APIs, RSS and Atom feeds, official provider blogs, newsroom pages, and other clearly public source surfaces rather than scraping full article bodies.
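
As an example of the first item above, the snippet below queries the public OpenRouter models endpoint with our identified User-Agent. The response shape ({"data": [...]}) is an assumption based on typical JSON list endpoints and may differ in detail.

```python
import requests

HEADERS = {"User-Agent": "The-AI-Resource-Hub-Bot/1.0"}

resp = requests.get("https://openrouter.ai/api/v1/models", headers=HEADERS, timeout=30)
resp.raise_for_status()
models = resp.json().get("data", [])  # assumed response shape
print(f"{len(models)} models listed")
```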

Web Scraping Practices

  • All scrapers use an identified User-Agent: The-AI-Resource-Hub-Bot/1.0 (see the sketch after this list)
  • We respect robots.txt directives on all sites
  • Collection runs on a conservative scheduled cadence and stays well below common rate limits
  • We only access publicly available pages and API endpoints
  • We do not circumvent paywalls, authentication, or access controls
  • Pricing data is factual information used for comparison purposes
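
A minimal sketch of the practices above: an identified User-Agent, a robots.txt check before fetching, and a conservative delay. The helper below is illustrative only, not our production scraper.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "The-AI-Resource-Hub-Bot/1.0"
MIN_DELAY_S = 5.0  # conservative pause between requests; illustrative value

def polite_get(url: str):
    """Fetch a public page only if the site's robots.txt allows our User-Agent."""
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # respect robots.txt directives
    time.sleep(MIN_DELAY_S)
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)

# Example: polite_get("https://example.com/")
```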

Data Licensing

  • Epoch AI — data used under CC-BY 4.0 licence. Attribution: epoch.ai/data
  • Academic papers — cited under fair use for commentary, comparison, and educational purposes
  • Benchmark scores — factual data reported from official sources with full attribution
  • Provider logos/names — used nominatively for identification and comparison

Corrections & Takedowns

If you represent a data source listed here and have concerns about how we use your data, please review the repository and contact the site owner via the GitHub profile. We take accuracy and compliance seriously and will review credible requests promptly.

How We Use This Data

For details on our scoring formula, quality metrics, and update frequency, see our Methodology page. For the practical rules behind source collection, routing, and exclusions, see the Collection Policy.