References & Sources

Every data point on this site has a source. This page lists all external data sources, academic papers, and benchmark methodologies we reference. See our methodology page for how we score and rank models, and our collection policy for what we will and will not ingest.

Data Sources

We refresh data through a scheduled hourly pipeline, with manual provider-status reruns available when needed. Collection uses identified User-Agent strings and is limited to public APIs, public feeds, official blogs and newsroom pages, documentation pages, and other clearly public endpoints that fit the site brief.
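
Purely as an illustration of that setup, the sketch below shows what such a collection configuration could look like. The dictionary, its keys, and its values are hypothetical examples, not an excerpt from the real pipeline.

```python
# Hypothetical collection settings mirroring the cadence and source rules described above.
COLLECTION_CONFIG = {
    "schedule": "0 * * * *",               # hourly cron expression
    "manual_reruns": ["provider-status"],  # on-demand rerun mentioned above
    "user_agent": "The-AI-Resource-Hub-Bot/1.0",
    "allowed_sources": [
        "public_api",
        "rss_atom_feed",
        "official_blog",
        "newsroom_page",
        "documentation_page",
    ],
}
```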

Model Pricing & Availability

Refresh cadence: hourly

Primary pricing source; 500+ models with 8 pricing dimensions. No authentication required.

Official GPT and o-series model pricing.

Official Claude model pricing.

Official Gemini model pricing.

Official Mistral model pricing.

Official DeepSeek model pricing.

Official Grok model pricing.

Benchmark Scores

Refresh cadence: as published

Crowdsourced Elo ratings from 5M+ human preference votes.

Aggregated benchmark leaderboards across ML tasks.

Standardised evaluations for open-weight models.

Holistic Evaluation of Language Models.

Benchmark results with historical trend data.

Contamination-free benchmark with monthly question refresh.

Abstract reasoning benchmark measuring fluid intelligence.

Real-world GitHub issue resolution benchmark.

1,865 long-horizon tasks across 41 repos; harder successor to SWE-bench.

Code generation benchmark with complex instructions.

Independent speed and quality benchmarks; source for TTFT and output speed data.

Expert-driven evaluations and safety benchmarks.

Speed & Latency

Refresh cadence: hourly

Primary source for TTFT (time to first token) and output speed measurements.

Provider benchmarks

Official performance data published by OpenAI, Anthropic, Google, and others.
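
For readers unfamiliar with these metrics, the sketch below shows one common way to derive TTFT and output speed from a streamed response. The convention used here (tokens after the first divided by generation time) is an illustrative assumption, not the exact methodology of any source listed above.

```python
import time

def measure_stream(stream):
    """Return (ttft_seconds, output_tokens_per_second) for an iterable of token events.

    TTFT is the time from request start to the first token; output speed is the
    number of tokens after the first divided by the time taken to generate them.
    """
    start = time.perf_counter()
    first_token_at = None
    token_count = 0
    for _token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1
    end = time.perf_counter()
    if first_token_at is None:
        return None, None  # stream produced no tokens
    ttft = first_token_at - start
    gen_time = end - first_token_at
    speed = (token_count - 1) / gen_time if token_count > 1 and gen_time > 0 else None
    return ttft, speed

# Demo with a simulated stream of five tokens arriving 50 ms apart.
def fake_stream():
    for _ in range(5):
        time.sleep(0.05)
        yield "tok"

print(measure_stream(fake_stream()))
```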

Research & Trend Data

Refresh cadence: ongoing

Largest public database of notable ML models (3,200+ from 1950–present). Training compute estimates, parameter counts, training costs. CC-BY licensed.

Annual comprehensive report tracking AI across technical, economic, and societal dimensions.

Interactive visualisations of AI model counts, compute growth, and country-level trends.

Safety & Frontier Evaluations

Refresh cadence: as published

Model Evaluation & Threat Research. Pre-deployment evaluator for frontier models; publishes RE-Bench.

Evaluated 30+ frontier models across cyber, biology, and autonomy domains.

AI safety evaluations focused on scheming and deception detection.

Model Discovery & News

Refresh cadence: hourly

Model cards for open-weight models; parameter counts, licences, release dates.

Research papers in cs.AI, cs.LG, and cs.CL.

Industry news and analysis.

AI industry reporting and product coverage.

Consumer and platform coverage from a public AI-specific feed.

Technology reporting filtered for AI-relevant coverage.

Provider blogs

Official blogs and newsroom pages from OpenAI, Anthropic, Google, and other major labs.

Academic Papers & Citations

Research papers referenced in our benchmark scores, blog posts, guides, and model evaluations, grouped by category.

Foundational Research

Attention Is All You Need

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). NeurIPS 2017.

The Transformer architecture paper.

Improving Language Understanding by Generative Pre-Training

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. (2018). OpenAI.

GPT-1.

Language Models are Unsupervised Multitask Learners

Radford, A., Wu, J., Child, R., et al. (2019). OpenAI.

GPT-2.

Language Models are Few-Shot Learners

Brown, T., Mann, B., Ryder, N., et al. (2020). NeurIPS 2020.

GPT-3; introduced in-context learning.

Training language models to follow instructions with human feedback

Ouyang, L., Wu, J., Jiang, X., et al. (2022). NeurIPS 2022.

InstructGPT / RLHF paper.

Benchmark Methodologies

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., et al. (2021). ICLR 2021.

MMLU benchmark.

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, D., Hou, B.L., Stickland, A.C., et al. (2023). arXiv.

GPQA benchmark; expert-level questions.

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., et al. (2021). arXiv.

HumanEval benchmark for code generation.
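
The HumanEval paper also introduced the unbiased pass@k estimator that most code-generation results reported here rely on. The function below is our own transcription of that formula, not code taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from the HumanEval paper: 1 - C(n - c, k) / C(n, k),
    where n samples were drawn for a problem and c of them passed the tests."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any draw of k must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 passing, chance that at least one of 10 passes.
print(round(pass_at_k(200, 37, 10), 3))
```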

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., et al. (2021). NeurIPS 2021.

MATH benchmark.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, C.E., Yang, J., Wettig, A., et al. (2024). ICLR 2024.

SWE-bench benchmark for software engineering.

On the Measure of Intelligence

Chollet, F. (2019). arXiv.

ARC benchmark; abstract reasoning corpus.

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Chiang, W.-L., Zheng, L., Sheng, Y., et al. (2024). ICML 2024.

LMSYS Chatbot Arena methodology.
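
As rough intuition for how pairwise preference votes become ratings, the sketch below applies a textbook online Elo update. The Arena methodology in the paper fits a Bradley-Terry model with confidence intervals, so treat this as a simplification rather than their production pipeline.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update on a 400-point logistic scale.
    score_a is 1.0 if model A wins the vote, 0.0 if model B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1200-rated model wins a vote against a 1250-rated model.
print(elo_update(1200.0, 1250.0, 1.0))
```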

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Zhuo, T.Y., Vu, M.C., Chim, J., et al. (2024). arXiv.

BigCodeBench methodology.

LiveBench: A Challenging, Contamination-Free LLM Benchmark

White, C., Dooley, S., Roberts, M., et al. (2024). arXiv.

LiveBench methodology.

Domain-Specific Benchmarks

What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams

Jin, D., Pan, E., Oufattole, N., et al. (2021). Applied Sciences.

MedQA benchmark used in our Healthcare leaderboard.

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Guha, N., Nyarko, J., Ho, D., et al. (2023). NeurIPS 2023 Datasets & Benchmarks Track.

LegalBench benchmark used in our Legal leaderboard.

FinQA: A Dataset of Numerical Reasoning over Financial Data

Chen, Z., Chen, W., Smiley, C., et al. (2021). EMNLP 2021.

FinQA benchmark used in our Finance leaderboard.

FinanceBench: A New Benchmark for Financial Question Answering

Islam, P., Kannappan, A., Kiela, D., et al. (2023). arXiv.

FinanceBench used in our Finance leaderboard.

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F.F., Zhu, H., et al. (2024). ICLR 2024.

WebArena benchmark used in our AI Agents leaderboard.

GAIA: A Benchmark for General AI Assistants

Mialon, G., Fourrier, C., Swift, C., et al. (2024). ICLR 2024.

GAIA benchmark used in our AI Agents leaderboard.

Model Technical Reports

GPT-4 Technical Report

OpenAI (2023). arXiv.

Foundation for GPT-4o and successors.

The Claude 3 Model Family: Opus, Sonnet, Haiku

Anthropic (2024). Model card.

Claude 3 model card and capabilities.

Gemini: A Family of Highly Capable Multimodal Models

Google DeepMind (2024). arXiv.

Gemini model family.

The Llama 3 Herd of Models

Meta AI (2024). arXiv.

Llama 3 technical report.

DeepSeek-V3 Technical Report

DeepSeek AI (2024). arXiv.

DeepSeek V3 MoE architecture.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek AI (2025). arXiv.

DeepSeek R1 reasoning model.

Mixtral of Experts

Mistral AI (2024). arXiv.

Mixture-of-Experts architecture.

Safety & Alignment

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y., Jones, A., Ndousse, K., et al. (2022). arXiv.

Anthropic RLHF methodology.

Deep reinforcement learning from human preferences

Christiano, P., Leike, J., Brown, T., et al. (2017). NeurIPS 2017.

Foundational RLHF paper.

Frontier AI Trends Report

UK AI Security Institute (2025). AISI.

Evaluations of 30+ frontier models.

Compliance & Terms of Service

API Usage

  • OpenRouter API — public endpoint, no authentication required. We use their /api/v1/models endpoint, which is explicitly designed for programmatic access (see the sketch after this list).
  • OpenAI API — we optionally use the models list endpoint to verify model availability. Requires an API key when configured.
  • HuggingFace Spaces — we access public Gradio API endpoints for Chatbot Arena and Open LLM Leaderboard data.
  • News collection — we prioritise public APIs, RSS and Atom feeds, official provider blogs, newsroom pages, and other clearly public source surfaces rather than scraping full article bodies.
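
As an example of the first item above, the snippet below queries the public OpenRouter models endpoint with our identified User-Agent. The response shape ({"data": [...]}) is an assumption based on typical JSON list endpoints and may differ in detail.

```python
import requests

HEADERS = {"User-Agent": "The-AI-Resource-Hub-Bot/1.0"}

resp = requests.get("https://openrouter.ai/api/v1/models", headers=HEADERS, timeout=30)
resp.raise_for_status()
models = resp.json().get("data", [])  # assumed response shape
print(f"{len(models)} models listed")
```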

Web Scraping Practices

  • All scrapers use an identified User-Agent: The-AI-Resource-Hub-Bot/1.0 (see the sketch after this list)
  • We respect robots.txt directives on all sites
  • Collection runs on a conservative scheduled cadence and stays well below common rate limits
  • We only access publicly available pages and API endpoints
  • We do not circumvent paywalls, authentication, or access controls
  • Pricing data is factual information used for comparison purposes
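
A minimal sketch of the practices above: an identified User-Agent, a robots.txt check before fetching, and a conservative delay. The helper below is illustrative only, not our production scraper.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "The-AI-Resource-Hub-Bot/1.0"
MIN_DELAY_S = 5.0  # conservative pause between requests; illustrative value

def polite_get(url: str):
    """Fetch a public page only if the site's robots.txt allows our User-Agent."""
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # respect robots.txt directives
    time.sleep(MIN_DELAY_S)
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)

# Example: polite_get("https://example.com/")
```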

Data Licensing

  • Epoch AI — data used under CC-BY 4.0 licence. Attribution: epoch.ai/data
  • Academic papers — cited under fair use for commentary, comparison, and educational purposes
  • Benchmark scores — factual data reported from official sources with full attribution
  • Provider logos/names — used nominatively for identification and comparison

Corrections & Takedowns

If you represent a data source listed here and have concerns about how we use your data, please review the repository and contact the site owner via the GitHub profile. We take accuracy and compliance seriously and will review credible requests promptly.

How We Use This Data

For details on our scoring formula, quality metrics, and update frequency, see our Methodology page. For the practical rules behind source collection, routing, and exclusions, see the Collection Policy.