70 Years of AI

Who builds it, what it costs, and who gets to use it. A data story tracing 3,305 models from 1950 to 2025.

Zoya Malik  ·  Data: Epoch AI

Data art visualization of 3,305 AI models as a starfield — each dot is a model, positioned by publication year and parameter count, colored by domain
3,305 AI models tracked
75 yrs · 1950 → 2025
944 models in 2024 alone
$388M peak training cost
722 model lineage links
00

Introduction

In 2024, researchers published 944 AI models, roughly three per day. That is more than the field produced from 1950 through 2018 combined. The pace at which this field moves makes it almost impossible to keep up anecdotally, which is exactly why data matters.

This analysis is built on the Epoch AI Models dataset, a systematically curated record of over 3,300 AI systems spanning seven decades, from 1950s perceptrons to frontier large language models. Entries record training compute, parameter counts, publishing organization, country of origin, training cost (where reported), and whether the weights were made publicly available.

Three questions sit at the center of this story. Who is building the frontier of AI, and is that power concentrating or spreading? What does it cost to train a competitive model, and how has that changed? And who gets access to the results? The open vs. closed debate is not just philosophical; it determines who can build on top of these systems and who cannot.

The dataset required substantial cleaning before analysis: 15 duplicate records were removed, multi-valued fields (Domain, Country, Organization) were split into primary values for grouping, 11 numeric columns were coerced from mixed-type strings, and three supplementary files were merged to add frontier, notable, and large-scale flags. Log-transformed versions of Parameters and Training Compute were derived given the extreme right skew (parameters span 10 to 10¹⁴). The final cleaned dataset contains 3,305 rows and 76 columns. Full methodology is documented in the data cleaning report.
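A minimal pandas sketch of these cleaning steps, run on a three-row toy frame. The column names are assumptions chosen to mirror the fields described above, not the exact Epoch AI schema:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw dataset; column names are assumptions,
# not the exact Epoch AI schema.
raw = pd.DataFrame({
    "Model": ["GPT-3", "GPT-3", "AlphaFold 2"],
    "Domain": ["Language", "Language", "Biology,Vision"],
    "Parameters": ["175000000000", "175000000000", "93000000"],
})

df = raw.drop_duplicates(subset="Model").copy()              # remove duplicate records
df["Domain_primary"] = df["Domain"].str.split(",").str[0]    # primary value for grouping
df["Parameters"] = pd.to_numeric(df["Parameters"], errors="coerce")  # coerce mixed-type strings
df["log_Parameters"] = np.log10(df["Parameters"])            # tame the extreme right skew
```

The log transform matters for everything downstream: clustering, regression, and the scatter plots all operate on log-scale features.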

01

The Exponential Surge

Research Question 1 How has the volume and domain composition of AI model publications evolved from 1950 to 2025, and has the relationship between model scale and training compute remained consistent over time?

The history of AI is not a smooth curve; it is a series of inflection points, each triggered by a methodological breakthrough that unlocked a new order of scale. The dataset makes this visible in a way that narrative accounts often obscure.

From 1950 to 2016, the field produced a few hundred models in total. Then, in 2017, the Transformer architecture arrived. Model publication counts jumped sharply and have not stopped since. By 2023, the year ChatGPT normalized public interaction with large language models, the annual count reached 521. In 2024, it nearly doubled again to 944.

Key pattern: The Language domain accounts for 47% of all models in the dataset, the clearest single indicator of where the field's gravitational center has been since 2017. Biology (12%) and Vision (11%) are distant second and third.

The surge is not uniform across domains. Biology, driven largely by protein structure prediction following AlphaFold, emerged as a serious second tier. Vision models scaled steadily. Multimodal systems, which barely existed as a category before 2020, now represent a meaningful share of new releases. The field has not just grown; it has diversified.

Stacked bar chart showing number of AI models published per year by domain from 1950 to 2025, with sharp growth after 2017
Figure: AI model publications per year by domain, 1950–2025. Language models account for the largest share throughout the post-2017 surge. Biology, Vision, and Multimodal contribute meaningful volumes from 2020 onward. The 2017 and 2023 inflection points mark the Transformer era and the ChatGPT-driven public AI boom respectively.
02

When Scaling Laws Broke Down

Supporting analysis for Research Question 1 — the scaling dimension.

For most of the deep learning era, there was a simple rule: more parameters required proportionally more compute to train. The relationship between model size and training FLOPs followed a predictable power law, and the implication was clear: if you wanted a better model, you needed a bigger model trained on more compute.

That rule held until around 2020. Then it started to crack.

Scatter plots showing compute vs parameters pre- and post-2020, with regression trend lines
Figure 1: Scaling law divergence, pre- and post-2020. Each dot is a model. The trend line slope quantifies the compute-per-parameter relationship. The post-2020 panel shows a flatter slope and wider spread, indicating that recent models achieve comparable capability with less compute per parameter than their predecessors.

The 2022 Chinchilla paper by DeepMind was the clearest articulation of what the data had been showing: GPT-3-scale models were significantly undertrained relative to their parameter counts. The efficient frontier was not a single line; it was a region, and most large models were nowhere near it.

The post-2020 scatter tells that story. The slope between log(compute) and log(parameters) flattens noticeably, and the spread widens. Academia, constrained by smaller compute budgets, tracks the efficient frontier more tightly. Industry, with room to experiment, produces both the most compute-hungry models and some of the most efficient ones.
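The slope comparison behind the figure is an ordinary least-squares fit in log-log space. A minimal sketch on synthetic data, where the planted slopes and noise levels are invented to illustrate the pattern, not estimated from the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitted_slope(log_params, log_compute):
    """OLS slope of log10(compute) on log10(parameters)."""
    slope, _intercept = np.polyfit(log_params, log_compute, deg=1)
    return slope

# Synthetic illustration: a steep, tight pre-2020 relationship
# versus a flatter, noisier post-2020 one.
p = rng.uniform(6, 12, 200)                        # log10(parameters)
pre_2020 = 1.5 * p + 2 + rng.normal(0, 0.3, 200)   # log10(FLOPs), tight fit
post_2020 = 1.1 * p + 6 + rng.normal(0, 1.2, 200)  # flatter slope, wider spread

print(fitted_slope(p, pre_2020), fitted_slope(p, post_2020))
```

On the real data, a flattening slope together with widening residual spread is exactly the signature described above.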

What this means: The era of "just add more parameters" is over, at least as the only strategy. Techniques like mixture-of-experts, better data curation, and improved training recipes have made the relationship between scale and capability significantly more complex.
03

The Geography of Intelligence

Research Question 2 Which countries are producing AI models, and is geographic concentration at the frontier increasing or spreading over time?

If you look at model counts alone, the story seems simple: the United States dominates, followed by China, the United Kingdom, South Korea, Canada, and France. But model counts are a blunt instrument. The more interesting question is what kind of models each country produces, and whether geographic concentration is increasing or decreasing over time.

The US-China gap is real but narrowing. In 2018, US organizations were responsible for roughly 60% of models in the dataset. By 2024, that share had dropped as Chinese labs, including Alibaba, ByteDance, DeepSeek, and Baidu, emerged as serious contributors, particularly in the Language and Vision domains. The UK punches above its weight in Biology (DeepMind's AlphaFold lineage). South Korea's strength is concentrated in Language and Multimodal systems.

Interactive: AI model output by country, 2010–2025, animated by year. Each bubble is sized by model count and colored by primary domain. The US dominates early; China's labs emerge around 2017; Europe and East Asia fill in through 2024.

The concentration question has a more ambiguous answer than the headlines suggest. Among all models, the US share fell from roughly 60% in 2018 to 38% in 2024 as other countries scaled output. Among frontier-class models specifically, however, the picture is far more concentrated: the United States accounts for two-thirds of all frontier releases, with the UK (7%) and China (4%) trailing significantly. Below the frontier, production has genuinely democratized: the barrier to publishing a Language model has dropped dramatically; the barrier to training a frontier one has not.

A chi-squared test of independence on the top-10 country by top-10 domain contingency table (n = 2,845) confirms that the relationship between a model's country of origin and its research domain is statistically significant (χ²(81) = 1,297, p < 0.001). The effect size (Cramér’s V = 0.23) is modest but meaningful: country of origin is a real predictor of domain specialization, not just a reflection of overall output volume. China is notably overrepresented in Biology and Vision relative to its share of Language models; the United States dominates Language but produces a more balanced domain portfolio overall.
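The test itself is a few lines of scipy. The 3×3 table below is a toy stand-in for the article's full 10×10 country-by-domain table (n = 2,845); these counts are invented for illustration:

```python
import numpy as np
from scipy import stats

# Toy 3x3 country-by-domain contingency table (invented counts).
table = np.array([
    [400, 120, 180],   # country A: Language, Biology, Vision
    [150,  90, 140],   # country B
    [ 60,  70,  30],   # country C
])

chi2, p, dof, expected = stats.chi2_contingency(table)

# Cramér's V: effect size for an r x c contingency table.
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.3g}, V = {cramers_v:.2f}")
```

Cramér's V rescales the chi-squared statistic to a 0–1 range, which is what lets the article call a highly significant result "modest but meaningful" in effect size.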

04

Open vs. Closed: The Access Divide

Research Question 3 What factors most strongly predict whether a model releases its weights publicly, and how has the open vs. closed balance shifted from 2012 to 2025?

The question of whether AI model weights should be publicly released is one of the most contested in the field. The arguments are familiar: open weights enable research, accelerate innovation, and democratize access; closed weights protect commercial interests, allow for safer deployment, and prevent misuse. What does the data actually show?

Line chart showing percentage of open-weight models per year from 2012 to 2025
Figure 2: Open-weight models as a share of annual releases (2012–2025). The shaded area shows the percentage of new models each year that released weights publicly. The 2023 surge corresponds to Meta's Llama 2 release and the broader open-source wave it catalyzed.

The open-source wave of 2023, driven by Llama 2, Mistral, and a flood of derivatives, was real and statistically significant. It did not fully hold: the open proportion declined to 43% in 2024 before partially recovering to 51% in 2025. Total model counts continued to surge throughout this period, meaning closed models, dominated by industry labs, have grown substantially in absolute number.

Horizontal bar chart of logistic regression coefficients predicting open model weights
Figure 3: What predicts open model weights: logistic regression. Coefficients are standardized, so magnitudes are directly comparable. Positive values predict open weights; negative values predict closed. Model: 5-fold cross-validated logistic regression (AUC shown in chart).

The logistic regression results are instructive. Being an industry organization is the single strongest predictor of closed weights, stronger than model size, compute, or domain. Academic organizations are significantly more likely to release openly. Frontier status is a moderate negative predictor: the models most worth having are the ones least likely to be shared.

The access paradox: Publication year is a positive predictor of openness: newer models are somewhat more likely to be open. But the models driving this trend are mid-scale systems, not frontier ones. The gap between what's available to researchers and what's actually being used in production at the frontier is arguably wider today than it was in 2020.
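The modeling setup can be sketched as follows. The feature names and the data-generating process are assumptions chosen to mirror the reported directions (industry predicts closed, recency predicts open), not the actual dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-ins for three of the features named above.
is_industry = rng.integers(0, 2, n)
log_params = rng.normal(9, 1.5, n)
year = rng.integers(2012, 2026, n)

# Assumed generating process mirroring the reported directions:
# industry pushes toward closed weights, recency toward open.
logits = -1.8 * is_industry + 0.15 * (year - 2018) + rng.normal(0, 1, n)
is_open = (logits > 0).astype(int)

X = np.column_stack([is_industry, log_params, year])
model = make_pipeline(StandardScaler(), LogisticRegression())
auc = cross_val_score(model, X, is_open, cv=5, scoring="roc_auc").mean()

model.fit(X, is_open)
coefs = dict(zip(["industry", "log_params", "year"], model[-1].coef_[0]))
```

Standardizing inside the pipeline is what makes the coefficient magnitudes directly comparable across features.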
05

Four Tiers of AI

Research Question 4 Do AI models cluster into distinct resource tiers, or do they exist on a smooth spectrum? How does domain composition differ across tiers, and what factors best predict a model's notable impact on the field?

There is a common assumption that AI models exist on a smooth spectrum from small to large. The data suggests otherwise. When you apply K-means clustering to log-transformed resource profiles (parameters, compute, dataset size, and hardware quantity), four distinct tiers emerge with surprisingly clean boundaries.
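A minimal version of the clustering, run on synthetic log-scale resource profiles with four planted tiers. The planted centers below are invented for illustration, not the fitted ones:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Four planted tiers in log10 resource space (parameters, compute,
# dataset size, hardware count). Centers are invented for illustration.
centers = np.array([
    [ 7.0, 17.0,  9.0, 0.5],   # research-scale
    [ 9.0, 21.0, 10.5, 1.5],   # production-scale
    [11.0, 24.0, 12.0, 3.0],   # frontier-class
    [12.0, 26.0, 13.0, 4.0],   # true frontier
])
X = np.vstack([c + rng.normal(0, 0.4, (120, 4)) for c in centers])

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(StandardScaler().fit_transform(X))
print(np.bincount(labels))  # roughly 120 models per recovered tier
```

Standardizing before K-means matters here: without it, the feature with the largest raw log-range would dominate the distance metric.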

PCA scatter of AI model resource clusters showing four distinct resource tiers; domain composition by tier on the right
Figure 4: Four natural resource tiers in AI development. Left: PCA projection of log-scale resource features, colored by cluster. Right: Domain composition within each tier; Language dominates all tiers but is especially concentrated at the frontier.

The tiers are intuitive once you see them. Research-scale models, the kind a well-funded academic lab can train, cluster tightly at the low end. Production-scale models, the kind deployed in real products, form a middle band. Frontier-class models occupy their own stratum, and a small cohort of true frontier models sits at an extreme that looks almost disconnected from the rest.

The domain composition is telling. Biology models are almost entirely in the research-scale and mid-scale tiers. AlphaFold notwithstanding, the field does not require GPT-scale compute budgets. Multimodal models, by contrast, are disproportionately represented at the top two tiers, reflecting the added cost of processing multiple data modalities simultaneously.

06

Language vs. Multimodal: Converging at the Top

Supplementary analysis: domain convergence trends over time.

Multimodal models, systems that process text alongside images, audio, video, or other modalities, have been the defining architectural trend of the 2022–2025 period. GPT-4V, Gemini, Claude 3, and their contemporaries represent a qualitative shift from the language-only era. The question is whether multimodal models are following the same scaling trajectory as language models, or charting a different path.

Side-by-side charts showing parameter growth and publication rate for Language vs Multimodal models
Figure 5: Language vs. Multimodal scaling trajectories (2015–2025). Left: Median parameter count over time (lines) with individual models (dots). The gap between domains narrows significantly post-2022. Right: Annual publication counts; multimodal release rates accelerated sharply in 2023.

The convergence story is real but partial. At the median, multimodal models have rapidly closed the parameter gap with language models; by 2024, the distributions overlap substantially. At the top end, however, the largest language models still eclipse the largest multimodal ones in raw parameter count. Adding modalities turns out to require more compute per parameter, not just more parameters.

Multimodal models are not replacing language models. They are consuming them, using a language backbone and adding modality-specific components on top. The two trajectories are converging because multimodal systems are increasingly built from language models.

The publication rate chart tells a different story: multimodal releases have accelerated faster than language ones in relative terms. The field's attention has shifted, even if the absolute parameter counts have not yet caught up.

07

The Price of Intelligence

Supplementary analysis: economics of model training.

Training cost is the least-reported column in the dataset. Only 7% of models include a cost figure, which is itself a data point. Labs are not eager to publish how much they spend. But for the models where cost is known, the numbers tell a clear and sobering story.

Two panels: Ridge regression coefficients for cost drivers on the left; cost vs compute scatter on the right
Figure 6: What drives training cost? Left: Ridge regression coefficients on standardized features. Compute is the dominant predictor (CV R² shown in title). Right: Cost vs compute scatter colored by organization type, with overall trend line. The correlation is tight but not perfect; hardware efficiency and cloud pricing introduce meaningful variance.

Compute is, by a wide margin, the strongest predictor of training cost. This is expected: more FLOPs means more GPU-hours means higher bills. But the relationship is not deterministic. Hardware quantity and organization type both add independent explanatory power. Industry organizations spend more for a given compute budget than academic ones, consistent with the hypothesis that they are training on proprietary infrastructure with different cost structures than academic compute grants.
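A sketch of the regression setup, on synthetic data whose generating process is assumed to mirror the reported pattern (compute dominates; hardware and organization type add smaller independent effects):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 230  # roughly the 7% of 3,305 models that report a cost

log_compute = rng.normal(23, 2, n)
log_hardware = 0.4 * log_compute + rng.normal(0, 1, n)  # correlated, as expected
is_industry = rng.integers(0, 2, n)

# Assumed generating process: compute dominates cost, with smaller
# independent contributions from hardware count and org type.
log_cost = (0.9 * log_compute + 0.2 * log_hardware
            + 0.3 * is_industry + rng.normal(0, 0.5, n) - 18)

X = np.column_stack([log_compute, log_hardware, is_industry])
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
r2 = cross_val_score(model, X, log_cost, cv=5, scoring="r2").mean()

model.fit(X, log_cost)
coefs = dict(zip(["log_compute", "log_hardware", "industry"], model[-1].coef_))
```

Ridge rather than plain OLS is a sensible choice here because compute and hardware quantity are strongly collinear; the penalty stabilizes the coefficient estimates.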

The median training cost for models with known costs is around $27,000, accessible to a serious research group. The 95th percentile exceeds $10 million. The top of the distribution, anchored by GPT-4-scale systems, approaches $400 million. From the median to the peak is a spread of more than four orders of magnitude. The practical implication is clear: fewer organizations can credibly train a competitive model today than could a decade ago.

The cost trajectory: Even adjusting for hardware improvements (which have driven down cost-per-FLOP roughly 2× every two years), the raw dollar cost of training frontier models has grown faster than efficiency gains can offset. The frontier is not getting cheaper to reach; it is getting more expensive.
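The arithmetic is simple to check under assumed round-number rates. The cost-per-FLOP halving comes from the callout above; frontier compute growing roughly 4× per year is an assumption for illustration, not a figure from this dataset:

```python
# Assumed rates, for illustration only: frontier compute grows ~4x/year
# (an assumption, not from this dataset); cost-per-FLOP halves every
# two years (the hardware trend cited in the callout).
years = 6
compute_growth = 4.0 ** years          # frontier FLOPs multiplier over 6 years
efficiency_gain = 2.0 ** (years / 2)   # cost-per-FLOP reduction factor
cost_multiplier = compute_growth / efficiency_gain
print(cost_multiplier)  # 512.0: dollar cost still grows ~500x over 6 years
```

Under these rates, efficiency offsets only a small fraction of the compute growth, which is the sense in which the frontier gets more expensive even as hardware gets cheaper.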
08

What Actually Makes a Model Notable?

Supporting analysis for Research Question 4 — predictors of model impact.

The dataset includes a curated subset of 981 notable models, selected by Epoch AI based on citation impact, historical significance, and research influence. This raises a question: is notability just a proxy for size, or do other factors matter independently?

Horizontal bar chart of random forest feature importances for predicting model notability
Figure 7: Predicting notability: random forest feature importances. A model trained on numeric and categorical features to classify notable vs. non-notable models. Higher bars indicate stronger predictive contribution. Cross-validated AUC is shown in the chart.

The random forest results challenge the assumption that bigger is always more notable. Parameter count is a predictor, but it is not the dominant one. Publication year is the single strongest feature, which reflects the expansion of the dataset itself more than anything about the models. More interestingly, open weights and frontier status both contribute independently, suggesting that models which shaped the field are disproportionately the ones that were released publicly.
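The classifier setup, sketched on synthetic features whose signal strengths are assumed to mirror the reported ordering (year strongest, size and openness smaller):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n = 2000

year = rng.integers(1990, 2026, n)
log_params = rng.normal(8, 2, n)
is_open = rng.integers(0, 2, n)

# Assumed signal strengths mirroring the reported ordering:
# year strongest, then size, then openness.
signal = 0.25 * (year - 2015) + 0.3 * log_params + 0.8 * is_open
notable = (signal + rng.normal(0, 2, n) > np.median(signal)).astype(int)

X = np.column_stack([year, log_params, is_open])
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, notable)
importances = dict(zip(["year", "log_params", "open"], rf.feature_importances_))
```

One caveat worth keeping in mind: impurity-based importances tend to favor continuous features with many split points over binary flags, so the open-weights contribution is likely understated rather than overstated by this method.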

The models that changed AI were not always the largest. BERT (340M parameters, 2018) catalyzed a decade of transfer learning research. AlphaFold 2 (93M parameters) solved protein folding. GPT-2 (1.5B parameters) sparked the open-source debate before GPT-3 existed. Notability is about what a model enables, not just what it measures.

09

The Family Tree

One of the more underappreciated patterns in the dataset is the lineage structure. The Base model column records which existing model a given system was built on, through fine-tuning, instruction-tuning, or architectural extension. This creates a directed network with 3,305 nodes and 722 edges.
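Constructing that network from the Base model column reduces to building a directed edge list and counting out-degrees. A stdlib sketch on toy rows; the derivative names and links below are hypothetical:

```python
from collections import Counter

# Toy (model, base-model) rows; the links here are hypothetical.
# The real dataset records 722 such edges across 3,305 models.
rows = [
    ("Llama 2 7B", None),
    ("chat fine-tune A", "Llama 2 7B"),
    ("instruction-tune B", "Llama 2 7B"),
    ("Mistral 7B", None),
    ("fine-tune C", "Mistral 7B"),
]

# Directed edge list: parent -> derivative.
edges = [(base, model) for model, base in rows if base is not None]

# Documented derivatives per parent = out-degree in the network.
derivatives = Counter(parent for parent, _child in edges)
print(derivatives.most_common())
```

Ranking parents by out-degree is what surfaces the hub models discussed below; the layout and coloring are then handled in Gephi per the Methods section.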

Force-directed network graph of AI model lineage — 3,305 nodes and 722 edges colored by category (frontier in blue, notable in green, standard in grey), with hub models like Llama 2 and Mistral 7B labeled
Figure: AI Model Lineage Network (ForceAtlas2 layout). 3,305 nodes, 722 directed edges. Node color: blue = frontier + notable, green = notable, grey = standard. The Llama 2 family (7B, 13B, 70B) anchors the largest cluster; Mistral 7B, ESM2, and SigLIP anchor biology and vision lineages respectively.

The network is highly centralized around a small number of "parent" models. The Llama 2 family alone (7B, 13B, 70B) spawned 55 documented derivatives in this dataset. Mistral 7B, released in September 2023, generated 14 within months. This is the open-source flywheel in action: a publicly released model becomes a platform that multiplies research productivity for the entire community.

Closed models appear in the network as isolated nodes or small clusters. They receive no derivatives, at least none that can be traced. This is the structural consequence of the access divide: open models compound in value; closed ones do not.

10

Conclusion

Seventy-five years of AI development, compressed into a single dataset, tells a story more nuanced than the headlines suggest. The field has not just grown; it has stratified. There are now four resource tiers, a widening gap between open and closed systems, and a geographic concentration at the frontier that coexists with genuine democratization at the mid-scale.

The efficiency story is real: post-Chinchilla models do more with less compute per parameter than their predecessors. But efficiency gains have not made the frontier cheaper to reach. They have made it possible for the frontier to advance faster, requiring even more compute to stay current.

The open-source wave of 2023 was meaningful. The model lineage network shows, concretely, how Llama 2 and Mistral generated dozens of derivatives that would otherwise not exist. Whether that wave continues, or whether proprietary models reestablish dominance at the frontier, is probably the most consequential open question in AI policy right now.

The data cannot answer that question. But it can make the stakes legible.

Dataset & Methods

Data: Epoch AI, "Data on AI Models," 2026. Available at epoch.ai/data/ai-models. Licensed under CC-BY 4.0. The cleaned dataset contains 3,305 models and 76 columns after deduplication, type coercion, primary-value extraction, and auxiliary-dataset merging. All statistical analyses use scikit-learn 1.6 and scipy 1.15. Visualizations generated with matplotlib 3.10 and seaborn 0.13. Network analysis via Gephi 0.10.