Foundation models are a revolutionary type of artificial intelligence (AI) model, distinguished by their ability to generate a wide array of outputs across diverse applications. These versatile tools can perform tasks including text generation, image creation, and audio synthesis, and can function either as independent systems or as a foundational layer for numerous other AI applications.
Researchers define foundation models based on their expansive capabilities, broad range of applications, and ability to handle diverse tasks and outputs. Some foundation models work with a single type of input, such as text, while others are multimodal, processing multiple input types like text, images, and video to produce various outputs, such as summarizing text, generating images, or answering questions.
The term “foundation model” gained prominence in 2021, thanks to research at the Stanford Institute for Human-Centered Artificial Intelligence and its Center for Research on Foundation Models. These researchers described foundation models as “models trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks.”
[Figure: The supply chain of foundation models, showing the stages and actors involved in their development and deployment.]
Foundation models power many popular applications, including OpenAI’s ChatGPT, Microsoft’s Bing, and website chatbots. They also underpin image generation tools like Midjourney and Adobe Photoshop’s generative fill tools. For instance, ChatGPT is built on the GPT-3.5 and GPT-4 families of foundation models, which also serve as the base for many other applications, such as Bing Chat and Duolingo Max.
A key characteristic of foundation models is the vast scale of data and computational resources needed to create them. They require datasets comprising billions of words or hundreds of millions of images collected from the internet. Foundation models also rely on transfer learning, applying learned patterns from one task to another.
Current foundation models can translate and summarize text, generate reports from notes, draft emails, respond to queries, and create new text, images, audio, or visual content based on prompts.
Companies downstream in the supply chain can access a foundation model to build AI applications “on top” of it, using either a local copy or an application programming interface (API). “Downstream” refers to activities that follow the foundation model’s launch and build upon it. After OpenAI launched GPT-4, it allowed companies to build products on top of it, including Microsoft’s Bing Chat, Virtual Volunteer by Be My Eyes, and educational apps like Duolingo Max and Khan Academy’s Khanmigo.
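As a concrete illustration, the sketch below shows what building “on top” of a foundation model via an API can look like, using OpenAI’s Python client. The model name, prompt, and client version are illustrative assumptions, and the same pattern applies to other providers’ APIs.

```python
# A minimal sketch of building an application "on top" of a foundation
# model via an API. Assumes the OpenAI Python client (v1+) is installed
# and an OPENAI_API_KEY environment variable is set; the model name and
# prompts are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # the foundation model the application builds on
    messages=[
        {"role": "system", "content": "You are a customer-support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(response.choices[0].message.content)
```

The downstream application supplies only the system prompt and user interface; the underlying language capability comes entirely from the foundation model behind the API.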
Foundation model providers often allow downstream companies to create customized versions through “fine-tuning”, in which the model is further trained on new data to improve its performance and capabilities on specific tasks.
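Commercial providers typically expose fine-tuning through their own APIs, but the same idea can be sketched locally with open models. The example below is a minimal, illustrative fine-tuning run using the Hugging Face Transformers library; the model, dataset, and hyperparameters are arbitrary choices for demonstration, not a recipe.

```python
# A minimal fine-tuning sketch: start from a pretrained model's weights
# and continue training on task-specific data (here, sentiment labels).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # pretrained weights, new task head

dataset = load_dataset("imdb", split="train[:1000]")  # small illustrative slice
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length"),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()  # the base model's general language knowledge transfers over
```

This is transfer learning in miniature: most of the model’s capability was learned during pretraining, and fine-tuning only adapts it to the new task.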
Foundation Models vs. Narrow AI
Foundation models differ from narrow AI models, which are designed for specific tasks and are not intended for use beyond their original purpose. Narrow AI applications are trained on specific data for a specific context and are not designed for reuse. For example, a bank’s model for predicting loan applicant default risk cannot serve as a customer service chatbot.
Both narrow AI models and foundation models can be unimodal (receiving input and generating output in a single content type, such as text or images) or multimodal (capable of receiving input and generating content across multiple modes, such as text-to-image generation or robotic arm manipulation).
Other Terms for Foundation Models
Besides “foundation model,” other terms like “generative AI” and “large language models (LLMs)” are used to describe similar models.
Generative AI
Generative AI systems generate content based on user inputs like text prompts. The content can include images, video, text, and audio. Some generative AI systems are unimodal, while others are multimodal.
Not all generative AI systems are foundation models, however: generative AI can be narrowly designed for a specific purpose. Some generative AI applications, like OpenAI’s DALL·E or Midjourney, are built on top of foundation models, using natural language text prompts to generate images. Generative AI capabilities include text manipulation and image, video, and speech generation, with applications including chatbots, photo and video filters, and virtual assistants. Nor are generative AI tools new or always built on top of foundation models: generative adversarial networks (GANs), which power many Instagram photo filters and deepfake technologies, have been in use since 2014.
Large Language Models (LLMs)
Language models are AI systems trained on text data that generate natural language responses to inputs or prompts. They are trained on “text prediction tasks”: predicting the likelihood of a character, word, or string given its surrounding context. This capability powers autocomplete features in applications like SMS messaging, Google Docs, and Microsoft Word.
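The sketch below shows what text prediction means in practice, using the small, openly downloadable GPT-2 model via Hugging Face Transformers (chosen purely for illustration; production LLMs are far larger): given a context string, the model assigns a probability to every possible next token.

```python
# A sketch of the text prediction task: given a context, ask a language
# model how likely each candidate next token is.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The capital of France is"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores over the whole vocabulary

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}: {prob.item():.3f}")
```

Generating text is just this step applied repeatedly: sample or pick a likely next token, append it to the context, and predict again.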
Large language models (LLMs) typically have hundreds of millions (or billions) of parameters and are pretrained on billions of words of text, usually with a transformer neural network architecture. LLMs perform a wide range of text-based tasks, such as question-answering, translation, and summarization. Increasingly, these models are multimodal, accepting multiple input types and generating multiple output types. For example, GPT-4 can take both text and images as input.
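For open models, the scale claim is easy to check directly. The snippet below (an illustrative sketch, again assuming the Hugging Face Transformers library) counts the parameters of GPT-2 “small”, which has roughly 124 million; current frontier LLMs are reported to be orders of magnitude larger.

```python
# Counting the trainable parameters of an openly available language model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # GPT-2 "small"
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # roughly 124 million
```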
Google’s PaLM-E, an embodied multimodal language model, is capable of visual tasks (such as describing images, detecting objects, or classifying scenes) and robotics tasks (such as moving a robot through space and manipulating objects with a robotic arm).
More Contested Terms
Frontier Models
“Frontier models” are a type of AI model within the broader category of foundation models. The term is used by industry, policymakers, and regulators. While there is no consistent definition, it generally refers to cutting-edge, powerful models with newer or better capabilities than other foundation models. The computational resources needed to train a model are sometimes used as a proxy for whether it qualifies as a frontier model.
Artificial General Intelligence (AGI)
Artificial general intelligence (AGI) and “strong” AI refer to AI systems capable of any task a human could undertake. These are contested terms as they describe an aspirational rather than a current AI capability. OpenAI and Google DeepMind have stated ambitions to build AGI, but no current AI models can be defined as AGI.
Researchers at Microsoft define AGI as “systems that demonstrate broad capabilities of intelligence, including reasoning, planning, and the ability to learn from experience, and with these capabilities at or above human-level.”
ISO/IEC defines AGI as “A type of AI system that addresses a broad range of tasks with a satisfactory level of performance” and “systems that not only can perform a wide variety of tasks, but all tasks that a human can perform.”
AGI contrasts with most AI systems today, which are “artificial narrow intelligence” (ANI) or “weak” AI, trained to perform specific tasks within a predefined environment.
Conclusion
Foundation models represent a significant advancement in AI, offering versatility and broad applicability. Understanding what a foundation model is, including its characteristics, its differences from narrow AI, and related terminology, is crucial for navigating the evolving landscape of artificial intelligence.