
Towards AGI - ChatGPT and GPT-4

19 Mar 2023 . category: tech .
#chatgpt #gpt #agi #llm #nlp #deep-learning #transformer #rlhf

With the development of modern LLMs (i.e. Large Language Models), we've reached a turning point where carefully trained models are more knowledgeable than most human beings in many regards. The emergence of ChatGPT and the recently published GPT-4 sheds light on a feasible way to build Artificial General Intelligence. In this post, I would like to briefly introduce the GPT series and discuss their current limitations and future applications.

GPT-1: Improving Language Understanding by Generative Pre-Training

This is one of the first works to bring the pre-training and fine-tuning paradigm to Transformer-based models in NLP. Published by OpenAI in 2018, GPT-1 demonstrated that a Transformer decoder trained on a large corpus of unlabeled text could learn useful linguistic representations. The key insight was simple yet powerful: first, pre-train a language model to predict the next token on a massive amount of text (BooksCorpus, ~7000 unpublished books); then, fine-tune the model on specific downstream tasks with a small amount of labeled data.
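To make the pre-training objective concrete, here is a minimal sketch of the next-token prediction loss in plain Python. The function names and the toy logits are illustrative assumptions, not code from the GPT-1 paper; a real model would compute the logits with a Transformer.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average negative log-likelihood of each token given the tokens before it.

    logits_per_position[t] holds the model's scores for the token at position t + 1,
    so the targets are token_ids[1:].
    """
    targets = token_ids[1:]
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens and a 3-token sequence (all values are made up).
toy_logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]]
toy_tokens = [0, 0, 2]
loss = next_token_loss(toy_logits, toy_tokens)
```

Fine-tuning reuses the same network but swaps in a task-specific loss on the labeled data.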

The architecture was a 12-layer Transformer decoder with 117M parameters. Despite being relatively small by today's standards, GPT-1 achieved state-of-the-art results on 9 out of 12 NLP benchmarks at the time, including natural language inference, question answering, and text classification. The takeaway: unsupervised pre-training + supervised fine-tuning is a recipe that works remarkably well.
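As a sanity check on the 117M figure, the parameter count can be estimated with some back-of-the-envelope arithmetic. The hidden size of 768, the vocabulary of roughly 40,000 entries and the 512-token context are taken from the GPT-1 paper rather than from this post, so treat them as assumptions; biases and LayerNorm parameters are ignored.

```python
def approx_decoder_params(n_layers, d_model, vocab_size, n_positions):
    """Rough parameter count for a decoder-only Transformer (biases and LayerNorm ignored)."""
    attention = 4 * d_model * d_model           # Q, K, V and output projections
    feed_forward = 2 * d_model * (4 * d_model)  # two linear maps with a 4x hidden width
    per_layer = attention + feed_forward        # = 12 * d_model^2
    embeddings = vocab_size * d_model + n_positions * d_model
    return n_layers * per_layer + embeddings

# Assumed GPT-1 hyperparameters (not stated in this post).
total = approx_decoder_params(n_layers=12, d_model=768, vocab_size=40_000, n_positions=512)
print(f"{total / 1e6:.1f}M parameters")  # prints "116.0M parameters"
```

The estimate lands within about 1% of the quoted 117M. Most of the weights sit in the 12 Transformer blocks, with the token embeddings contributing roughly a quarter.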

GPT-2: Language Models are Unsupervised Multitask Learners

In 2019, OpenAI scaled things up. GPT-2 used 1.5B parameters (roughly 10x GPT-1) and was trained on WebText, a dataset of ~8 million web pages curated by following outbound links from Reddit posts with at least 3 karma. The quality of training data turned out to matter a great deal.

The most exciting finding of GPT-2 was zero-shot task transfer: without any fine-tuning, the model could perform tasks it was never explicitly trained for — summarization, translation, and question answering — simply by conditioning on natural language prompts. For example, feeding it "TL;DR:" after a passage would produce a reasonable summary. This hinted at an intriguing property: sufficiently large language models can implicitly learn to perform tasks as a byproduct of learning to predict the next word.

GPT-2 also made headlines for a different reason — OpenAI initially chose not to release the full model, citing concerns about potential misuse for generating fake news and spam. This sparked an important debate about responsible AI release practices that continues to this day.

GPT-3: Language Models are Few-Shot Learners

GPT-3, published in 2020, was a massive leap. With 175B parameters (over 100x GPT-2) trained on a filtered version of Common Crawl plus books and Wikipedia, GPT-3 demonstrated a remarkable capability: in-context learning. Instead of fine-tuning the model on task-specific data, you could simply provide a few examples in the prompt (few-shot), a single example (one-shot), or just a task description (zero-shot), and the model would figure out what to do.
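The difference between zero-shot, one-shot and few-shot prompting comes down to how many demonstrations are placed in front of the query. A minimal sketch, assuming a hypothetical "Input:/Output:" layout (GPT-3 does not require any particular format):

```python
def build_prompt(task_description, examples, query):
    """Assemble an in-context learning prompt.

    examples is a list of (input, output) pairs:
    zero pairs -> zero-shot, one pair -> one-shot, several -> few-shot.
    """
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

demos = [("cheese", "fromage"), ("sea otter", "loutre de mer")]
zero_shot = build_prompt("Translate English to French.", [], "peppermint")
few_shot = build_prompt("Translate English to French.", demos, "peppermint")
print(few_shot)
```

No weights are updated in any of these settings: the model sees the demonstrations only as part of its input and continues the text after the final "Output:".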

Technically, GPT-3 followed the same Transformer decoder architecture as its predecessors but introduced alternating dense and locally banded sparse attention patterns in certain layers. The real breakthrough, however, was the scaling law: the authors observed smooth, predictable improvements in performance as model size, data, and compute increased. This empirical finding has since become a guiding principle for the field — sometimes you don't need a new algorithm, you just need to scale up.
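"Smooth, predictable improvements" usually means the loss follows a power law in model size, which shows up as a straight line on a log-log plot. The sketch below fits such a line with ordinary least squares. The data is synthetic, generated from a made-up exponent, and is not a measurement from the GPT-3 paper.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha) via a straight line in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic data from a known power law; these are NOT measured GPT-3 results.
true_a, true_alpha = 10.0, 0.08
sizes = [1e8, 1e9, 1e10, 1e11]
losses = [true_a * n ** (-true_alpha) for n in sizes]

a_hat, alpha_hat = fit_power_law(sizes, losses)
predicted = a_hat * (1e12) ** (-alpha_hat)  # extrapolate one order of magnitude further
```

Once the exponent has been estimated from small runs, the same line can be extended to predict the loss of a larger model before paying to train it.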

GPT-3 also revealed some uncomfortable truths. The model could generate biased, toxic, or factually incorrect content. It had no mechanism to say "I don't know." And its few-shot performance, while impressive, was brittle — small changes in prompt formatting could lead to wildly different outputs.

InstructGPT and ChatGPT: Aligning Language Models with Human Intent

The jump from GPT-3 to ChatGPT is arguably more about alignment than architecture. InstructGPT (early 2022) introduced a three-step recipe that would prove transformative:

1. Supervised Fine-Tuning (SFT): Train the model on high-quality demonstrations written by human labelers.
2. Reward Modeling: Train a separate model to predict which of two outputs a human would prefer.
3. Reinforcement Learning from Human Feedback (RLHF): Use Proximal Policy Optimization (PPO) to fine-tune the language model against the reward model.
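The second and third steps each boil down to a short formula. Below is a minimal sketch, assuming scalar reward scores and per-sample log-probabilities as inputs; the function names and the default beta are illustrative, and the full PPO update is omitted.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred output scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

def kl_penalized_reward(reward, logprob_policy, logprob_reference, beta=0.02):
    """Signal used in the RL step: reward-model score minus a penalty for drifting
    away from the supervised (SFT) model. beta is an illustrative value."""
    return reward - beta * (logprob_policy - logprob_reference)

# The reward model is pushed to rank the human-preferred output higher.
good_ranking = pairwise_reward_loss(reward_chosen=2.0, reward_rejected=0.0)
bad_ranking = pairwise_reward_loss(reward_chosen=0.0, reward_rejected=2.0)
```

The penalty term keeps the policy close to the SFT model, which discourages it from exploiting quirks of the reward model with degenerate text.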

The result was striking: a 1.3B parameter InstructGPT model was preferred by human evaluators over the 175B GPT-3. This showed that alignment techniques could be more impactful than raw scale. The model became more helpful, less likely to produce harmful content, and much better at following instructions.

ChatGPT, released in November 2022, applied similar techniques to GPT-3.5 (a refined version of GPT-3) and wrapped it in a conversational interface. The rest is history: it reportedly reached an estimated 100 million monthly users within about two months, which made it the fastest-growing consumer application on record at the time. Suddenly, the general public could experience the power of large language models first-hand, and the AI discourse shifted from academic circles to dinner table conversations worldwide.

GPT-4: A Multimodal Leap

Released in March 2023, GPT-4 represents a significant step forward in several dimensions. While OpenAI disclosed very few technical details (no architecture specifics, no parameter count, no training data details — a controversial departure from the earlier GPT papers), the capabilities speak for themselves.

The most notable advancement is multimodality: GPT-4 can accept both text and images as input (though it outputs only text). It can describe images, answer questions about visual content, and reason about diagrams. This opens up a whole new range of applications, from assisting visually impaired users to analyzing medical images and interpreting complex charts.

In terms of reasoning, GPT-4 shows substantial improvement. According to OpenAI's report, it passes a simulated bar exam with a score around the top 10% of test takers (GPT-3.5 scored around the bottom 10%), and it performs well on various academic and professional exams. It is also reported to be better calibrated in its confidence, that is, better at knowing when it doesn't know, though it still hallucinates.

Under the hood, rumor and analysis suggest GPT-4 uses a Mixture-of-Experts (MoE) architecture, where multiple smaller "expert" networks are selectively activated for different inputs. This allows for greater model capacity without proportional increases in computation per token. While unconfirmed by OpenAI, this approach would explain how GPT-4 achieves superior performance while maintaining reasonable inference costs.
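For readers unfamiliar with the idea, here is a minimal sketch of top-k expert routing for a single input. It illustrates the general Mixture-of-Experts technique only; nothing here reflects GPT-4's actual (undisclosed) design, and the toy experts and gate scores are made up.

```python
import math

def softmax(scores):
    """Normalise scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_scores, top_k=2):
    """Route one input through only the top_k highest-scoring experts.

    experts: list of callables; gate_scores: one router score per expert.
    Returns the gate-weighted sum of the selected experts' outputs and their indices.
    """
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])  # renormalise over the chosen experts
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen

# Four toy "experts" that just scale their input; a real expert would be a feed-forward network.
toy_experts = [lambda x, k=k: k * x for k in (1.0, 2.0, 3.0, 4.0)]
output, chosen = moe_forward(x=1.0, experts=toy_experts, gate_scores=[0.1, 2.0, 0.3, 2.0], top_k=2)
```

Only two of the four experts run for this input, so compute per input scales with top_k while total capacity scales with the number of experts.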

Current Limitations

Despite the rapid progress, current LLMs still face fundamental challenges:

Hallucination: LLMs generate fluent but sometimes fabricated information with high confidence. They have no grounded understanding of truth — they model statistical patterns of language, not facts about the world. This is perhaps the most critical barrier to deployment in high-stakes domains like medicine and law.
Reasoning: While GPT-4 shows improved reasoning, LLMs still struggle with multi-step logical reasoning, mathematical proofs, and tasks requiring systematic planning. They are pattern matchers at heart, and their “reasoning” can break down in novel situations that fall outside their training distribution.
Context window: Current models have limited context windows (e.g., 4K-32K tokens), which restrict their ability to process long documents or maintain very long conversations. Though this is being actively addressed (GPT-4 already offers a 32K token variant), it remains a practical limitation.
Temporal knowledge: LLMs have a knowledge cutoff and cannot access real-time information. They don’t know what happened yesterday unless augmented with external tools or retrieval mechanisms.
Cost and efficiency: Training and serving these models require enormous computational resources. GPT-4 training reportedly cost over $100M. This creates barriers to entry and raises questions about environmental impact and equitable access.
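One common workaround for the context-window limit listed above is to split a long document into overlapping chunks that each fit the window. A minimal sketch, assuming the text has already been tokenized into a list; the function name and the overlap parameter are illustrative choices.

```python
def split_into_windows(tokens, window_size, overlap=0):
    """Split a token sequence into chunks of at most window_size tokens.

    Consecutive chunks share `overlap` tokens so that text near a boundary keeps some context.
    """
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    step = window_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return windows

document = list(range(10))  # stand-in for token ids
windows = split_into_windows(document, window_size=4, overlap=1)
# windows == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk is then processed separately, so anything that depends on information spread across distant chunks still needs a second pass, such as summarizing the per-chunk results.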

Looking Ahead

The trajectory from GPT-1 to GPT-4 — spanning just five years — is remarkable. We've gone from a 117M parameter model that needed fine-tuning for every task, to a multimodal system that can pass professional exams and write working code from a screenshot of a hand-drawn sketch.

Several directions seem particularly promising:

Tool use and augmentation: Teaching LLMs to use external tools (calculators, search engines, code interpreters) to compensate for their inherent limitations. This is already happening with ChatGPT plugins and could dramatically expand what these systems can do.
Multimodal expansion: GPT-4’s vision capabilities are just the beginning. Future models may seamlessly integrate text, images, audio, video, and other modalities, moving closer to how humans perceive and interact with the world.
Alignment and safety: As these models become more capable, ensuring they remain aligned with human values becomes increasingly important. Techniques like RLHF are a good start, but we likely need more robust approaches as capabilities grow.
Domain-specific applications: Adapting foundation models for specialized fields (science, medicine, law, engineering) where they can serve as powerful assistants to domain experts, rather than replacing them.
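The tool-use direction listed above can be illustrated with a toy calculator. In this sketch the model is assumed to have been prompted to write CALC(...) whenever it needs arithmetic, and the host program fills in the result. The CALC marker and both function names are hypothetical; this is not how ChatGPT plugins are specified.

```python
import ast
import operator
import re

# Arithmetic operators the calculator tool is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculate(expression):
    """Evaluate +, -, *, / over numbers without calling eval()."""
    def visit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        raise ValueError("unsupported expression")
    return visit(ast.parse(expression, mode="eval").body)

# Hypothetical marker; nested parentheses inside CALC(...) are not supported in this sketch.
TOOL_CALL = re.compile(r"CALC\(([^()]*)\)")

def resolve_tool_calls(model_output):
    """Replace each CALC(...) marker in the model's text with the computed value."""
    return TOOL_CALL.sub(lambda m: str(calculate(m.group(1))), model_output)

answer = resolve_tool_calls("The total is CALC(17 * 23 + 4).")
# answer == "The total is 395."
```

The arithmetic is done by ordinary code rather than by the model, which sidesteps one of the reasoning weaknesses described earlier.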

Are we on the path to AGI? It's hard to say. What's clear is that we've entered a new era where AI can understand and generate human language at a level that would have seemed like science fiction just a few years ago. The challenge now is to harness this capability responsibly, address the real limitations, and build systems that genuinely benefit humanity. The journey has just begun.

Xin Guo is a Principal Research Scientist at the Shanghai AI Laboratory, focusing on AI for life sciences, multimodal scientific foundation models, scientific agentic AI, and automated labs for closed-loop discovery. His work aims to build AI systems that connect biological data, scientific reasoning, and experimental discovery. He has studied and worked in Switzerland, Germany, and the UK. Outside research, he enjoys coffee, Swiss chocolate, badminton, and basketball.