Today, Meta announced the release of Llama 3.1 405B, a model containing 405 billion parameters. Parameters roughly correspond to a model's problem-solving ability, and models with more parameters generally perform better than those with fewer. Llama 3.1 405B is not the largest open-source model ever released, but it is the biggest in recent years. Trained on 16,000 Nvidia H100 GPUs, it was built with newer training and development techniques that Meta claims make it competitive with leading proprietary models such as OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet (with some caveats).
New and Improved
Like previous Meta models, Llama 3.1 405B is available for download and for use on cloud platforms such as AWS, Azure, and Google Cloud. It also powers a chatbot experience on WhatsApp and Meta.ai for U.S.-based users. Llama 3.1 405B can perform a range of tasks, from programming and answering basic math questions to summarizing documents in eight languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai). It works only with text, meaning it can't, for example, answer questions about an image, but it can handle most text-based tasks, like parsing PDF files and spreadsheets.
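For a sense of what those text tasks look like in practice, here is a minimal sketch of document summarization using Hugging Face's transformers library and the smaller Llama 3.1 8B Instruct checkpoint (the model ID matches Meta's gated Hugging Face release, which requires approved access; the input file name is a placeholder):

```python
# Minimal sketch: document summarization with a Llama 3.1 chat model via
# Hugging Face transformers. The model is gated on Hugging Face, so access
# must be requested from Meta first.
import torch
import transformers

pipe = transformers.pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # gated model ID
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

document = open("report.txt").read()  # placeholder: any long text file

messages = [
    {"role": "system", "content": "You are a concise summarizer."},
    {"role": "user", "content": f"Summarize this document in five bullet points:\n\n{document}"},
]

out = pipe(messages, max_new_tokens=300)
# The pipeline returns the full chat; the last message is the model's reply.
print(out[0]["generated_text"][-1]["content"])
```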
Meta wants to make it clear that it is experimenting with multimodality. In a paper published today, the company’s researchers write that they are actively developing Llama models that can recognize images and videos and understand (and generate) speech. However, these models are not yet ready for public release.
To train Llama 3.1 405B, Meta used a dataset of 15 trillion tokens with data extending through 2024 (tokens are subword pieces of text that are easier for models to process than whole words; at a typical ratio of about 0.75 English words per token, 15 trillion tokens works out to a mind-boggling 11 trillion or so words). This isn't a new training set per se, as Meta used it to train previous Llama models, but the company says that in developing this model, it improved its data curation pipelines and adopted "more rigorous" approaches to data quality and filtering.
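The words-per-token ratio depends on the tokenizer and the language, but it is easy to measure empirically. A minimal sketch, assuming access to the tokenizer from Meta's gated Hugging Face release (any subword tokenizer illustrates the same point):

```python
# Sketch: compare token count to word count for a sample text.
# Assumes access to a Llama tokenizer on Hugging Face; any subword
# tokenizer demonstrates the same words-per-token behavior.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

text = "Tokenizers split rare words like 'fantastic' into smaller pieces."
ids = tok.encode(text, add_special_tokens=False)

n_words = len(text.split())
n_tokens = len(ids)
print(f"{n_words} words -> {n_tokens} tokens "
      f"(~{n_words / n_tokens:.2f} words per token)")
# English prose usually lands near 0.75 words per token, which is how
# 15 trillion tokens translates to roughly 11 trillion words.
```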
The company also used synthetic data (data generated by other AI models) to fine-tune Llama 3.1 405B. Most major AI providers, including OpenAI and Anthropic, are exploring the use of synthetic data to scale up AI training, but some experts believe synthetic data should be used only as a last resort because of its potential to amplify model bias.
Meta insists it is "carefully weighing" Llama 3.1 405B's training data, but declined to say exactly where the data came from (beyond web pages and public web files). Many generative AI vendors view training data as a competitive advantage and keep it, and any information about it, under lock and key. Training-data details are also a potential source of intellectual property lawsuits, another reason companies are reluctant to disclose much.
Meta researchers write that compared to previous Llama models, Llama 3.1 405B was trained on more non-English data (to improve performance in non-English languages), more “math data” and code (to improve the model’s mathematical thinking skills), and the latest web data (to improve knowledge of current events).
A recent Reuters report revealed that Meta at one point used copyrighted e-books to train AI, despite warnings from its lawyers. Additionally, the company trains its AI on Instagram and Facebook posts, photos, and captions, making it difficult for users to opt out. Moreover, Meta, along with OpenAI, is the subject of a lawsuit filed by authors including comedian Sarah Silverman over the companies’ alleged unauthorized use of copyrighted data to train models.
“Training data, in many ways, is kind of the secret recipe and sauce that goes into building these models,” Ragavan Srinivasan, vice president of AI program management at Meta, told TechCrunch. “From our point of view, we have invested a lot in this. And it will be one of those things that we continue to improve.”
More Context and Tools
Llama 3.1 405B has a larger context window than previous Llama models: 128,000 tokens, or roughly 100,000 words of text. A model's context, or context window, is the input (such as text) that the model considers before generating output (such as additional text).
One advantage of models with large context windows is that they can summarize longer texts and files. When powering chatbots, such models are also less likely to forget topics that were recently discussed.
Two other new, more compact models Meta announced today, Llama 3.1 8B and Llama 3.1 70B (updated versions of the Llama 3 8B and Llama 3 70B released in April), also feature 128,000-token context windows. The previous models' context topped out at 8,000 tokens, which makes this update quite significant, assuming the new Llama models can reason effectively across all of that context.
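In practice, developers guard against overflowing a context window by counting tokens before sending a request. A minimal sketch along those lines, again assuming the Hugging Face tokenizer and using a placeholder input file:

```python
# Sketch: check whether a document fits in a 128,000-token context window
# before sending it to the model. Token counts are tokenizer-specific.
from transformers import AutoTokenizer

CONTEXT_WINDOW = 128_000     # Llama 3.1 context size, per Meta
RESERVED_FOR_OUTPUT = 2_000  # leave room for the generated reply

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

def fits_in_context(text: str) -> bool:
    n = len(tok.encode(text))
    print(f"{n:,} tokens of {CONTEXT_WINDOW:,} available")
    return n <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

document = open("long_report.txt").read()  # placeholder file
if not fits_in_context(document):
    # Oversized inputs have to be split and summarized in chunks instead.
    print("Document too long; summarize in chunks.")
```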
All Llama 3.1 models can use third-party tools, applications, and APIs to perform tasks, much like competing models from Anthropic and OpenAI. Out of the box, they are trained to use Brave Search to answer questions about recent events, the Wolfram Alpha API for math and science queries, and a Python interpreter for validating code. Additionally, Meta claims that Llama 3.1 models can use some tools they haven't encountered before, to a certain extent.
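Meta's published prompt template for Llama 3.1 activates these built-in tools by declaring them in the system message. The sketch below shows the general shape of that format; the special tokens and call syntax follow Meta's model-card documentation, but treat the details as illustrative rather than authoritative:

```python
# Sketch: the Llama 3.1 prompt format for built-in tool use, per Meta's
# published template (exact special tokens treated as illustrative here).
# Declaring tools in the system header tells the model it may emit a
# tool call instead of a plain-text answer.
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Environment: ipython\n"
    "Tools: brave_search, wolfram_alpha\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "What is the weather in Menlo Park today?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# A tool-using completion looks roughly like:
#   <|python_tag|>brave_search.call(query="Menlo Park weather today")<|eom_id|>
# The calling application executes the search, appends the result to the
# conversation, and asks the model to continue from there.
```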
If the benchmarks are to be believed (and benchmarks aren't the be-all and end-all of generative AI), Llama 3.1 405B is a very capable model indeed. That would be welcome, considering some of the painfully obvious shortcomings of the previous generation of Llama models.
According to human evaluators Meta hired, Llama 3.1 405B performs on par with OpenAI's GPT-4 and shows "mixed results" against GPT-4o and Claude 3.5 Sonnet. While Llama 3.1 405B is better at executing code and generating plots than GPT-4o, its multilingual capabilities are generally weaker, and it lags behind Claude 3.5 Sonnet in programming and general reasoning.
Also, due to its size, Llama 3.1 405B requires beefy hardware to run; Meta recommends at least a full server node.
This may be why Meta is promoting its smaller new models, the Llama 3.1 8B and Llama 3.1 70B, for general-purpose applications such as chatbots or code generation. Llama 3.1 405B, the company says, is better suited for model distillation—the process of transferring knowledge from a large model to a smaller, more efficient model—and creating synthetic data to train (or fine-tune) alternative models.
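Concretely, the synthetic-data workflow means prompting the 405B "teacher" model to produce labeled examples and fine-tuning a smaller "student" model on them. A rough sketch, assuming the model is reachable through Hugging Face's InferenceClient (hosted access and an API token are required; the model ID mirrors Meta's release):

```python
# Sketch: using Llama 3.1 405B as a "teacher" to generate synthetic
# training data for a smaller "student" model. Assumes hosted access to
# the model through Hugging Face's InferenceClient.
import json
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3.1-405B-Instruct")

PROMPT = (
    "Write one question about {topic} and a correct, well-explained answer. "
    "Return only JSON with the keys 'question' and 'answer'."
)

topics = ["binary search", "unit testing", "SQL joins"]
with open("synthetic_train.jsonl", "w") as f:
    for topic in topics:
        reply = client.chat_completion(
            messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
            max_tokens=400,
        )
        # Assumes the model returned valid JSON as instructed.
        record = json.loads(reply.choices[0].message.content)
        f.write(json.dumps(record) + "\n")

# The resulting JSONL file can then be used to fine-tune a smaller model
# such as Llama 3.1 8B.
```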
To encourage the use of synthetic data, Meta says it has updated the Llama license to let developers use the outputs of Llama 3.1 models to develop third-party generative AI models (whether that is advisable is debatable). Importantly, the license still constrains how developers can deploy Llama models: developers of apps with more than 700 million monthly users must request a special license from Meta, which the company grants at its discretion.
This change in output licensing, which addresses a major criticism of Meta models in the AI community, is part of the company’s aggressive push into generative AI.
With the Llama 3.1 family, Meta is releasing what it calls a "reference system" and new safety tools (some of which block prompts that could cause Llama models to behave unpredictably or undesirably) to encourage developers to use Llama in more places. The company is also previewing and soliciting comments on Llama Stack, an upcoming API for tools that can be used to fine-tune Llama models, generate synthetic data with Llama, and build agentic apps: Llama-powered applications that can take actions on a user's behalf.
“We have repeatedly heard from developers that they want to know how to deploy Llama models to production,” says Srinivasan. “So we try to give them a lot of different tools and opportunities.”
Game for Market Share
In an open letter published this morning, Meta CEO Mark Zuckerberg laid out a vision for a future in which AI tools and models reach the hands of more developers around the world, giving people access to the "benefits and capabilities" of AI. The framing sounds selfless, but implicit in the letter is Zuckerberg's preference that those tools and models be Meta's.
Meta is looking to catch up with companies like OpenAI and Anthropic, and it is using a proven strategy: give away the tools for free to grow an ecosystem, then gradually layer in products and services, including paid ones. Spending billions of dollars on models it can then commoditize also drives down competitors' prices and spreads Meta's version of AI widely. And it lets the company fold improvements from the open-source community into its future models.
Llama is certainly attracting developers' attention. Meta says Llama models have been downloaded more than 300 million times, and more than 20,000 Llama-derived models have been created so far.
But make no mistake: Meta is playing for keeps. It is spending millions lobbying regulators to come around to its preferred flavor of "open" generative AI. None of the Llama 3.1 models solves the intractable problems of today's generative AI technology, such as its tendency to hallucinate and to regurgitate problematic training data. But they do advance one of Meta's key goals: becoming synonymous with generative AI.
That ambition comes at a cost. In the research paper, the co-authors, echoing Zuckerberg's recent comments, discuss the reliability challenges of training Meta's ever-larger generative AI models, particularly around power consumption. "During training, tens of thousands of GPUs may increase or decrease power consumption at the same time, for example, due to all GPUs waiting for checkpointing or collective communications to finish, or the startup or shutdown of the entire training job," they write. "When this happens, it can result in instant fluctuations of power consumption across the data center on the order of tens of megawatts, stretching the limits of the power grid. This is an ongoing challenge for us as we scale training to future, even larger Llama models."