
The Pile: An 800GB Dataset of Diverse Text for Language Modeling

What is the Pile?

The Pile is an 825 GiB diverse, open-source
language modeling dataset that consists of 22 smaller,
high-quality datasets combined together.


Have a model that uses or evaluates on the Pile?
Let us know!

Why is the Pile a good training set?

Recent work has shown that, especially for large models, diversity
in data sources improves the model's general cross-domain knowledge
as well as its downstream generalization capability. In our
evaluations, models trained on the Pile not only show moderate
improvements on traditional language modeling benchmarks but
also show significant improvements on Pile BPB.

Why is the Pile a good benchmark?

To score well on Pile BPB (bits per byte), a model must be able to
understand many disparate domains, including books, GitHub
repositories, webpages, chat logs, and medical, physics, math,
computer science, and philosophy papers. Pile BPB is a measure of
world knowledge and reasoning ability in these domains, making it
a robust benchmark of general, cross-domain text modeling ability
for large language models.
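
The page does not spell out how bits per byte is computed, so the following is only a minimal sketch under the usual convention: sum the model's negative log-likelihood over a text in nats, divide by the number of UTF-8 bytes in that text, and convert nats to bits by dividing by ln 2. The function names and arguments are illustrative assumptions, not part of any Pile tooling.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte.

    total_nll_nats: sum of -log p(token) over every token of the text.
    num_bytes: length of the same text in UTF-8 bytes.
    """
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (num_bytes * math.log(2))


def bits_per_byte_from_token_loss(
    mean_loss_nats: float, num_tokens: int, num_bytes: int
) -> float:
    """Same conversion, starting from a mean per-token loss (hypothetical helper)."""
    return bits_per_byte(mean_loss_nats * num_tokens, num_bytes)


if __name__ == "__main__":
    text = "The Pile"
    n_bytes = len(text.encode("utf-8"))
    # A model that assigns uniform probability 1/256 to every byte
    # pays ln(256) nats per byte, which is about 8 bits per byte.
    uniform_nll = n_bytes * math.log(256)
    print(bits_per_byte(uniform_nll, n_bytes))
```

Normalizing by bytes rather than tokens keeps the score comparable across models with different tokenizers, which is presumably why a byte-level metric suits a cross-model leaderboard.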


* indicates potential test-set overlap. Zero-shot indicates that
not all of the components of the Pile were present in the training
data.
Rank  Model  Test BPB

Jan 1, 2021  GPT-3 (Zero-Shot)*
Jan 1, 2021  GPT-2 (Zero-Shot)*
