Pulpie: Pareto-Optimal Models for Cleaning the Web
We’re introducing Pulpie, a family of Pareto-optimal models for extracting main content from HTML pages. Pulpie approaches SOTA extraction quality at one twentieth the cost.
Our smallest model, pulpie-orange-small, scores 0.862 ROUGE-5 F1 on WebMainBench. This effectively matches Dripper, the leading extractor, which scores 0.864. Pulpie does so at roughly a third of the size: 210M parameters versus Dripper’s 600M.
The gains come from architecture. Pulpie is an encoder that labels every HTML block as content or boilerplate in a single forward pass. This also makes it fast.
On an NVIDIA L4 GPU, pulpie-orange-small processes 13.7 pages/sec against Dripper’s 0.68 pages/sec. At $0.39/hr for an L4 instance, cleaning 1 billion pages costs $7,900 with Pulpie and $159,000 with Dripper.
Pulpie unlocks high-quality web extraction at a scale that was previously impossible. We expect this to benefit pre-training and context management.
Our models are open source and available on Hugging Face. See Get started for instructions.
Extraction is the bottleneck
Language models consume the web twice. First in pre-training, where they learn about the world. Then at inference, when they pull in relevant context. Both times the input is mostly noise. During discovery, we found 70% of the blocks on a typical HTML page hold boilerplate like navigation, ads, sidebars, and footers. Main content is only a small fraction of the page.
However, that fraction determines model quality on both ends.
AICC (Ma et al., 2025) measured the effect of cleaner extraction on pre-training. The team built two corpora from the same Common Crawl snapshot. One extracted content with heuristics. The other extracted it with a model-based parser. Everything else in the data pipeline remained equal. They then trained an identical model on each corpus.
The model trained on the model-extracted corpus scored 1.08 percentage points higher in average accuracy across 13 benchmarks. Since only extraction logic changed, we can attribute the gain entirely to having cleaner data.
Impressively, the same model also beat models trained on FineWeb and RefinedWeb, two of the most heavily filtered pre-training corpora. These datasets have earned their reputations through elaborate filtering and deduplication. Beating them by improving the extractor illustrates the high value of clean data.
Beyond setting a low baseline, poor extraction materially harms models. Heuristics break structured content. The table below shows how Trafilatura and model-based extractors compare on preserving code blocks and formulas. Low similarity scores indicate corruption. Models trained on the corrupted output will inherit this damage.
| Content | Trafilatura (heuristic) | Model-based |
|---|---|---|
| Code blocks | 0.13 | 0.91 |
| Formulas | 0.61 | 0.94 |
Data quality matters at inference too. Shi et al. (ICML 2023) showed that a single irrelevant passage is enough to derail a model’s answer. A model is more accurate and more efficient when its context is free of noise.
Cleaning on a budget
Cleaning the web pays off in both training and inference. The open question is how to clean well at scale.
First, to understand the landscape, we can divide current extractors into two families based on the question: Does the method read the page, or inspect its structure?
Structure-based extractors judge an HTML block by surface signals. They apply rules over tags, DOM, and text density to separate content from boilerplate. Trafilatura, Readability, and magic-html work this way. Boilerpipe goes one step further and trains a classifier on those same signals. These extractors are easy to run but they confuse similarly built elements. A navigation table and a data table look identical to an algorithm counting cells.
Reading extractors feed the page to a transformer and label each block based on its content. Dripper is a decoder built on this idea. The decoder emits labels one token at a time. Each label forces the full model to be read from memory for a single step of work. This ties speed to memory bandwidth and makes runs expensive.
Pulpie keeps the reading approach but moves the bottleneck to compute. We do this by using an encoder architecture that labels every block in a single forward pass. This enables Pulpie to match Dripper’s quality while being smaller, faster, and cheaper.
Figure: Quality vs. cost of web content extraction.
Depulping raw HTML
The full pipeline runs in four stages:
- Simplify the HTML. Remove scripts, styles, and other formatting noise. Tag each block with a unique ID.
- Chunk the blocks. Split the blocks, tokenize them, and pack them into chunks of at most 8,192 tokens, so each chunk fits the model in one pass. About 80% of pages fit in a single chunk.
- Classify. Run a forward pass. Pulpie labels each block as content or boilerplate.
- Return. Return the kept blocks as HTML, or convert them to Markdown.
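The chunking stage above can be sketched as a greedy packer. This is a minimal illustration, not Pulpie's actual implementation: the function name `pack_blocks` is hypothetical, token counts are assumed to be computed elsewhere, separator tokens are ignored, and a block larger than the budget is simply capped (an assumption; the real pipeline may split such blocks instead).

```python
from typing import List, Sequence

MAX_TOKENS = 8192  # chunk budget stated in the text


def pack_blocks(
    block_token_counts: Sequence[int], max_tokens: int = MAX_TOKENS
) -> List[List[int]]:
    """Greedily pack consecutive block indices into chunks within the budget.

    Hypothetical helper: takes the token count of each block in page order
    and returns lists of block indices, one list per chunk.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, n_tokens in enumerate(block_token_counts):
        # Assumption: an oversized block is capped at the budget so it
        # still occupies exactly one chunk.
        n_tokens = min(n_tokens, max_tokens)
        # Start a new chunk when the next block would overflow this one.
        if current and used + n_tokens > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(idx)
        used += n_tokens
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # A short page fits in one chunk; a long one spills into several.
    print(pack_blocks([1200, 300, 4500]))
    print(pack_blocks([4000, 4000, 500, 8192, 10]))
```

Keeping blocks in page order means each chunk is a contiguous slice of the page, so the classifier sees neighbouring blocks together.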
Training
Training Pulpie needed a large set of HTML pages with block-level labels. No such public set existed, so we built one.
We sampled 16,670 English pages from Common Crawl, limiting to one per domain. We then used MinerU-HTML to split each page into blocks, and labeled each block as content or boilerplate with DeepSeek V3.2. Further filtering removed empty, corrupted, and otherwise unfit pages, leaving 15,880.
We then ran Dripper 0.6B as a second labeler across all 15,880 pages to flag inconsistent labels. Block-level agreement with DeepSeek was 93.3%. We kept the 14,959 pages where the two labelers agreed on at least 70% of blocks, trading some data for a cleaner training set.
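The agreement filter described above can be sketched as follows. The helper names and the input layout (two parallel lists of 0/1 labels per page) are assumptions made for illustration; only the 70% threshold comes from the text.

```python
from typing import Dict, List, Tuple

MIN_AGREEMENT = 0.70  # page-level threshold stated in the text

# Hypothetical layout: page id -> (labels from labeler A, labels from labeler B),
# where 1 means content and 0 means boilerplate.
PageLabels = Dict[str, Tuple[List[int], List[int]]]


def block_agreement(labels_a: List[int], labels_b: List[int]) -> float:
    """Fraction of blocks on one page where both labelers gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelers must label the same blocks")
    if not labels_a:
        return 0.0  # assumption: pages without blocks are dropped
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def filter_pages(pages: PageLabels, threshold: float = MIN_AGREEMENT) -> List[str]:
    """Return ids of pages whose block-level agreement meets the threshold."""
    return [
        page_id
        for page_id, (labels_a, labels_b) in pages.items()
        if block_agreement(labels_a, labels_b) >= threshold
    ]


if __name__ == "__main__":
    pages = {
        "a": ([1, 0, 0, 1], [1, 0, 1, 1]),  # 3 of 4 blocks agree
        "b": ([1, 1, 1], [0, 0, 1]),        # 1 of 3 blocks agree
    }
    print(filter_pages(pages))
```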
Teaching a teacher
To create our teacher model, we fine-tuned EuroBERT-2.1B on the aforementioned 14,959 pages.
| Setting | Value |
|---|---|
| Learning rate | 2e-5 |
| Effective batch size | 8 |
| Loss | Class-weighted cross-entropy |
| Hardware | 4x A100 |
Class weights are set inversely to the 28.6% content rate to counter the imbalance.
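As a rough illustration of inverse-frequency weighting, the sketch below derives two class weights from the 28.6% content rate. The exact normalization used in training is not stated in the text; the `1 / (n_classes * rate)` form here is one common convention and should be read as an assumption.

```python
from typing import Dict

CONTENT_RATE = 0.286  # share of blocks labeled content, from the text


def inverse_frequency_weights(content_rate: float) -> Dict[str, float]:
    """Weight each class by the inverse of its frequency.

    Assumed normalization: 1 / (n_classes * rate), so a perfectly
    balanced dataset would give every class a weight of 1.0.
    """
    rates = {"content": content_rate, "boilerplate": 1.0 - content_rate}
    n_classes = len(rates)
    return {name: 1.0 / (n_classes * rate) for name, rate in rates.items()}


if __name__ == "__main__":
    for name, weight in inverse_frequency_weights(CONTENT_RATE).items():
        print(f"{name}: {weight:.3f}")
```

Under this convention the rarer content class receives about 2.5 times the weight of boilerplate, which is the ratio of the two class frequencies.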
The teacher scored 0.873 ROUGE-5 F1 on the WebMainBench English set. At 2.1B parameters it is accurate but expensive to run, so we distilled it into smaller models.
Imparting knowledge
For a better production fit, we distilled the 2.1B teacher into two smaller models:
- Pulpie Orange Base, a 610M parameter encoder.
- Pulpie Orange Small, a 210M parameter encoder.
Both students learn from the teacher following Hinton et al. (2015). The teacher’s output distribution, softened at temperature 2.0, supplies most of the signal through a KL-divergence loss weighted 0.7. Hard-label cross-entropy makes up the remaining 0.3. Both train on the same data as the teacher.
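The loss for a single block might look like the sketch below, written in plain Python for readability. The 0.7 weight and temperature 2.0 come from the text. Scaling the soft term by the squared temperature follows Hinton et al. (2015) and is an assumption about this training setup, as are the function names.

```python
import math
from typing import List, Sequence

ALPHA = 0.7        # weight on the soft (KL) term, from the text
TEMPERATURE = 2.0  # softening temperature, from the text


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    hard_label: int,
    alpha: float = ALPHA,
    temperature: float = TEMPERATURE,
) -> float:
    """Weighted sum of soft-target KL divergence and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # Assumption: T^2 scaling, as suggested by Hinton et al. (2015), keeps the
    # soft-term gradients comparable in size to the hard-term gradients.
    soft_term = kl * temperature**2
    # Standard cross-entropy on the unsoftened student distribution.
    hard_term = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_term + (1.0 - alpha) * hard_term


if __name__ == "__main__":
    # Two classes: index 0 = boilerplate, index 1 = content (illustrative).
    print(distillation_loss([0.5, 1.5], [0.2, 2.0], hard_label=1))
```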
The distilled models keep almost all of the teacher’s quality.
| Model | Parameters | ROUGE-5 F1 | vs. Teacher |
|---|---|---|---|
| Pulpie Orange Small | 210M | 0.862 | -1.1 F1 points |
| Dripper | 0.6B | 0.864 | -0.9 F1 points |
| Pulpie Orange Base | 610M | 0.863 | -1.0 F1 points |
| Pulpie Orange Large (teacher) | 2.1B | 0.873 | - |
Despite a tenfold cut in size, the 210M model stays within 1.1 F1 points of the teacher. Combined with its speed and cost benefits, pulpie-orange-small offers the best size-to-quality ratio in the entire family. It is the model we recommend for production use.
Results
Quality
We measure ROUGE-5 F1 on the English subset of WebMainBench (6,647 pages across all difficulty levels). Empty extractions count as zero.
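For readers unfamiliar with the metric, here is a minimal sketch of an n-gram overlap F1 with n = 5. It assumes whitespace tokenization and is not the official WebMainBench scorer, which may tokenize and normalize text differently.

```python
from collections import Counter
from typing import Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction: str, reference: str, n: int = 5) -> float:
    """F1 over clipped n-gram overlap between prediction and reference.

    Assumption: whitespace tokenization. Texts with fewer than n tokens
    have no n-grams and score 0.0, as do empty extractions.
    """
    pred = ngram_counts(prediction.split(), n)
    ref = ngram_counts(reference.split(), n)
    if not pred or not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog today"
    print(rouge_n_f1("the quick brown fox jumps over the lazy dog", reference))
    print(rouge_n_f1("", reference))
```

Using 5-grams rather than single words rewards extractions that preserve long runs of the original text in order, so reshuffled or fragmented output scores poorly.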
| Method | ROUGE-5 F1 | Empty pages |
|---|---|---|
| magic-html | 0.700 | 384 |
| Trafilatura | 0.619 | 16 |
| Pulpie Orange Small | 0.862 | 45 |
| Dripper | 0.864 | 135 |
| Pulpie Orange Base | 0.863 | 36 |
| Pulpie Orange Large | 0.873 | 21 |
Pulpie Orange Large is the strongest single model at 0.873, ahead of Dripper by 0.9 F1 points. The 210M model effectively ties Dripper (0.862 versus 0.864) at a third the size. Frontier LLMs score higher on this benchmark, near 0.90, which is the quality Pulpie approaches.
Dripper returns nothing on 135 pages. In 130 of those cases, the page overflows Dripper’s 32k-token context window. Pulpie packs blocks into 8,192-token chunks, so page length never forces a failure.
Breaking results down by difficulty:
| Method | All | Simple | Mid | Hard |
|---|---|---|---|---|
| magic-html | 0.700 | 0.773 | 0.697 | 0.637 |
| Trafilatura | 0.619 | 0.721 | 0.619 | 0.526 |
| Pulpie Orange Small | 0.862 | 0.906 | 0.868 | 0.813 |
| Dripper | 0.864 | 0.913 | 0.865 | 0.817 |
| Pulpie Orange Base | 0.863 | 0.906 | 0.868 | 0.818 |
| Pulpie Orange Large | 0.873 | 0.914 | 0.879 | 0.827 |
Every method loses ground as pages get harder. The heuristics fall fastest, dropping 14 to 20 F1 points from simple to hard, while the encoders give up about 9 F1 points. Dripper degrades at a similar rate to the encoders, losing about 10 F1 points between simple and hard pages.
Speed
Figure: Throughput by model. Pulpie Orange Small is 20x faster than Dripper on an L4, measured on the same pages.
L4 throughput, on 500 real Common Crawl pages:
| Method | Throughput (pages/sec) | Hardware |
|---|---|---|
| Pulpie Orange Small | 13.7 | L4 |
| Dripper | 0.68 | L4 |
| Pulpie Orange Base | 3.9 | L4 |
| Pulpie Orange Large | 1.3 | L4 |
Pulpie Orange Small runs 20x faster than Dripper on the same L4.
A100 throughput, same pages, GPU inference only, batched for every model:
| Method | Throughput (pages/sec) | Hardware |
|---|---|---|
| Pulpie Orange Small | 25.7 | A100 |
| Dripper | 3.6 | A100 |
| Pulpie Orange Base | 7.7 | A100 |
| Pulpie Orange Large | 3.5 | A100 |
On the A100, Pulpie Orange Small runs 7.1x faster than Dripper. The 2.1B teacher matches Dripper on speed while beating it on quality.
Cost
Figure: Cost per 1B pages. Pulpie Orange Small is 20x cheaper than Dripper on an L4, measured on the same pages.
L4 cost for 1 billion pages at $0.39/hr. Calculated using the throughputs measured above:
| Setup | Pages/sec | GPU-hours / 1B | Cost / 1B pages |
|---|---|---|---|
| Pulpie Small on L4 | 13.7 | 20,300 | ~$7,900 |
| Dripper on L4 | 0.68 | 408,000 | ~$159,000 |
| Pulpie Base on L4 | 3.9 | 71,200 | ~$28,000 |
| Pulpie Large on L4 | 1.3 | 214,000 | ~$83,000 |
A100 cost for 1 billion pages at $2.72/hr. Calculated using the throughputs measured above:
| Setup | Pages/sec | GPU-hours / 1B | Cost / 1B pages |
|---|---|---|---|
| Pulpie Small on A100 | 25.7 | 10,800 | ~$29,000 |
| Dripper on A100 | 3.6 | 77,200 | ~$210,000 |
| Pulpie Base on A100 | 7.7 | 36,100 | ~$98,000 |
| Pulpie Large on A100 | 3.5 | 79,400 | ~$216,000 |
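The arithmetic behind both cost tables is simple enough to reproduce. The sketch below uses the throughputs and hourly prices quoted above; the function name is illustrative.

```python
from typing import Tuple

SECONDS_PER_HOUR = 3600


def cost_per_pages(
    pages_per_sec: float, usd_per_gpu_hour: float, pages: float = 1e9
) -> Tuple[float, float]:
    """Return (GPU-hours, cost in USD) to process `pages` at a given throughput."""
    gpu_hours = pages / pages_per_sec / SECONDS_PER_HOUR
    return gpu_hours, gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    # (pages/sec, $/hr) taken from the L4 and A100 tables above.
    setups = {
        "Pulpie Small on L4": (13.7, 0.39),
        "Dripper on L4": (0.68, 0.39),
        "Pulpie Small on A100": (25.7, 2.72),
        "Dripper on A100": (3.6, 2.72),
    }
    for name, (throughput, price) in setups.items():
        hours, cost = cost_per_pages(throughput, price)
        print(f"{name}: {hours:,.0f} GPU-hours, ${cost:,.0f}")
```

The tables round GPU-hours to three significant figures and costs to the nearest hundred or thousand dollars, so the printed values differ slightly from the tabulated ones.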
Cheap GPUs like Encoders
The throughput gap between Pulpie and Dripper is much larger than a 3x difference in size would imply. On the A100, we measure this gap as 7.1x, and on the L4 it widens to 20x. The reason for this is architectural.
A decoder generates labels one token at a time. Each step reads the full model from GPU memory to produce a single token. Consequently, a decoder’s speed is bound by memory bandwidth. By contrast, an encoder runs one forward pass over the whole input. That pass is dominated by dense matrix multiplies, so it is limited mainly by compute.
On top of that, the A100 and L4 differ more in bandwidth than in compute:
| Dimension | NVIDIA A100 | NVIDIA L4 | Ratio (A100/L4) |
|---|---|---|---|
| Memory Bandwidth | 2,039 GB/s | 300 GB/s | ~6.8x |
| Tensor Core TFLOPS | 312 | 120 | ~2.6x |
Dropping from A100 to L4 starves the bandwidth-bound decoder far more than the compute-bound encoder. This widens the throughput gap and lets Pulpie Orange Large pull ahead on L4 despite matching Dripper on A100.
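One way to check this explanation against the measurements is to compare each model's A100-to-L4 slowdown with the two hardware ratios. The sketch below uses only numbers from the tables above. Note that the two benchmark setups are not described identically (the A100 run is stated to be GPU inference only and batched), so the comparison is indicative rather than exact.

```python
from typing import Dict, Tuple

# Pages/sec from the throughput tables: model -> (A100, L4).
THROUGHPUT: Dict[str, Tuple[float, float]] = {
    "Pulpie Orange Small": (25.7, 13.7),
    "Pulpie Orange Base": (7.7, 3.9),
    "Pulpie Orange Large": (3.5, 1.3),
    "Dripper": (3.6, 0.68),
}

# A100 / L4 hardware ratios from the table above.
HARDWARE_RATIO: Dict[str, float] = {
    "memory bandwidth": 2039 / 300,
    "tensor compute": 312 / 120,
}


def a100_over_l4(model: str) -> float:
    """How many times faster a model runs on the A100 than on the L4."""
    a100, l4 = THROUGHPUT[model]
    return a100 / l4


if __name__ == "__main__":
    for name, ratio in HARDWARE_RATIO.items():
        print(f"hardware {name}: {ratio:.1f}x")
    for model in THROUGHPUT:
        print(f"{model}: {a100_over_l4(model):.1f}x")
```

In these figures the encoders slow down by a factor close to or below the compute ratio, while Dripper's slowdown sits much nearer the bandwidth ratio, which is consistent with the explanation above.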
Get started
The Pulpie models are on Hugging Face. Install the package:
pip install pulpie
Extract clean content from raw HTML:
from pulpie import Extractor
extractor = Extractor() # defaults to Pulpie Orange Small
result = extractor.extract(html)
print(result.markdown) # clean markdown
print(result.n_main, result.n_other) # blocks kept vs dropped
For maximum quality over speed, pick a larger model:
extractor = Extractor(model="large") # "small" (default), "base", or "large"
For bulk processing, the pipeline overlaps CPU preprocessing with GPU inference across one or more GPUs:
from pulpie import Pipeline, PageInput
pipeline = Pipeline(model="small")
results = pipeline.extract_batch(
    [PageInput(html=h, page_id=i) for i, h in enumerate(pages)]
)
All three models are built on EuroBERT (Boizard et al., 2025), use the same <|sep|> block-marker architecture, and share a tokenizer:
| Name | Hugging Face | Parameters | ROUGE-5 F1 | Notes |
|---|---|---|---|---|
| Orange Small | feyninc/pulpie-orange-small-v1 | 210M | 0.862 | Recommended |
| Orange Base | feyninc/pulpie-orange-base-v1 | 610M | 0.863 | Distilled from Large |
| Orange Large | feyninc/pulpie-orange-large-v1 | 2.1B | 0.873 | Teacher |
Pulpie Orange Small is the recommended and default model. It approaches SOTA extraction quality at one twentieth the cost and runs the fastest.
Pulpie is built by Feyn. Find us on GitHub, Hugging Face, or X.
Acknowledgements
Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. We also use their Dripper 0.6B model to cross-validate our training labels. We’re grateful they released their tools and data.
@misc{pulpie2026,
  title        = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
  author       = {Minhas, Bhavnick and Nigam, Shreyash and {Feyn Research}},
  year         = {2026},
  howpublished = {Feyn Field Notes}
}