Why Not Just Use fetch()?
The Honest Answer
You've probably looked at StripFeed and thought: "Why would I pay for this? I can just fetch() the URL myself."
That's a fair question. And honestly, for simple cases, you're right. fetch() works. You hit a URL, you get HTML back, you feed it to your LLM. Job done.
Until it isn't.
The gap between "fetch a URL" and "give my AI agent clean, token-efficient content" is wider than it looks. It's not one problem. It's nine problems stacked on top of each other. And each one is just annoying enough that you'd rather not solve it yourself.
What fetch() Actually Returns
Let's say your AI agent needs to read a blog post. Here's what you get:
const response = await fetch("https://example.com/blog/great-article");
const html = await response.text();
// Now what?
That html variable contains the article, sure. But it also contains:
- <nav> with 40+ navigation links
- <script> tags for analytics, ads, and tracking
- <style> blocks with hundreds of lines of CSS
- Cookie consent banners
- Social sharing widgets
- Related article sidebars
- Footer with sitemap links
- <meta>, <link>, and <head> noise
- Inline SVGs for icons
Here's what that looks like in tokens:
| What you get | Tokens | Useful for your LLM? |
|---|---|---|
| Navigation + header | ~2,400 | No |
| Scripts + tracking | ~3,800 | No |
| CSS + styles | ~2,200 | No |
| Sidebar + related posts | ~1,600 | No |
| Footer + cookie banner | ~1,200 | No |
| The actual article | ~3,100 | Yes |
| Total raw HTML | ~14,300 | |
That's 78% noise. Your agent is spending tokens reading <div class="ad-wrapper"> and onclick="gtag('event', 'click')" instead of the content it actually needs.
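Where does that 78% come from? It's just the table above (illustrative estimates, not live measurements):

```python
# Noise ratio from the example token breakdown above.
total_tokens = 14_300   # full raw HTML
useful_tokens = 3_100   # the actual article
noise_ratio = (total_tokens - useful_tokens) / total_tokens
print(f"{noise_ratio:.0%} of the raw HTML is noise")  # prints "78% of the raw HTML is noise"
```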
Why This Matters (Even on a Subscription)
"But I'm on Claude Pro / ChatGPT Plus. I don't pay per token." Fair. But wasted tokens still hurt you in three ways:
- Pay-per-token (API): You're paying for every token. At 78% noise, your bill is roughly 4-5x higher than it needs to be. This is the most obvious case.
- Subscription plans: You don't pay per token, but you have usage limits. Fewer wasted tokens per request means your agent runs longer before hitting that limit. Instead of pausing every few hours, it keeps going.
- Context window quality: This one matters regardless of how you pay. When your agent's context is full of navigation HTML, cookie banners, and tracking scripts, it has less room for actual content. The model's responses get worse because signal-to-noise ratio drops. Clean Markdown means your agent understands the page better and produces better results.
The 9-Step DIY Pipeline
To go from fetch() to clean, LLM-ready content, here's what you'd need to build:
1. HTML Parsing. You need a DOM parser. In Node.js that means jsdom (heavy, 2MB+) or linkedom (lighter). In Python, BeautifulSoup or lxml. Each has quirks with malformed HTML.
2. Content Extraction. The hard part. You need to identify which part of the page is "the content" and which is chrome. Mozilla's Readability algorithm does this, but it's a non-trivial dependency to integrate and configure. It doesn't work on every site.
3. Noise Removal. Even after extraction, you'll find leftover junk: empty links, tracking pixels disguised as images, inline scripts, hidden elements. You need custom rules for these.
4. Markdown Conversion. HTML to Markdown sounds simple until you handle nested lists, code blocks with language hints, tables, and edge cases like <pre> inside <blockquote>. Libraries like Turndown help, but need custom rules to produce clean output.
5. Token Counting. Different models use different tokenizers. If you care about costs (and you should), you need tiktoken or js-tiktoken with the right encoder for your model.
6. Smart Truncation. If the content exceeds your token budget, you can't just slice the string. You'll cut words, sentences, or even Markdown syntax in half. You need paragraph-boundary or sentence-boundary truncation.
7. Caching. You don't want to re-fetch and re-process the same URL every time your agent encounters it. That means Redis, Memcached, or at minimum a filesystem cache with TTL expiry.
8. Rate Limiting. Hammering a site with rapid requests gets you blocked. You need backoff logic, request queuing, or rate limiting on your own code.
9. Edge Cases. Timeouts (some sites take 10+ seconds). Redirects (HTTP to HTTPS, www to non-www, paywall redirects). Sites that already serve Markdown (like some Cloudflare-enabled sites responding to Accept: text/markdown). Sites with JavaScript-rendered content. 403 responses. Gzipped responses. Character encoding issues.
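Most of these steps are a dozen lines each once you know what to write. Step 8, for example, might be sketched like this in Python (the function name and defaults are illustrative, not any library's API):

```python
import random
import time

def fetch_with_backoff(fetch_fn, url, max_attempts=4, base_delay=0.5):
    """Retry a flaky fetch with exponential backoff plus jitter.

    `fetch_fn` is any callable that raises on failure
    (e.g. a requests.get wrapper). Delays grow 0.5s, 1s, 2s, ...
    """
    for attempt in range(max_attempts):
        try:
            return fetch_fn(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; let the caller handle it
            # Exponential backoff with jitter to avoid hammering the site
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

A real version would also distinguish retryable failures (timeouts, 429s, 503s) from permanent ones (404s) instead of catching everything.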
Each of these is solvable. None of them is hard in isolation. But together, they're a week of work that has nothing to do with your actual product.
The Code Comparison
Here's the DIY approach:
TypeScript (DIY)
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";
import { encodingForModel } from "js-tiktoken";

const encoder = encodingForModel("gpt-4o");
const turndown = new TurndownService({ headingStyle: "atx" });

async function fetchAndStrip(url: string, maxTokens?: number) {
  const response = await fetch(url, {
    signal: AbortSignal.timeout(9000),
    headers: { "User-Agent": "MyAgent/1.0" },
  });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);

  const html = await response.text();
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article?.content) throw new Error("Extraction failed");

  let markdown = turndown.turndown(article.content);

  // Basic noise cleanup
  markdown = markdown
    .replace(/\[]\(.*?\)/g, "") // empty links
    .replace(/!\[.*?\]\(data:.*?\)/g, "") // data-uri images
    .replace(/\n{3,}/g, "\n\n"); // excess newlines

  const tokens = encoder.encode(markdown);
  if (maxTokens && tokens.length > maxTokens) {
    // Naive truncation (won't respect paragraph boundaries)
    markdown = encoder.decode(tokens.slice(0, maxTokens));
  }

  return {
    markdown,
    tokens: tokens.length,
    title: article.title,
  };
}
That's ~35 lines, four dependencies, no caching, no rate limiting, and the truncation cuts mid-sentence.
TypeScript (StripFeed)
import StripFeed from "stripfeed";
const sf = new StripFeed("sf_live_your_key");
const result = await sf.fetch("https://example.com/article", {
maxTokens: 3000,
});
console.log(result.markdown);
A handful of lines. Caching, rate limiting, smart truncation, and token counting included.
Python (DIY)
import requests
from readability import Document
from markdownify import markdownify
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")

def fetch_and_strip(url: str, max_tokens: int | None = None) -> dict:
    resp = requests.get(url, timeout=9, headers={"User-Agent": "MyAgent/1.0"})
    resp.raise_for_status()

    doc = Document(resp.text)
    markdown = markdownify(doc.summary(), heading_style="ATX")

    tokens = encoder.encode(markdown)
    if max_tokens and len(tokens) > max_tokens:
        # Naive truncation (won't respect paragraph boundaries)
        markdown = encoder.decode(tokens[:max_tokens])

    return {
        "markdown": markdown,
        "tokens": len(tokens),
        "title": doc.title(),
    }
Python (StripFeed)
from stripfeed import StripFeed
sf = StripFeed("sf_live_your_key")
result = sf.fetch("https://example.com/article", max_tokens=3000)
print(result.markdown)
Same story: a few lines versus a custom pipeline.
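For comparison, the "smart truncation" both DIY versions skip isn't complicated in principle, just fiddly. A minimal paragraph-boundary version might look like this (a sketch: it uses a character budget as a stand-in for a token budget, and the function name is made up):

```python
def truncate_at_paragraph(markdown: str, max_chars: int) -> str:
    """Cut at the last paragraph break that fits the budget,
    so no sentence or Markdown construct is sliced in half."""
    if len(markdown) <= max_chars:
        return markdown
    cut = markdown.rfind("\n\n", 0, max_chars)
    if cut <= 0:
        return markdown[:max_chars]  # no boundary in range: hard cut
    return markdown[:cut]
```

A production version would work in token space (re-encoding each candidate cut point) and fall back to sentence boundaries inside very long paragraphs.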
What You Don't Get with fetch()
Beyond the basic extraction pipeline, StripFeed handles things that are genuinely hard to build yourself:
| Feature | DIY with fetch() | StripFeed |
|---|---|---|
| Content extraction | Build it yourself (Readability + custom rules) | Built-in |
| Markdown conversion | Configure Turndown/markdownify + edge cases | Built-in |
| Token counting | Add tiktoken, manage encoders | Every response includes token count |
| Smart truncation | Build paragraph/sentence boundary logic | max_tokens parameter, truncates cleanly |
| Caching | Set up Redis or similar | Built-in, configurable TTL (up to 24hr) |
| Batch processing | Write parallel fetch + error handling | POST /api/v1/batch, up to 10 URLs |
| CSS selector extraction | Integrate a DOM query layer | selector parameter |
| Multiple output formats | Build format converters | format=markdown\|json\|text\|html |
| Usage analytics | Build a logging pipeline + dashboard | Dashboard with per-key, per-model stats |
| Cost tracking per model | Manual calculation | Pass model parameter, see costs in dashboard |
When fetch() Is Actually Fine
Let's be real. You don't always need StripFeed.
fetch() is enough when:
- You're scraping specific sites you know well. A custom pipeline tuned to their HTML structure will always beat a generic solution. You know exactly where the content lives, what to strip, what to keep.
- You're fetching a handful of pages per day from sites you control
- The content is simple and well-structured (like an API docs page)
- You don't care about token optimization
- You're in a prototype phase and just need something working
StripFeed earns its keep when:
- Your agent receives arbitrary URLs from users or other systems and needs to handle whatever it finds. You can't write a custom scraper for every site on the internet.
- Your agent reads diverse sources (news sites, blogs, documentation, forums). Each has different HTML structure.
- You're processing hundreds or thousands of pages. Token savings compound fast. Using the example breakdown above (~11,200 noise tokens per page), 1,000 pages/day means saving on the order of 11M tokens daily.
- You care about cost tracking and want to know exactly what each URL costs per model.
- You need reliable extraction across the messy web, not just clean sites.
- You'd rather spend your engineering time on your actual product instead of maintaining a content extraction pipeline.
Try It Yourself
The fastest way to see the difference is the live demo. Paste any URL and compare the raw HTML tokens against the clean Markdown output.
Or grab a free API key and try it in your pipeline. 200 requests/month, no credit card:
curl "https://www.stripfeed.dev/api/v1/fetch?url=https://example.com" \
-H "Authorization: Bearer sf_live_your_key"
Sign up free and see what your agent has been missing.