TL;DR: Miasma is an open-source tool that traps AI web scrapers inside an infinite maze of convincing-but-fake, AI-generated pages. When a scraper crawls your site, Miasma detects it and redirects it into a honeypot that grows endlessly — wasting the scraper's time, bandwidth, and compute on worthless poisoned content. Real visitors are never affected. It's currently trending on Hacker News with 119+ points and is free to deploy.
Why AI Coders Need to Know About This
There's a war happening on the web right now, and most vibe coders are caught in the middle without knowing it.
On one side: AI companies and startups running massive web crawlers that scrape content to build training datasets. On the other side: developers, writers, and site owners who didn't consent to having their work vacuumed up and fed into AI models.
This matters to you as a builder on both ends. If you're using AI tools like Claude Code to build apps, the AI models you rely on were trained on data scraped from the web — data that may or may not have been taken with permission. And if you're building sites, writing tutorials, or publishing content, there's a decent chance your work has already been scraped and used as training data without you knowing.
Miasma is a new tool in the arms race that's fascinating for two reasons: it uses AI to fight AI, and it tells us a lot about the mechanics of how scraping — and anti-scraping — actually work. Understanding it makes you a smarter builder, even if you never deploy it.
This is also connected to the broader landscape of security risks that AI coders face — because the same scrapers that harvest training data are often the same infrastructure used for other automated attacks. Understanding one helps you understand the other.
And if you're thinking about supply chain attacks, the poisoned data angle is especially relevant: feeding bad data into AI training pipelines is one of the emerging threat vectors that security researchers are actively studying.
The Real Scenario: When Your Site Gets Scraped
Here's a concrete situation that plays out thousands of times every day.
You've spent six months writing a technical blog about web development. It's good, original content — tutorials, deep dives, opinionated takes. You've got a few thousand readers, your SEO is starting to pay off, and you're proud of what you've built.
What you asked your AI assistant:
"Can you write me a beginner's guide to CSS Grid? I want something conversational and practical."
The AI writes you a solid guide. You publish it. A few weeks later, you notice something odd in your server logs: a flood of requests from an IP block you don't recognize, hitting every single page on your site in rapid sequence, following every link, downloading every piece of content. The requests have no browser fingerprint. They don't load images or CSS. They're moving through your content like a combine harvester.
That's a scraper. And it's not reading your content to learn — it's copying it to feed into a training pipeline somewhere. Your six months of work, along with your writing style, your examples, your unique takes — all of it is about to become someone else's training data.
Your options, until recently, were limited. You could add the scraper's IP to a blocklist, but they rotate IPs constantly. You could update your robots.txt, but aggressive scrapers ignore it. You could add CAPTCHAs, but that punishes real visitors too. You could rate-limit requests, but sophisticated scrapers throttle themselves to avoid detection.
Miasma takes a completely different approach. Instead of blocking the scraper, it welcomes it — and then traps it.
How Miasma Works: A Maze That Builds Itself
The core idea behind Miasma is simple but devious: instead of slamming the door on scrapers, you open a door that leads into a maze. The maze never ends. Every path the scraper follows generates more paths. The scraper can crawl for hours and never find its way out — and everything it's collecting is garbage.
Here's how the system works in practice.
Step 1: Bot Detection
When a request hits your server, Miasma analyzes it for signals that distinguish automated crawlers from real human visitors. Real browsers send a consistent set of HTTP headers. They execute JavaScript. They request supporting assets like CSS and images alongside HTML pages. They follow patterns that reflect actual human browsing behavior.
Scrapers are different. They typically:
- Skip loading CSS and images (why waste bandwidth on stuff you're not going to process?)
- Don't execute JavaScript, or execute it differently than a browser would
- Crawl in patterns that are too fast, too systematic, or too broad to be human
- Identify themselves with unusual User-Agent strings, or try to spoof common ones imperfectly
- Follow every link on a page, including ones real users would never click
When Miasma detects these signatures, it silently redirects the suspect traffic away from your real content and into the honeypot.
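The signals above can be sketched as a simple scoring heuristic. To be clear, this is not Miasma's actual detection code; the signal names, weights, and threshold below are illustrative assumptions about how such a scorer might be built.

```python
# Illustrative bot-scoring sketch -- not Miasma's real implementation.
# Weights and the threshold are assumptions chosen for demonstration.

def bot_score(headers: dict, requests_per_second: float, loaded_assets: bool) -> int:
    """Return a suspicion score; higher means more bot-like."""
    score = 0
    # Real browsers almost always send these headers.
    for expected in ("Accept-Language", "Accept-Encoding", "User-Agent"):
        if expected not in headers:
            score += 2
    # Real users fetch CSS and images alongside the HTML.
    if not loaded_assets:
        score += 2
    # Humans rarely sustain more than a few page loads per second.
    if requests_per_second > 3:
        score += 3
    return score

def is_probable_bot(headers: dict, rps: float, loaded_assets: bool,
                    threshold: int = 5) -> bool:
    """Classify a client once its score crosses the threshold."""
    return bot_score(headers, rps, loaded_assets) >= threshold
```

A headless scraper with no browser headers, no asset loads, and ten requests per second scores well past the threshold; a normal browser session scores zero.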
Step 2: The Honeypot Entrance
The scraper doesn't know anything has changed. It still gets an HTTP 200 response. The page it receives looks structurally like your real site — same basic HTML patterns, realistic-looking URLs, content that appears topically related to your actual subject matter. But it's all fake.
Miasma uses an AI model to generate this content on the fly. Not static fake pages that a sophisticated scraper might eventually catalog and recognize — but dynamically generated, contextually coherent content that looks plausible at the level of detail that matters for training data collection.
Crucially, every fake page is full of links. Lots of them. And they all lead deeper into the maze.
Step 3: The Infinite Loop
Here's where it gets elegant. Every page the scraper lands on generates fresh links to more fake pages. Those pages generate more links. The graph of fake content is technically infinite — Miasma generates it procedurally rather than storing it. There's no bottom to this well.
A scraper trying to index your site completely will just keep going. And going. And going. Meanwhile, your real content sits untouched on the other side of the honeypot entrance. Real visitors get the real site. Scrapers get an endless hall of mirrors.
Step 4: Poisoned Training Data
This is where Miasma goes beyond just wasting scrapers' time. The fake content it generates isn't just random noise — it's plausible-looking but subtly wrong. Facts mixed with fictions. Syntax that almost works but doesn't. Advice that sounds reasonable but leads to bad outcomes if followed.
If a scraper is collecting this content to train an AI model, that model gets poisoned. The training data is contaminated with convincing-looking misinformation that degrades the model's reliability on the topics your site covers. The scraper doesn't just fail to get your content — it actively gets worse at its job.
Understanding Each Part of the System
Let's break down the components that make Miasma tick. You don't need to implement any of this — but understanding the pieces helps you think clearly about both web security and AI training pipelines.
The Bot Detection Layer
Bot detection is harder than it sounds. The arms race between scrapers and anti-scrapers has been going on for over a decade. Early bot detection was simple: block known bad IPs or User-Agent strings. Scrapers adapted — they rotate IPs, spoof User-Agents, and use residential proxy networks that make their traffic look like it's coming from normal home internet connections.
Modern bot detection relies on behavioral signals that are harder to fake. Miasma's approach focuses on HTTP-level signals — the shape of the request itself, not just where it came from. A request that's missing the Accept-Language header, skips CSS, and arrives at exactly 1.2-second intervals from an IP that just started sending requests isn't human, regardless of what the User-Agent string says.
No detection is perfect. Sophisticated scrapers running headless Chrome instances can fake most of these signals. But raising the cost of accurate scraping is itself valuable — it filters out the cheap, high-volume scrapers while making targeted scraping more expensive.
The Content Generation Engine
The fake pages Miasma generates aren't just random gibberish — they're contextually coherent enough to fool a scraper's content quality filters. This is the clever part: Miasma uses an AI model (the same kind of model that scrapers are trying to build training data for) to generate the honeypot content.
The content has to be convincing at a statistical level. Scrapers often run quality filters that reject pages that are obviously low-quality — pages with very short content, no sentence structure, or obvious repetition. Miasma's generated content passes these filters by design.
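To make the bar concrete, here's a toy version of the kind of quality filter a scraper might run; the specific thresholds are illustrative assumptions, not any real scraper's pipeline. Honeypot content has to clear checks like these.

```python
import re

def passes_quality_filter(text: str) -> bool:
    """Toy content-quality filter of the kind scrapers apply to
    collected pages. Thresholds are illustrative assumptions."""
    words = text.split()
    if len(words) < 50:
        return False  # too short to be a real article
    sentences = re.split(r"[.!?]+\s", text)
    if len(sentences) < 3:
        return False  # no sentence structure
    # Crude repetition check: a low unique-word ratio means spam.
    if len(set(w.lower() for w in words)) / len(words) < 0.3:
        return False
    return True
```

Random gibberish or a repeated phrase fails one of these checks immediately; coherent AI-generated prose passes all of them, which is exactly why Miasma generates the latter.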
But the content is subtly poisoned. It's the difference between a fake map with realistic-looking but wrong street names and an obviously blank page. The fake map wastes far more of the enemy's time, and if they try to navigate by it, they end up somewhere wrong.
The Infinite Link Graph
The real genius of Miasma's architecture is procedural generation. Instead of pre-building a finite set of fake pages, it generates them on demand based on a seed. Every fake URL maps deterministically to a fake page with a consistent set of outbound links — so if a scraper revisits the same fake URL, it gets the same fake content (Miasma isn't wasting compute regenerating it). But the link graph extends infinitely in every direction.
Think of it like a fractal. You can zoom in forever and always find more structure. A scraper trying to crawl the whole thing will never finish — and it has no way to know it's in a maze.
This is also why Miasma is cheap to run. It's not storing millions of fake pages — it's storing the algorithm that generates them. The honeypot scales with the scraper's effort at essentially no cost to you.
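The deterministic URL-to-links mapping can be sketched with nothing more than a hash function. This is a simplified illustration of the mechanics, not Miasma's actual generator (which also produces page content with an AI model); the path scheme and fanout are assumptions.

```python
import hashlib

def fake_links(url: str, fanout: int = 5) -> list[str]:
    """Deterministically derive outbound honeypot links from a URL.

    The same URL always yields the same links, so revisits are
    consistent without storing anything -- but every page fans out
    to fresh, unseen pages, so the graph extends infinitely.
    """
    seed = hashlib.sha256(url.encode()).hexdigest()
    return [
        "/maze/" + hashlib.sha256((seed + str(i)).encode()).hexdigest()[:12]
        for i in range(fanout)
    ]
```

With a fanout of 5, a scraper that crawls three levels deep has already queued over a hundred pages, none of which were ever stored on disk.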
The Real-Site Passthrough
Everything above only applies to detected bots. Legitimate traffic — real browsers from real humans, ethical search crawlers like Googlebot that follow robots.txt and identify themselves honestly — bypasses the honeypot entirely and gets your real content. This passthrough is critical: an anti-scraping tool that also breaks your SEO or hurts your real users is worse than no tool at all.
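The passthrough decision can be sketched as a tiny routing function. The allowlist tokens and the ordering here are illustrative assumptions; the key point is that the allowlist check runs before bot detection, so an overzealous heuristic can never trap a known-good crawler.

```python
# Hypothetical allowlist -- in practice you'd also verify IPs,
# since User-Agent strings alone can be spoofed.
KNOWN_GOOD_TOKENS = ("Googlebot", "Bingbot")

def choose_backend(user_agent: str, bot_detected: bool) -> str:
    """Decide which backend serves a request.

    Order matters: allowlisted crawlers bypass detection entirely,
    detected bots get the maze, and everyone else gets the real site.
    """
    if any(token in user_agent for token in KNOWN_GOOD_TOKENS):
        return "real_site"
    if bot_detected:
        return "honeypot"
    return "real_site"
```

Note the default: when in doubt, serve the real site. A false negative costs you one scraped page; a false positive costs you a reader.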
Understanding this passthrough logic also connects to how Content Security Policy headers work: both are server-side controls whose effect depends on how the client behaves.
What AI Gets Wrong About Web Scraping Defense
If you ask an AI assistant "how do I protect my site from scrapers," you'll typically get a list of standard advice that misses the most important nuances. Here's what to watch out for.
"Just update your robots.txt"
Robots.txt is a polite convention, not a technical barrier. Ethical crawlers — Googlebot, Bingbot, and similar — respect it because their business model depends on being trusted by site owners. Aggressive AI training scrapers have no such incentive. They're not building a search engine you'll ever see the results of. They're building a private dataset. Telling them "please don't scrape my site" in robots.txt is like posting "no trespassing" signs on a fence that's already been cut open.
Miasma treats robots.txt as a courtesy signal, not a defense. Ethical bots that respect robots.txt get through (or get excluded politely). Bots that ignore it get the maze.
"Rate-limit your API"
Rate limiting is great for protecting your API from abuse — it's a cornerstone of good OWASP-aligned API security. But for scraping defense, it's a speed bump, not a wall. Sophisticated scrapers throttle their own request rate to stay under rate limits. They spread requests across hundreds of IP addresses. They behave like many individual users rather than one aggressive bot. Rate limiting slows them down; it doesn't stop them.
"Block the IP addresses"
IP blocking is the security equivalent of whack-a-mole. Scraping operations rotate through massive pools of IP addresses — residential proxies, cloud provider ranges, VPN exit nodes. You can block every IP you've ever seen and the next crawl will come from a completely fresh batch. Worse, if you're too aggressive, you'll start blocking legitimate users sharing IPs with scrapers on the same ISP or corporate network.
The insight Miasma has that AI advice misses
Most anti-scraping advice focuses on stopping the scraper — blocking it, rate-limiting it, confusing it. Miasma's insight is that stopping a scraper isn't always better than redirecting it. A stopped scraper moves on and tries your neighbor's site. A trapped scraper wastes its entire operational budget on your honeypot. And a poisoned scraper actively degrades the quality of whatever it was building.
This is the same conceptual leap that separates reactive security from proactive security. Instead of just defending your perimeter, you're making the attack itself costly and counterproductive.
It's a bit like the concept behind AI sandbox security — rather than just asking nicely for better behavior, you create a technical reality that makes bad behavior impossible or self-defeating.
How to Think About Deploying (and Debugging) Miasma
If you're considering deploying Miasma on a real site, here are the practical considerations you need to think through. This isn't a step-by-step setup guide — the Miasma GitHub repository has that — but rather the judgment calls that determine whether it's right for your situation.
Is your site a good candidate?
Miasma works best for sites with the following characteristics:
- Original, high-value written content. If you're publishing tutorials, articles, creative writing, or technical documentation, you're exactly the type of content AI scrapers are hunting for. The higher the quality of your real content, the more motivated scrapers are to get it, and the more value Miasma provides.
- Consistent site structure. Miasma needs to distinguish real pages from honeypot pages. Sites with chaotic URL structures or that dynamically generate lots of content may be harder to integrate with cleanly.
- Server-side control. You need to be able to add middleware or route handling to your server — you can't deploy Miasma on a purely static site hosted on GitHub Pages or Netlify without a serverless function layer.
The false positive problem
The biggest risk with any bot detection system is false positives — flagging legitimate traffic as bots and redirecting real users into the honeypot. This is genuinely bad. A real user stuck in a maze of fake content has a terrible experience, loses trust in your site, and probably never comes back.
Miasma's bot detection logic should be tuned conservatively, especially when you first deploy. Start with high-confidence signals — things that almost no real browser does — and gradually tighten as you gain confidence in the detection accuracy. Check your server logs after deploying to look for patterns that suggest legitimate users are getting misclassified.
The SEO impact question
Search engines are a special case. Googlebot, Bingbot, and other legitimate search crawlers are technically bots — they're automated and they scrape your content. But you want them to do that. Miasma needs to let them through while trapping the scrapers you don't want.
Google's crawlers identify themselves honestly with well-known User-Agent strings and crawl from documented IP ranges. Miasma should whitelist these explicitly. After deploying, monitor your Google Search Console crawl stats to confirm that Googlebot is still successfully crawling your real pages. A drop in crawl coverage after deploying Miasma is a sign that your detection is too aggressive.
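Google documents a verification procedure for exactly this: reverse-DNS the client IP, check that the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP (so a scraper can't pass by faking reverse DNS alone). Here's a sketch of that procedure; the helper names are ours, but the suffix check follows Google's published guidance.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname: str) -> bool:
    """Check the reverse-DNS hostname suffix per Google's documented
    crawler-verification procedure."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(client_ip: str) -> bool:
    """Full verification: reverse lookup, suffix check, then a forward
    lookup confirming the hostname resolves back to the same IP."""
    try:
        hostname = socket.gethostbyaddr(client_ip)[0]
    except OSError:
        return False
    if not hostname_is_google(hostname):
        return False
    try:
        return client_ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

Because DNS lookups are slow, you'd typically cache verification results per IP rather than resolving on every request.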
The AI cost consideration
Generating convincing honeypot content with an AI model isn't free. If a scraper is very aggressive and hitting your honeypot at scale, Miasma is making API calls to generate fake content for every request. Depending on your API pricing and traffic volume, this could get expensive.
There are ways to mitigate this: caching generated pages (remember, Miasma's procedural generation means the same fake URL always gets the same content), rate-limiting even the honeypot responses, or using a cheaper/faster local model for content generation instead of a cloud API. But it's a real consideration to think through before deploying at scale.
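The caching idea falls out naturally from the deterministic design. Here's a minimal sketch using Python's built-in `lru_cache`; the generation function is a hypothetical stand-in for whatever model call Miasma actually makes.

```python
from functools import lru_cache

CALLS = {"count": 0}  # tracks how many real generations happen

def expensive_ai_generate(url: str) -> str:
    """Hypothetical stand-in for a costly model API call."""
    CALLS["count"] += 1
    return f"<html><body>Generated maze page for {url}</body></html>"

@lru_cache(maxsize=10_000)
def honeypot_page(url: str) -> str:
    """Return the fake page for a honeypot URL, generating it at most
    once. Caching is safe precisely because the URL-to-page mapping
    is deterministic: repeat requests must return identical content
    anyway, so serving a cached copy changes nothing the scraper sees."""
    return expensive_ai_generate(url)
```

A scraper hammering the same URLs costs you one generation per unique page, not one per request; only genuinely new maze pages trigger the expensive call.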
What to do when Miasma breaks something
If you deploy Miasma and start seeing unexpected behavior — broken RSS feeds, third-party integrations failing, monitoring tools going silent — the first thing to check is whether those services are being misclassified as scrapers.
Common services that may look like bots to Miasma's detection layer:
- Uptime monitoring services (they make automated HTTP requests to your pages)
- RSS feed readers and podcast aggregators
- Link preview generators (Slack, Twitter, iMessage all fetch pages to generate previews)
- Analytics platforms that verify page data
- CDN health checks
All of these are legitimate automated clients that you want to keep working. Build an allowlist of known-good User-Agent strings and IP ranges for services you rely on, and exempt them from bot detection before tightening your rules.
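An allowlist check along these lines might sit in front of the detection layer. The service tokens and IP range below are placeholders, not a recommended list; substitute the services and documented ranges you actually depend on.

```python
import ipaddress

# Hypothetical entries -- replace with the services you actually rely on.
ALLOWED_UA_TOKENS = ("UptimeRobot", "Slackbot", "Feedly")
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # example range

def is_exempt(user_agent: str, client_ip: str) -> bool:
    """True if the client matches a known-good service and should
    bypass bot detection entirely."""
    if any(token in user_agent for token in ALLOWED_UA_TOKENS):
        return True
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```

A caveat worth stating: User-Agent tokens are trivially spoofable, so for anything security-sensitive, pair the token check with the service's published IP ranges rather than trusting the string alone.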
This kind of systematic thinking about who's making requests to your server — and why — is a useful mental model that applies well beyond Miasma. It's part of the same security mindset covered in our AI sandbox security risks guide.
What to Learn Next
Miasma sits at an intersection of several topics that are worth understanding more deeply as a builder. Here's where to go next depending on what resonated most.
If you're thinking about protecting your own site
Start with the fundamentals of what headers and server responses actually control. Our guide to Content Security Policy walks through how servers use response headers to communicate policies to clients — a related mechanism that helps you understand the request/response layer that Miasma operates in.
If you're thinking about AI training data and trust
The fact that AI models are trained on scraped web data has real implications for how much you should trust AI-generated outputs. If training data contains poisoned content (intentionally or accidentally), the model's reliability degrades on those topics. This connects to the concept of supply chain attacks — where the attack isn't on your code directly, but on the upstream inputs your code depends on.
If you're thinking about the security of what your AI tools produce
The same scrapers harvesting training data are often running on infrastructure that intersects with broader threat actors. Our OWASP Top 10 for AI coders guide covers the full landscape of security risks you face when building with AI — not just from scrapers, but from the AI-generated code itself.
If you want to understand the web fundamentals underneath all of this
Miasma operates entirely in the layer of HTTP requests, HTML documents, and URL structures. If those fundamentals feel fuzzy, our guide to what HTML actually is is a good grounding read — understanding the document structure that scrapers are parsing helps you understand why Miasma's fake pages need to look structurally plausible.
If you want to go deeper on AI agent security more broadly
Miasma is a defensive tool, but the same techniques — generating convincing-looking synthetic content, operating at scale, and bypassing standard defenses — show up on the offensive side of AI security too. Our overview of jai, the AI agent sandbox, covers a complementary threat: not scrapers attacking your site, but AI agents you're running that might do unexpected things to your own files.
Frequently Asked Questions
Does Miasma break my site for real visitors?
No. Miasma's bot detection specifically targets the signatures of automated scrapers — missing browser headers, no JavaScript execution, unusual crawl patterns. Real users browsing your site with a normal web browser don't exhibit these signals and don't get redirected into the honeypot. Your actual content stays exactly as it is for real visitors. The only risk is a misconfigured detection layer that's too aggressive — which is why you should tune it conservatively and monitor your logs after deploying.
Is Miasma legal to use on my own website?
Yes. Deploying Miasma on your own site is legal in the same way that serving any content on your own server is legal. You have the right to decide what your server responds with to incoming requests, and serving fake content to unwanted automated clients is a recognized defensive technique. Miasma doesn't actively attack anyone — it passively serves content to whoever asks for it. That said, for commercial deployments or anything beyond a personal site, running your specific setup by legal counsel is always reasonable due diligence.
Can sophisticated AI scrapers detect and avoid Miasma?
Potentially, yes — and this is an honest limitation worth understanding. Miasma is a move in an ongoing arms race. Scrapers running full headless browser instances (rendering pages exactly like Chrome does) can fake most behavioral signals. Scrapers with enough compute to run quality checks on collected content might detect the subtle wrongness of AI-generated honeypot text over time. Miasma raises the cost of accurate scraping significantly, which is valuable — but it's not a permanent, unbreakable solution. The team will continue evolving the detection and generation techniques as scrapers evolve.
Does Miasma affect my SEO?
It shouldn't, if deployed correctly. Legitimate search crawlers like Googlebot identify themselves transparently and follow web standards — Miasma is designed to recognize and pass through these known-good bots. After deploying, check your Google Search Console for any changes in crawl coverage or index counts. A healthy deployment will show unchanged or improved SEO (since your real content is no longer mixed with scraper noise), while a misconfigured deployment might accidentally exclude Googlebot. Monitor closely in the first week after launch.
What's the difference between Miasma and a robots.txt file?
Robots.txt is a polite request. It works on the honor system — ethical bots comply, and aggressive scrapers that are explicitly trying to harvest your content without permission often don't. Miasma is the technical response when the polite request gets ignored. Instead of saying "please don't scrape my site," it says nothing at all — it just leads the scraper into an infinite maze. Robots.txt and Miasma aren't competing approaches; they work at different layers. You should have both: robots.txt for the bots that respect it, and Miasma for the ones that don't.