TL;DR
Web scraping is automated data collection from websites. When you ask AI to grab prices, listings, or content from a site, it generates a scraper — a script that visits pages, reads the HTML, and extracts the data you want. It is powerful, common, and sometimes legally risky.
Why AI Coders Need to Know This
Here is a pattern that plays out constantly in vibe coding: you are building a side project — maybe a price tracker, a job board aggregator, or a tool that monitors competitor inventory. You tell Claude or Cursor something like "pull all the product names and prices from this URL and save them to a CSV." Within seconds, the AI hands you a fully working script.
That script is a web scraper. And unless you understand what it does, you are running code that visits other people's websites on autopilot, potentially hundreds or thousands of times, without knowing whether that is fine or whether it is going to get your IP address blocked, your hosting account suspended, or worse.
Web scraping sits at the intersection of several things vibe coders already work with: JavaScript, Node.js, HTML structure, and async/await patterns. AI generates scrapers so fluently that it is easy to forget you are deploying an automated bot against someone else's infrastructure. This article gives you the knowledge to use scraping responsibly and fix it when it breaks.
What Web Scraping Actually Does
At its core, web scraping is straightforward. A scraper does three things:
- Fetches a web page — just like your browser does when you type a URL, but in code.
- Reads the HTML — the raw structure of the page, including all the tags, classes, and text content.
- Extracts specific data — pulls out the pieces you care about (prices, titles, dates, links) and ignores the rest.
Think of it like this: when you visit a store's website and manually copy prices into a spreadsheet, that is scraping — done by hand. A web scraper automates that process so you can collect data from hundreds or thousands of pages without clicking through each one yourself.
The difference between scraping and using an API is important. An API is a front door — the website deliberately built it so programs can request data in a clean, structured format. Scraping is more like reading the menu posted in the window. You are getting the information from the public-facing page, but you are doing it in a way the site may not have intended for automated tools.
The three tools AI reaches for
When you ask AI to scrape something, it almost always picks one of three Node.js libraries. Each one works differently, and knowing which is which saves you hours of confusion.
Cheerio
What it is: A fast, lightweight HTML parser. It downloads the raw HTML of a page and lets you search through it using CSS selectors — the same selectors you use in stylesheets.
When AI uses it: Simple, static pages where the content is already in the HTML when it loads. Blog posts, news articles, product pages on older sites.
Limitation: If the page loads content with JavaScript after the initial load (like most modern React or Vue apps), Cheerio sees nothing — just an empty shell.
Puppeteer
What it is: A Node.js library that launches a real Chrome browser (headless — no visible window) and controls it programmatically. It can click buttons, fill forms, scroll pages, and wait for content to load.
When AI uses it: Any page that loads content dynamically, requires interaction (like clicking "Load More"), or needs JavaScript to render. E-commerce sites, dashboards, SPAs.
Limitation: Slower and heavier than Cheerio because it runs an entire browser. Uses more memory and CPU.
Playwright
What it is: Similar to Puppeteer but supports Chromium, Firefox, and WebKit (the engine behind Safari). Built by Microsoft. Often considered the more modern, reliable choice for browser automation.
When AI uses it: Same use cases as Puppeteer, but with better cross-browser support and more robust waiting mechanisms. Increasingly the default choice in newer AI-generated code.
Limitation: Same overhead as Puppeteer — it launches a real browser. Slightly larger install size since it bundles multiple browser engines.
If AI imports `cheerio`, it is parsing static HTML. If it imports `puppeteer` or `playwright`, it is launching a real browser. The second approach is more powerful but heavier, slower, and more likely to trigger anti-bot detection.
The Real Scenario: "Scrape Prices From This Site"
Let's walk through what actually happens when you give an AI this kind of prompt:
Your prompt
"Write a Node.js script that scrapes all product names and prices from https://example-store.com/products and saves them to a CSV file."
Here is a realistic version of what AI generates. This example uses Puppeteer because most product pages load content dynamically:
```javascript
import puppeteer from 'puppeteer';
import { writeFileSync } from 'fs';

async function scrapeProducts() {
  // Launch a headless Chrome browser
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Set a realistic user-agent so the site doesn't block us
  await page.setUserAgent(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );

  // Navigate to the products page and wait for content to load
  await page.goto('https://example-store.com/products', {
    waitUntil: 'networkidle2',
    timeout: 30000
  });

  // Extract product data from the page
  const products = await page.evaluate(() => {
    const items = document.querySelectorAll('.product-card');
    return Array.from(items).map(item => ({
      name: item.querySelector('.product-title')?.textContent?.trim() || 'Unknown',
      price: item.querySelector('.product-price')?.textContent?.trim() || 'N/A'
    }));
  });

  await browser.close();

  // Convert to CSV and save
  const csv = 'Name,Price\n' +
    products.map(p => `"${p.name}","${p.price}"`).join('\n');
  writeFileSync('products.csv', csv);
  console.log(`Scraped ${products.length} products.`);
}

scrapeProducts().catch(console.error);
```
Let's break down what this code is doing, piece by piece:
- `puppeteer.launch()` — Starts an invisible Chrome browser on your machine. This is a real browser, just without a window.
- `page.setUserAgent()` — Pretends to be a normal browser visitor. Without this, many sites detect automated traffic and block it.
- `page.goto()` — Navigates to the URL, like typing it into your address bar. `waitUntil: 'networkidle2'` means "wait until the page mostly stops loading."
- `page.evaluate()` — Runs JavaScript inside the browser page. This is where the actual data extraction happens. It finds all elements matching `.product-card` and pulls out the name and price from each one.
- `writeFileSync()` — Saves the data to a CSV file on your computer.
Notice the async/await pattern throughout. Every browser action takes time — navigating, waiting, extracting — so the code awaits each step before moving on. If you have read our guide on async/await, this pattern should look familiar.
What the Cheerio version looks like
For comparison, here is how AI might write the same scraper using Cheerio for a simpler, static site:
```javascript
import * as cheerio from 'cheerio';

async function scrapeProducts() {
  const response = await fetch('https://example-store.com/products');
  const html = await response.text();

  // Load the HTML into Cheerio for parsing
  const $ = cheerio.load(html);

  const products = [];
  $('.product-card').each((i, element) => {
    products.push({
      name: $(element).find('.product-title').text().trim(),
      price: $(element).find('.product-price').text().trim()
    });
  });

  console.log(`Found ${products.length} products.`);
  return products;
}

scrapeProducts().catch(console.error);
```
See the difference? No browser launch, no user-agent trick, no waiting for dynamic content. Cheerio just grabs the raw HTML and parses it. Faster, simpler, but only works when the data is already in the HTML source.
The Legal and Ethical Reality
This is the section most tutorials skip, and it is the one that matters most. AI will happily generate a scraper for any website you point it at. It will not warn you when scraping that site could create legal problems. That is your job.
What is robots.txt?
Almost every website has a file at /robots.txt that tells automated tools what they are and are not allowed to access. For example, https://example.com/robots.txt might say:
```
User-agent: *
Disallow: /account/
Disallow: /checkout/
Allow: /products/
Crawl-delay: 10
```
This says: "Bots can access /products/ but should stay away from /account/ and /checkout/, and please wait 10 seconds between requests." Robots.txt is not legally binding in itself, but ignoring it shows bad faith and can be used against you in a legal dispute.
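You can also check these rules in code before scraping. Here is a minimal sketch of a robots.txt check, using the sample file above. It only handles `Disallow` prefixes for the `User-agent: *` block, not the full Robots Exclusion Protocol, and `isPathAllowed` is a hypothetical helper, not a library function:

```javascript
// Minimal robots.txt check: does a Disallow rule in the
// "User-agent: *" block cover this path? (Prefix matching only.)
function isPathAllowed(robotsTxt, path) {
  const lines = robotsTxt.split('\n').map(l => l.trim());
  let applies = false; // inside a "User-agent: *" block?
  const disallowed = [];
  for (const line of lines) {
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key)) applies = value === '*';
    else if (applies && /^disallow$/i.test(key) && value) disallowed.push(value);
  }
  return !disallowed.some(prefix => path.startsWith(prefix));
}

const robots = `User-agent: *
Disallow: /account/
Disallow: /checkout/
Allow: /products/
Crawl-delay: 10`;

console.log(isPathAllowed(robots, '/products/'));        // true
console.log(isPathAllowed(robots, '/account/settings')); // false
```

For production use, a maintained robots.txt parser library is a better choice than rolling your own, but the idea is the same: read the rules first, then decide whether to fetch.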
Terms of Service
Most websites include a Terms of Service (ToS) that explicitly prohibits automated data collection. Violating ToS is a breach of contract, and companies like LinkedIn, Meta, and Amazon have sued scrapers over it. The fact that AI wrote the scraper does not change your liability — you deployed it.
When scraping is generally fine
- Public data, personal use — Scraping public web pages for personal research, learning, or one-time data gathering is low risk.
- Your own sites — Scraping your own website to audit content, check links, or extract data is always fine.
- Sites that explicitly allow it — Some sites provide open data or permissive robots.txt files.
- Government and public records — Much government data is explicitly public domain.
When scraping gets risky
- Behind a login wall — Scraping content that requires authentication is almost always a violation.
- At scale against commercial sites — Hammering Amazon with 10,000 requests per hour will get your IP blocked and could invite legal action.
- Reselling scraped data — Collecting data is one thing. Repackaging and selling it is a different legal category.
- Ignoring explicit prohibitions — If robots.txt says `Disallow: /` and the ToS says "no automated access," you are operating at your own risk.
- Copyrighted content — Scraping articles, images, or creative work to republish is a copyright issue regardless of the method.
Before running any scraper: check /robots.txt, skim the Terms of Service, and ask yourself — "Is there an API I could use instead?" If the answer is yes, use the API. APIs are more reliable, more legal, and less likely to break. Scraping is for when no better option exists.
The hiQ v. LinkedIn case
In 2022, the U.S. Ninth Circuit ruled that scraping publicly available LinkedIn data did not violate the Computer Fraud and Abuse Act. This is often cited as "scraping is legal," but the ruling was narrower than that. It said that accessing public data does not constitute unauthorized access under the CFAA. It did not say that scraping is always fine — ToS violations, copyright issues, and state laws still apply. Indeed, the case itself later settled after the district court found that hiQ had breached LinkedIn's user agreement, a reminder that the CFAA is only one of several legal theories in play.
What Can Go Wrong
AI-generated scrapers break constantly. Not because the AI is bad at writing code, but because scraping is inherently fragile. You are writing code that depends on someone else's website staying exactly the same. Here are the failures you will hit most often:
1. The selectors are wrong
AI guesses the CSS selectors based on common patterns like .product-card or .price. But every site structures its HTML differently. If the real class name is .plp-item__price--current, the scraper returns nothing.
The fix: Open the actual page in Chrome, right-click the element you want, choose "Inspect," and look at the real class names. Give those to your AI.
2. The content loads dynamically
You used Cheerio (or plain fetch), but the page loads its data with JavaScript after the initial HTML arrives. The scraper sees an empty page.
The fix: Switch to Puppeteer or Playwright. Tell the AI: "The content loads dynamically with JavaScript. Use Puppeteer and wait for the product elements to appear before extracting data."
3. The site blocks you
Sites detect automated traffic through missing headers, unusual request patterns, no cookies, or known bot user-agent strings. When detected, they return a block page, a CAPTCHA, or a 403 error instead of the real content.
The fix: Set a realistic user-agent, add reasonable delays between requests (2-5 seconds), and do not hammer the site with hundreds of requests per minute. If Cloudflare blocks you, that is the site telling you to stop.
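A polite delay is only a few lines of code. A minimal sketch of a delay helper, with a little random jitter so requests do not arrive on a perfectly regular clock (machine-regular timing is itself a bot signal); `politeDelay` is a hypothetical name, not a library function:

```javascript
// Wait a base number of milliseconds plus random jitter,
// so request timing doesn't look machine-regular.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeDelay(baseMs = 2000, jitterMs = 1000) {
  const wait = baseMs + Math.random() * jitterMs;
  await sleep(wait);
  return wait; // actual milliseconds waited
}

// Usage inside a scraping loop:
// for (const url of urls) {
//   await scrapePage(url);
//   await politeDelay(); // 2-3 seconds between requests
// }
```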
4. The site structure changes
Your scraper worked perfectly on Monday. On Thursday the site redesigned their product page, and every selector broke. This is the fundamental fragility of scraping — you depend on HTML that someone else controls.
The fix: Accept that scrapers require maintenance. Use broad selectors when possible, add error handling for missing elements, and log warnings when expected data is absent.
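The error-handling part of that fix is mostly about not trusting any single selector. A sketch of the pattern, with plain objects standing in for DOM elements so the idea is visible without a browser (`extractField` and the mock data are made up for illustration):

```javascript
// Extract a field, fall back to a default, and warn when the
// expected element is missing instead of failing silently.
function extractField(getValue, fieldName, fallback = 'N/A') {
  try {
    const value = getValue();
    if (value === undefined || value === null || value === '') {
      console.warn(`Warning: "${fieldName}" missing -- selector may be stale`);
      return fallback;
    }
    return String(value).trim();
  } catch {
    console.warn(`Warning: "${fieldName}" threw -- selector may be stale`);
    return fallback;
  }
}

// Mock "elements": the second one is missing its price.
const items = [
  { title: ' Blue Mug ', price: '$12.99' },
  { title: 'Red Mug', price: null },
];

const products = items.map(item => ({
  name: extractField(() => item.title, 'name'),
  price: extractField(() => item.price, 'price'),
}));
// products[1].price is 'N/A', and a warning was logged --
// you find out the selector broke instead of silently saving bad data.
```

In a real Puppeteer scraper the `getValue` callbacks would be the `querySelector` lookups; the point is that every field gets a fallback and a warning rather than crashing or returning nothing.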
5. Rate limiting and IP bans
Making too many requests too fast is the quickest way to get blocked. Some sites return HTTP 429 (Too Many Requests). Others silently serve empty or misleading data. Some ban your IP entirely.
The fix: Always add delays between requests. A respectful scraper waits at least 1-2 seconds between pages. If the site has a Crawl-delay in robots.txt, respect it.
6. AI does not handle pagination
The product list has 50 pages, but AI only scraped the first one. This is extremely common — AI writes a scraper for a single URL and does not think about "Next" buttons or paginated results.
The fix: Tell the AI explicitly: "The results are paginated. Follow the pagination links and scrape all pages, not just the first one. Add a delay between page requests."
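The pagination pattern itself is just a loop: scrape a page, find the link to the next one, wait, then repeat until there is no next link. A sketch with an in-memory stand-in for the site, so the loop is visible without a network (`fetchPage` and the page data are made up for illustration; in a real scraper that call would be Puppeteer or `fetch`):

```javascript
// Fake "site": three pages, each pointing at the next.
const fakeSite = {
  '/products?page=1': { items: ['A', 'B'], next: '/products?page=2' },
  '/products?page=2': { items: ['C', 'D'], next: '/products?page=3' },
  '/products?page=3': { items: ['E'], next: null },
};

// Stand-in for a real page fetch (page.goto + extraction).
async function fetchPage(url) {
  return fakeSite[url];
}

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

async function scrapeAllPages(startUrl, delayMs = 10) {
  const all = [];
  let url = startUrl;
  while (url) {
    const page = await fetchPage(url);
    all.push(...page.items);
    url = page.next;               // null on the last page ends the loop
    if (url) await sleep(delayMs); // be polite between page requests
  }
  return all;
}

const items = await scrapeAllPages('/products?page=1');
console.log(items); // ['A', 'B', 'C', 'D', 'E']
```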
How to Debug a Broken Scraper
Debugging scrapers follows a specific pattern that is different from debugging most other code. The issue is almost never a syntax error — it is a mismatch between what AI assumed about the page and what the page actually looks like.
Step 1: See what the scraper actually receives
Before doing anything else, save the raw HTML your scraper downloads and open it in a text editor. Is the content there? If you see the product data in the HTML, the problem is your selectors. If you see an empty shell, a CAPTCHA page, or a "Please enable JavaScript" message, the problem is earlier in the pipeline.
```javascript
// Add this to your scraper to save the raw HTML
import { writeFileSync } from 'fs';

const html = await page.content();
writeFileSync('debug-page.html', html);
console.log('Saved raw HTML to debug-page.html');
```
Step 2: Test your selectors in the browser
Open the real website in Chrome. Open DevTools (F12 or Cmd+Option+I). Go to the Console tab and type:
```javascript
document.querySelectorAll('.product-card').length
```
If this returns 0, that selector does not match anything on the page. Use the Elements tab to find the correct selectors, then give them to your AI.
Step 3: Run headful (visible browser)
If you are using Puppeteer or Playwright, switch to headful mode so you can see what the browser actually does:
```javascript
const browser = await puppeteer.launch({
  headless: false, // Show the browser window
  slowMo: 100      // Slow down actions so you can see them
});
```
This lets you watch the scraper navigate, click, and extract. You will immediately see if it is landing on a CAPTCHA page, a cookie consent dialog, or a different page than expected.
Step 4: Give AI the right debugging prompt
When your scraper fails, this prompt pattern works well:
Debugging prompt template
"My scraper returns 0 products. Here is the code: [paste code]. Here is the raw HTML the scraper receives: [paste first 200 lines of debug-page.html]. The actual product elements on the page use the class [paste real class from DevTools]. Update the selectors to match the real page structure. Keep the existing delay and error handling."
That prompt gives the AI everything it needs: the code, the actual HTML, the correct selectors, and constraints to prevent it from rewriting your entire script.
Step 5: Add logging everywhere
Scrapers fail silently far too often. Add console.log statements at every stage:
```javascript
console.log('Navigating to page...');
await page.goto(url);
console.log('Page loaded. Extracting data...');

const products = await page.evaluate(() => {
  const items = document.querySelectorAll('.product-card');
  console.log(`Found ${items.length} items on page`); // This logs in the BROWSER console
  return Array.from(items).map(/* ... */);
});

console.log(`Extracted ${products.length} products`); // This logs in YOUR terminal
```
Note the subtle difference: console.log inside page.evaluate() runs in the browser's console (visible in DevTools), while console.log outside it runs in your Node.js terminal.
When to Use an API Instead
Here is the honest truth that scraping tutorials rarely lead with: if there is an API, use the API. Always.
A REST API returns clean, structured JSON data. It is designed for programmatic access. It does not break when the website gets redesigned. It does not trigger anti-bot protection. It does not raise legal questions. And it is almost always faster.
Before you let AI write a scraper, try this prompt first:
Ask first
"Does [website name] have a public API for [the data I need]? If yes, show me how to use it. If no, then write a scraper."
You would be surprised how often there is an API hiding behind the scenes. Weather data, stock prices, government records, social media metrics — all of these have dedicated APIs that are more reliable than scraping.
The Ethical Scraper Checklist
If you do need to scrape, here is a quick checklist to keep yourself in good standing:
- ✅ Check `robots.txt` before writing a single line of code
- ✅ Skim the Terms of Service for anti-scraping language
- ✅ Look for an API first — it is almost always the better path
- ✅ Add delays between requests (minimum 1-2 seconds)
- ✅ Respect `Crawl-delay` directives
- ✅ Identify your scraper honestly (or at least do not impersonate real users at scale)
- ✅ Handle errors gracefully — if you get a 429, back off, do not retry harder
- ✅ Store only the data you need, not entire page copies
- ❌ Do not scrape behind login walls
- ❌ Do not resell or republish scraped content without permission
- ❌ Do not ignore explicit "do not scrape" signals
- ❌ Do not overwhelm small sites — your scraper could crash a small business's server
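The "if you get a 429, back off, do not retry harder" rule from the checklist can be sketched as exponential backoff: after each 429, double the wait before trying again, and give up after a few attempts. The mock `request` function here simulates a server that rate-limits the first two attempts (in a real scraper it would be your actual HTTP call):

```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Mock server: rate-limits the first two attempts, then succeeds.
let attempts = 0;
async function request() {
  attempts += 1;
  return attempts <= 2 ? { status: 429 } : { status: 200, body: 'ok' };
}

// On 429, wait and double the delay each retry instead of hammering.
async function fetchWithBackoff(maxRetries = 5, baseDelayMs = 10) {
  let delay = baseDelayMs;
  for (let i = 0; i <= maxRetries; i += 1) {
    const res = await request();
    if (res.status !== 429) return res;
    console.log(`Got 429, backing off ${delay}ms`);
    await sleep(delay);
    delay *= 2; // exponential backoff
  }
  throw new Error('Rate limited: giving up after retries');
}

const res = await fetchWithBackoff();
console.log(res.status); // 200 (after two 429s and two waits)
```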
What to Learn Next
Web scraping connects to several other concepts that are worth understanding as a vibe coder:
- What Is HTML? — Understanding HTML structure is what makes scraping possible. Selectors, tags, and attributes are the language of data extraction.
- What Is JavaScript? — Most scrapers are written in JavaScript. Knowing the basics helps you read and debug what AI generates.
- What Is Node.js? — Scrapers run in Node.js, not the browser. Understanding the runtime helps you install packages, run scripts, and manage dependencies.
- What Is a REST API? — The alternative to scraping. When structured data is available through an API, it is always the better option.
- What Is Async/Await? — Every scraper uses async patterns heavily. Understanding await is essential for debugging scraper timing issues.