TL;DR
Web scraping is automated data collection from websites. When you ask AI to grab prices, listings, or content from a site, it generates a scraper — a script that visits pages, reads the HTML, and extracts the data you want. It is powerful, common, and sometimes legally risky.
Why AI Coders Need to Know This
Here is a pattern that plays out constantly in vibe coding: you are building a side project — maybe a price tracker, a job board aggregator, or a tool that monitors competitor inventory. You tell Claude or Cursor something like "pull all the product names and prices from this URL and save them to a CSV." Within seconds, the AI hands you a fully working script.
That script is a web scraper. And unless you understand what it does, you are running code that visits other people's websites on autopilot, potentially hundreds or thousands of times, without knowing whether that is fine or whether it is going to get your IP address blocked, your hosting account suspended, or worse.
Web scraping sits at the intersection of several things vibe coders already work with: JavaScript, Node.js, HTML structure, and async/await patterns. AI generates scrapers so fluently that it is easy to forget you are deploying an automated bot against someone else's infrastructure. This article gives you the knowledge to use scraping responsibly and fix it when it breaks.
What Web Scraping Actually Does
At its core, web scraping is straightforward. A scraper does three things:
- Fetches a web page — just like your browser does when you type a URL, but in code.
- Reads the HTML — the raw structure of the page, including all the tags, classes, and text content.
- Extracts specific data — pulls out the pieces you care about (prices, titles, dates, links) and ignores the rest.
Think of it like this: when you visit a store's website and manually copy prices into a spreadsheet, that is scraping — done by hand. A web scraper automates that process so you can collect data from hundreds or thousands of pages without clicking through each one yourself.
The difference between scraping and using an API is important. An API is a front door — the website deliberately built it so programs can request data in a clean, structured format. Scraping is more like reading the menu posted in the window. You are getting the information from the public-facing page, but you are doing it in a way the site may not have intended for automated tools.
The three tools AI reaches for
When you ask AI to scrape something, it almost always picks one of three Node.js libraries. Each one works differently, and knowing which is which saves you hours of confusion.
Cheerio
What it is: A fast, lightweight HTML parser. It downloads the raw HTML of a page and lets you search through it using CSS selectors — the same selectors you use in stylesheets.
When AI uses it: Simple, static pages where the content is already in the HTML when it loads. Blog posts, news articles, product pages on older sites.
Limitation: If the page loads content with JavaScript after the initial load (like most modern React or Vue apps), Cheerio sees nothing — just an empty shell.
Puppeteer
What it is: A Node.js library that launches a real Chrome browser (headless — no visible window) and controls it programmatically. It can click buttons, fill forms, scroll pages, and wait for content to load.
When AI uses it: Any page that loads content dynamically, requires interaction (like clicking "Load More"), or needs JavaScript to render. E-commerce sites, dashboards, SPAs.
Limitation: Slower and heavier than Cheerio because it runs an entire browser. Uses more memory and CPU.
Playwright
What it is: Similar to Puppeteer but supports Chromium, Firefox, and WebKit (the engine behind Safari). Built by Microsoft. Often considered the more modern, reliable choice for browser automation.
When AI uses it: Same use cases as Puppeteer, but with better cross-browser support and more robust waiting mechanisms. Increasingly the default choice in newer AI-generated code.
Limitation: Same overhead as Puppeteer — it launches a real browser. Slightly larger install size since it bundles multiple browser engines.
If AI imports `cheerio`, it is parsing static HTML. If it imports `puppeteer` or `playwright`, it is launching a real browser. The second approach is more powerful but heavier, slower, and more likely to trigger anti-bot detection.
The Real Scenario: "Scrape Prices From This Site"
Let's walk through what actually happens when you give an AI this kind of prompt:
Your prompt
"Write a Node.js script that scrapes all product names and prices from https://example-store.com/products and saves them to a CSV file."
Here is a realistic version of what AI generates. This example uses Puppeteer because most product pages load content dynamically:
```javascript
import puppeteer from 'puppeteer';
import { writeFileSync } from 'fs';

async function scrapeProducts() {
  // Launch a headless Chrome browser
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Set a realistic user-agent so the site doesn't block us
  await page.setUserAgent(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );

  // Navigate to the products page and wait for content to load
  await page.goto('https://example-store.com/products', {
    waitUntil: 'networkidle2',
    timeout: 30000
  });

  // Extract product data from the page
  const products = await page.evaluate(() => {
    const items = document.querySelectorAll('.product-card');
    return Array.from(items).map(item => ({
      name: item.querySelector('.product-title')?.textContent?.trim() || 'Unknown',
      price: item.querySelector('.product-price')?.textContent?.trim() || 'N/A'
    }));
  });

  await browser.close();

  // Convert to CSV and save
  const csv = 'Name,Price\n' +
    products.map(p => `"${p.name}","${p.price}"`).join('\n');
  writeFileSync('products.csv', csv);
  console.log(`Scraped ${products.length} products.`);
}

scrapeProducts().catch(console.error);
```
Let's break down what this code is doing, piece by piece:
- `puppeteer.launch()` — Starts an invisible Chrome browser on your machine. This is a real browser, just without a window.
- `page.setUserAgent()` — Pretends to be a normal browser visitor. Without this, many sites detect automated traffic and block it.
- `page.goto()` — Navigates to the URL, like typing it into your address bar. `waitUntil: 'networkidle2'` means "wait until the page mostly stops loading."
- `page.evaluate()` — Runs JavaScript inside the browser page. This is where the actual data extraction happens. It finds all elements matching `.product-card` and pulls out the name and price from each one.
- `writeFileSync()` — Saves the data to a CSV file on your computer.
Notice the async/await pattern throughout. Every browser action takes time — navigating, waiting, extracting — so the code awaits each step before moving on. If you have read our guide on async/await, this pattern should look familiar.
What the Cheerio version looks like
For comparison, here is how AI might write the same scraper using Cheerio for a simpler, static site:
```javascript
import * as cheerio from 'cheerio';

async function scrapeProducts() {
  const response = await fetch('https://example-store.com/products');
  const html = await response.text();

  // Load the HTML into Cheerio for parsing
  const $ = cheerio.load(html);

  const products = [];
  $('.product-card').each((i, element) => {
    products.push({
      name: $(element).find('.product-title').text().trim(),
      price: $(element).find('.product-price').text().trim()
    });
  });

  console.log(`Found ${products.length} products.`);
  return products;
}

scrapeProducts().catch(console.error);
```
See the difference? No browser launch, no user-agent trick, no waiting for dynamic content. Cheerio just grabs the raw HTML and parses it. Faster, simpler, but only works when the data is already in the HTML source.
The Legal and Ethical Reality
This is the section most tutorials skip, and it is the one that matters most. AI will happily generate a scraper for any website you point it at. It will not warn you when scraping that site could create legal problems. That is your job.
What is robots.txt?
Almost every website has a file at /robots.txt that tells automated tools what they are and are not allowed to access. For example, https://example.com/robots.txt might say:
```
User-agent: *
Disallow: /account/
Disallow: /checkout/
Allow: /products/
Crawl-delay: 10
```
This says: "Bots can access /products/ but should stay away from /account/ and /checkout/, and please wait 10 seconds between requests." Robots.txt is not legally binding in itself, but ignoring it shows bad faith and can be used against you in a legal dispute.
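You can also check these rules in code before scraping. Here is a minimal sketch of a robots.txt check, using the sample file above. It only handles `Disallow` prefixes for the `User-agent: *` block, not the full Robots Exclusion Protocol, and `isPathAllowed` is a hypothetical helper, not a library function:

```javascript
// Minimal robots.txt check: does a Disallow rule in the
// "User-agent: *" block cover this path? (Prefix matching only.)
function isPathAllowed(robotsTxt, path) {
  const lines = robotsTxt.split('\n').map(l => l.trim());
  let applies = false; // inside a "User-agent: *" block?
  const disallowed = [];
  for (const line of lines) {
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key)) applies = value === '*';
    else if (applies && /^disallow$/i.test(key) && value) disallowed.push(value);
  }
  return !disallowed.some(prefix => path.startsWith(prefix));
}

const robots = `User-agent: *
Disallow: /account/
Disallow: /checkout/
Allow: /products/
Crawl-delay: 10`;

console.log(isPathAllowed(robots, '/products/'));        // true
console.log(isPathAllowed(robots, '/account/settings')); // false
```

For production use, a maintained robots.txt parser library is a better choice than rolling your own, but the idea is the same: read the rules first, then decide whether to fetch.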
Terms of Service
Most websites include a Terms of Service (ToS) that explicitly prohibits automated data collection. Violating ToS is a breach of contract, and companies like LinkedIn, Meta, and Amazon have sued scrapers over it. The fact that AI wrote the scraper does not change your liability — you deployed it.
When scraping is generally fine
- Public data, personal use — Scraping public web pages for personal research, learning, or one-time data gathering is low risk.
- Your own sites — Scraping your own website to audit content, check links, or extract data is always fine.
- Sites that explicitly allow it — Some sites provide open data or permissive robots.txt files.
- Government and public records — Much government data is explicitly public domain.
When scraping gets risky
- Behind a login wall — Scraping content that requires authentication is almost always a violation.
- At scale against commercial sites — Hammering Amazon with 10,000 requests per hour will get your IP blocked and could invite legal action.
- Reselling scraped data — Collecting data is one thing. Repackaging and selling it is a different legal category.
- Ignoring explicit prohibitions — If robots.txt says `Disallow: /` and the ToS says "no automated access," you are operating at your own risk.
- Copyrighted content — Scraping articles, images, or creative work to republish is a copyright issue regardless of the method.
Before running any scraper: check /robots.txt, skim the Terms of Service, and ask yourself — "Is there an API I could use instead?" If the answer is yes, use the API. APIs are more reliable, more legal, and less likely to break. Scraping is for when no better option exists.
The hiQ v. LinkedIn case
In 2022, the U.S. Ninth Circuit ruled that scraping publicly available LinkedIn data did not violate the Computer Fraud and Abuse Act. This is often cited as "scraping is legal," but the ruling was narrower than that. It said that accessing public data does not constitute unauthorized access under the CFAA. It did not say that scraping is always fine — ToS violations, copyright issues, and state laws still apply. Indeed, the case itself later settled after the district court found that hiQ had breached LinkedIn's user agreement, a reminder that the CFAA is only one of several legal theories in play.
What Can Go Wrong
AI-generated scrapers break constantly. Not because the AI is bad at writing code, but because scraping is inherently fragile. You are writing code that depends on someone else's website staying exactly the same. Here are the failures you will hit most often:
1. The selectors are wrong
AI guesses the CSS selectors based on common patterns like .product-card or .price. But every site structures its HTML differently. If the real class name is .plp-item__price--current, the scraper returns nothing.
The fix: Open the actual page in Chrome, right-click the element you want, choose "Inspect," and look at the real class names. Give those to your AI.
2. The content loads dynamically
You used Cheerio (or plain fetch), but the page loads its data with JavaScript after the initial HTML arrives. The scraper sees an empty page.
The fix: Switch to Puppeteer or Playwright. Tell the AI: "The content loads dynamically with JavaScript. Use Puppeteer and wait for the product elements to appear before extracting data."
3. The site blocks you
Sites detect automated traffic through missing headers, unusual request patterns, no cookies, or known bot user-agent strings. When detected, they return a block page, a CAPTCHA, or a 403 error instead of the real content.
The fix: Set a realistic user-agent, add reasonable delays between requests (2-5 seconds), and do not hammer the site with hundreds of requests per minute. If Cloudflare blocks you, that is the site telling you to stop.
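A polite delay is only a few lines of code. A minimal sketch of a delay helper, with a little random jitter so requests do not arrive on a perfectly regular clock (machine-regular timing is itself a bot signal); `politeDelay` is a hypothetical name, not a library function:

```javascript
// Wait a base number of milliseconds plus random jitter,
// so request timing doesn't look machine-regular.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeDelay(baseMs = 2000, jitterMs = 1000) {
  const wait = baseMs + Math.random() * jitterMs;
  await sleep(wait);
  return wait; // actual milliseconds waited
}

// Usage inside a scraping loop:
// for (const url of urls) {
//   await scrapePage(url);
//   await politeDelay(); // 2-3 seconds between requests
// }
```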
4. The site structure changes
Your scraper worked perfectly on Monday. On Thursday the site redesigned their product page, and every selector broke. This is the fundamental fragility of scraping — you depend on HTML that someone else controls.
The fix: Accept that scrapers require maintenance. Use broad selectors when possible, add error handling for missing elements, and log warnings when expected data is absent.
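The error-handling part of that fix is mostly about not trusting any single selector. A sketch of the pattern, with plain objects standing in for DOM elements so the idea is visible without a browser (`extractField` and the mock data are made up for illustration):

```javascript
// Extract a field, fall back to a default, and warn when the
// expected element is missing instead of failing silently.
function extractField(getValue, fieldName, fallback = 'N/A') {
  try {
    const value = getValue();
    if (value === undefined || value === null || value === '') {
      console.warn(`Warning: "${fieldName}" missing -- selector may be stale`);
      return fallback;
    }
    return String(value).trim();
  } catch {
    console.warn(`Warning: "${fieldName}" threw -- selector may be stale`);
    return fallback;
  }
}

// Mock "elements": the second one is missing its price.
const items = [
  { title: ' Blue Mug ', price: '$12.99' },
  { title: 'Red Mug', price: null },
];

const products = items.map(item => ({
  name: extractField(() => item.title, 'name'),
  price: extractField(() => item.price, 'price'),
}));
// products[1].price is 'N/A', and a warning was logged --
// you find out the selector broke instead of silently saving bad data.
```

In a real Puppeteer scraper the `getValue` callbacks would be the `querySelector` lookups; the point is that every field gets a fallback and a warning rather than crashing or returning nothing.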
5. Rate limiting and IP bans
Making too many requests too fast is the quickest way to get blocked. Some sites return HTTP 429 (Too Many Requests). Others silently serve empty or misleading data. Some ban your IP entirely.
The fix: Always add delays between requests. A respectful scraper waits at least 1-2 seconds between pages. If the site has a Crawl-delay in robots.txt, respect it.
6. AI does not handle pagination
The product list has 50 pages, but AI only scraped the first one. This is extremely common — AI writes a scraper for a single URL and does not think about "Next" buttons or paginated results.
The fix: Tell the AI explicitly: "The results are paginated. Follow the pagination links and scrape all pages, not just the first one. Add a delay between page requests."
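The pagination pattern itself is just a loop: scrape a page, find the link to the next one, wait, then repeat until there is no next link. A sketch with an in-memory stand-in for the site, so the loop is visible without a network (`fetchPage` and the page data are made up for illustration; in a real scraper that call would be Puppeteer or `fetch`):

```javascript
// Fake "site": three pages, each pointing at the next.
const fakeSite = {
  '/products?page=1': { items: ['A', 'B'], next: '/products?page=2' },
  '/products?page=2': { items: ['C', 'D'], next: '/products?page=3' },
  '/products?page=3': { items: ['E'], next: null },
};

// Stand-in for a real page fetch (page.goto + extraction).
async function fetchPage(url) {
  return fakeSite[url];
}

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

async function scrapeAllPages(startUrl, delayMs = 10) {
  const all = [];
  let url = startUrl;
  while (url) {
    const page = await fetchPage(url);
    all.push(...page.items);
    url = page.next;               // null on the last page ends the loop
    if (url) await sleep(delayMs); // be polite between page requests
  }
  return all;
}

const items = await scrapeAllPages('/products?page=1');
console.log(items); // ['A', 'B', 'C', 'D', 'E']
```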
How to Debug a Broken Scraper
Debugging scrapers follows a specific pattern that is different from debugging most other code. The issue is almost never a syntax error — it is a mismatch between what AI assumed about the page and what the page actually looks like.
Step 1: See what the scraper actually receives
Before doing anything else, save the raw HTML your scraper downloads and open it in a text editor. Is the content there? If you see the product data in the HTML, the problem is your selectors. If you see an empty shell, a CAPTCHA page, or a "Please enable JavaScript" message, the problem is earlier in the pipeline.
```javascript
// Add this to your scraper to save the raw HTML
import { writeFileSync } from 'fs';

const html = await page.content();
writeFileSync('debug-page.html', html);
console.log('Saved raw HTML to debug-page.html');
```
Step 2: Test your selectors in the browser
Open the real website in Chrome. Open DevTools (F12 or Cmd+Option+I). Go to the Console tab and type:
```javascript
document.querySelectorAll('.product-card').length
```
If this returns 0, that selector does not match anything on the page. Use the Elements tab to find the correct selectors, then give them to your AI.
Step 3: Run headful (visible browser)
If you are using Puppeteer or Playwright, switch to headful mode so you can see what the browser actually does:
```javascript
const browser = await puppeteer.launch({
  headless: false, // Show the browser window
  slowMo: 100      // Slow down actions so you can see them
});
```
This lets you watch the scraper navigate, click, and extract. You will immediately see if it is landing on a CAPTCHA page, a cookie consent dialog, or a different page than expected.
Step 4: Give AI the right debugging prompt
When your scraper fails, this prompt pattern works well:
Debugging prompt template
"My scraper returns 0 products. Here is the code: [paste code]. Here is the raw HTML the scraper receives: [paste first 200 lines of debug-page.html]. The actual product elements on the page use the class [paste real class from DevTools]. Update the selectors to match the real page structure. Keep the existing delay and error handling."
That prompt gives the AI everything it needs: the code, the actual HTML, the correct selectors, and constraints to prevent it from rewriting your entire script.
Step 5: Add logging everywhere
Scrapers fail silently far too often. Add console.log statements at every stage:
```javascript
console.log('Navigating to page...');
await page.goto(url);
console.log('Page loaded. Extracting data...');

const products = await page.evaluate(() => {
  const items = document.querySelectorAll('.product-card');
  console.log(`Found ${items.length} items on page`); // This logs in the BROWSER console
  return Array.from(items).map(/* ... */);
});

console.log(`Extracted ${products.length} products`); // This logs in YOUR terminal
```
Note the subtle difference: console.log inside page.evaluate() runs in the browser's console (visible in DevTools), while console.log outside it runs in your Node.js terminal.
When to Use an API Instead
Here is the honest truth that scraping tutorials rarely lead with: if there is an API, use the API. Always.
A REST API returns clean, structured JSON data. It is designed for programmatic access. It does not break when the website gets redesigned. It does not trigger anti-bot protection. It does not raise legal questions. And it is almost always faster.
Before you let AI write a scraper, try this prompt first:
Ask first
"Does [website name] have a public API for [the data I need]? If yes, show me how to use it. If no, then write a scraper."
You would be surprised how often there is an API hiding behind the scenes. Weather data, stock prices, government records, social media metrics — all of these have dedicated APIs that are more reliable than scraping.
The Ethical Scraper Checklist
If you do need to scrape, here is a quick checklist to keep yourself in good standing:
- ✅ Check `robots.txt` before writing a single line of code
- ✅ Skim the Terms of Service for anti-scraping language
- ✅ Look for an API first — it is almost always the better path
- ✅ Add delays between requests (minimum 1-2 seconds)
- ✅ Respect `Crawl-delay` directives
- ✅ Identify your scraper honestly (or at least do not impersonate real users at scale)
- ✅ Handle errors gracefully — if you get a 429, back off, do not retry harder
- ✅ Store only the data you need, not entire page copies
- ❌ Do not scrape behind login walls
- ❌ Do not resell or republish scraped content without permission
- ❌ Do not ignore explicit "do not scrape" signals
- ❌ Do not overwhelm small sites — your scraper could crash a small business's server
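The "if you get a 429, back off, do not retry harder" rule from the checklist can be sketched as exponential backoff: after each 429, double the wait before trying again, and give up after a few attempts. The mock `request` function here simulates a server that rate-limits the first two attempts (in a real scraper it would be your actual HTTP call):

```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Mock server: rate-limits the first two attempts, then succeeds.
let attempts = 0;
async function request() {
  attempts += 1;
  return attempts <= 2 ? { status: 429 } : { status: 200, body: 'ok' };
}

// On 429, wait and double the delay each retry instead of hammering.
async function fetchWithBackoff(maxRetries = 5, baseDelayMs = 10) {
  let delay = baseDelayMs;
  for (let i = 0; i <= maxRetries; i += 1) {
    const res = await request();
    if (res.status !== 429) return res;
    console.log(`Got 429, backing off ${delay}ms`);
    await sleep(delay);
    delay *= 2; // exponential backoff
  }
  throw new Error('Rate limited: giving up after retries');
}

const res = await fetchWithBackoff();
console.log(res.status); // 200 (after two 429s and two waits)
```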
What to Learn Next
Web scraping connects to several other concepts that are worth understanding as a vibe coder:
- What Is HTML? — Understanding HTML structure is what makes scraping possible. Selectors, tags, and attributes are the language of data extraction.
- What Is JavaScript? — Most scrapers are written in JavaScript. Knowing the basics helps you read and debug what AI generates.
- What Is Node.js? — Scrapers run in Node.js, not the browser. Understanding the runtime helps you install packages, run scripts, and manage dependencies.
- What Is a REST API? — The alternative to scraping. When structured data is available through an API, it is always the better option.
- What Is Async/Await? — Every scraper uses async patterns heavily. Understanding await is essential for debugging scraper timing issues.