TL;DR: AI can build a working web scraper from a single well-crafted prompt. You need Python installed, a few packages (requests, beautifulsoup4), and a target website. This guide gives you the prompts, the full generated script with inline comments, how to handle pagination and JavaScript-heavy sites, and the legal/ethical rules you must follow before scraping anything.
What You Will Build
By the end of this guide you will have a working Python scraper that:
- Fetches a public product listing page (we'll use books.toscrape.com — a site built specifically for scraping practice)
- Extracts each book's title, price, and star rating
- Saves the results to a .csv file you can open in Excel or Google Sheets
- Handles multiple pages automatically
The final script is about 60 lines of Python. You will not write it — you will prompt AI to write it, then read it to understand what's happening. That reading step is non-negotiable. You should never run code you cannot explain, especially code that makes network requests.
Before You Start
Make sure Python 3.10+ is installed on your machine. Open a terminal and type python --version or python3 --version. If you see a version number, you're good. If not, download Python from python.org. You'll also need pip to install the required packages.
The AI Prompts That Build It
This is the core value of this guide. Most tutorials give you the finished code and explain it. That does not help you when you need to scrape a different site. What helps is knowing how to prompt AI so you can generate a scraper for any target.
Here is the exact sequence of prompts to use. Run them in Claude, Cursor, or Windsurf — they all work.
Prompt 1: The Foundation Script
Prompt to Copy and Use
I want to build a Python web scraper. Here are the requirements:
Target site: https://books.toscrape.com
Data to extract from each book listing:
- Title
- Price (as a number, strip the £ symbol)
- Star rating (convert "One", "Two", "Three" etc. to 1, 2, 3)
Output: Save all results to a CSV file called books.csv with columns: title, price, rating
Requirements:
- Use the requests library to fetch the page
- Use BeautifulSoup to parse the HTML
- Add a 1-second delay between requests to be polite
- Add error handling for failed requests
- Add inline comments explaining what each section does
- Only scrape the first page for now
Please write the complete script.
That prompt is specific about the target, the data fields, the output format, the libraries, and the behavior. Vague prompts get vague code. The more context you give AI, the less debugging you do later.
Prompt 2: Inspect the Target First
Before running the scraper, you need the actual CSS selectors for the site. Open the target URL in your browser, right-click on a book title, and select "Inspect." You'll see the HTML structure. Then give AI this follow-up:
Prompt to Copy and Use
Here is the HTML for a single book card on books.toscrape.com:
<article class="product_pod">
<div class="image_container">...</div>
<p class="star-rating Three"></p>
<h3><a href="catalogue/..." title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
...
</div>
</article>
Update the scraper to use the correct CSS selectors to extract:
1. Title from the <a> tag's "title" attribute inside the <h3>
2. Price from the <p class="price_color"> tag
3. Star rating from the second class on <p class="star-rating">
Show me the updated extraction code only — I'll drop it into the script.
Prompt 3: Install the Dependencies
Prompt to Copy and Use
What pip install commands do I need to run before this scraper will work?
Give me the exact terminal commands for macOS/Linux and Windows.
AI will respond with: pip install requests beautifulsoup4. Simple. Run that in your terminal first.
What AI Generated: The Full Script
Below is the complete scraper Claude produces from Prompt 1, with the selector corrections from Prompt 2 applied. Read through the comments — they are your primary learning tool here.
import requests
from bs4 import BeautifulSoup
import csv
import time

# The base URL of the site we are scraping
BASE_URL = "https://books.toscrape.com"
START_URL = "https://books.toscrape.com/catalogue/page-1.html"

# Map word-form ratings to integers
RATING_MAP = {
    "One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5
}

def fetch_page(url):
    """
    Fetch a single page and return the response text.
    Returns None if the request fails.
    """
    headers = {
        # Identify ourselves as a browser to avoid simple bot blocks
        "User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise an error for 4xx/5xx status codes
        return response.text
    except requests.RequestException as e:
        print(f"  Request failed for {url}: {e}")
        return None

def parse_books(html):
    """
    Parse the HTML of one listing page and return a list of book dicts.
    Each dict has keys: title, price, rating.
    """
    soup = BeautifulSoup(html, "html.parser")
    books = []
    # Each book card is wrapped in an <article class="product_pod">
    for article in soup.select("article.product_pod"):
        # --- Title ---
        # The full title lives in the "title" attribute of the <a> inside <h3>
        # (the visible text is truncated; the attribute is always complete)
        title_tag = article.select_one("h3 a")
        title = title_tag["title"] if title_tag else "Unknown"

        # --- Price ---
        # Price is in <p class="price_color"> and includes a £ symbol
        price_tag = article.select_one("p.price_color")
        if price_tag:
            # Strip the £ sign and whitespace, then convert to float.
            # The stray "Â" appears when the response is decoded with
            # the wrong charset, so we strip it too.
            price = float(price_tag.text.strip().replace("£", "").replace("Â", ""))
        else:
            price = 0.0

        # --- Star Rating ---
        # The rating is encoded as a second CSS class on <p class="star-rating">
        # e.g. <p class="star-rating Three"> means 3 stars
        rating_tag = article.select_one("p.star-rating")
        if rating_tag:
            # Get all classes, remove "star-rating", the remainder is the word
            classes = rating_tag.get("class", [])
            word = next((c for c in classes if c != "star-rating"), "Zero")
            rating = RATING_MAP.get(word, 0)
        else:
            rating = 0

        books.append({
            "title": title,
            "price": price,
            "rating": rating,
        })
    return books

def get_next_page_url(html, current_url):
    """
    Find the URL of the next page, or return None if we're on the last page.
    """
    soup = BeautifulSoup(html, "html.parser")
    next_btn = soup.select_one("li.next a")
    if not next_btn:
        return None  # no next button = last page
    # The href is relative to the catalogue directory
    next_href = next_btn["href"]
    return BASE_URL + "/catalogue/" + next_href

def save_to_csv(books, filename="books.csv"):
    """
    Write a list of book dicts to a CSV file.
    """
    if not books:
        print("No books to save.")
        return
    fieldnames = ["title", "price", "rating"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(books)
    print(f"Saved {len(books)} books to {filename}")
def main():
    all_books = []
    url = START_URL
    print("Starting scrape of books.toscrape.com...")
    while url:
        print(f"  Fetching: {url}")
        html = fetch_page(url)
        if html is None:
            print("  Stopping — failed to fetch page.")
            break

        # Extract books from this page
        page_books = parse_books(html)
        all_books.extend(page_books)
        print(f"  Found {len(page_books)} books (total so far: {len(all_books)})")

        # Find the next page URL — returns None when we hit the last page
        url = get_next_page_url(html, url)
        if url:
            # Be polite: wait 1 second between requests
            time.sleep(1)

    save_to_csv(all_books)
    print("Done.")

if __name__ == "__main__":
    main()
Run this with python scraper.py (or python3 scraper.py) and after about two minutes — there are 50 pages — you will have a books.csv file with 1,000 rows.
Understanding Each Part
HTTP Requests: Fetching the Page
The requests library is how Python talks to web servers. When you call requests.get(url), Python opens a TCP connection to the server, sends an HTTP GET request, and receives the HTML response — exactly what a browser does, minus rendering.
Two things matter here: the User-Agent header (identifies your scraper to the server; some sites block requests without one) and error handling (raise_for_status() throws an exception if the server returns a 4xx or 5xx status code so you know about failures immediately instead of silently writing bad data to your CSV).
If you want to understand HTTP more deeply, read What Is an API? — scraping is essentially calling an unofficial API by reading the HTML directly.
HTML Parsing: Finding the Data
BeautifulSoup takes raw HTML text and turns it into a searchable tree of objects. Think of it as a query engine for HTML. Once you have a BeautifulSoup object, you can ask it: "give me all elements that match this description."
The two most useful methods:
- soup.select("css selector") — returns a list of all matching elements. Use this when you expect multiple results (like all book cards on a page).
- soup.select_one("css selector") — returns the first matching element, or None. Use this inside a loop when you're extracting one field per item.
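A minimal sketch of the difference, using an inline HTML snippet so it runs without any network access:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <article class="product_pod"><p class="price_color">£10.00</p></article>
  <article class="product_pod"><p class="price_color">£12.50</p></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

cards = soup.select("article.product_pod")      # list of every match
first_price = soup.select_one("p.price_color")  # first match only
missing = soup.select_one("p.no_such_class")    # no match -> None

print(len(cards), first_price.text, missing)  # 2 £10.00 None
```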
CSS Selectors: Pointing at the Right Data
CSS selectors are the way you tell BeautifulSoup which elements you want. If you know basic CSS, you already know this syntax. If not, the three patterns you'll use 90% of the time are:
- article.product_pod — elements of type <article> with the class product_pod
- p.price_color — a <p> tag with the class price_color
- h3 a — any <a> tag that is a descendant of an <h3> tag
When a scraper breaks, it is almost always because a CSS selector stopped matching after the site updated its HTML. The fix is to re-inspect the element in browser DevTools and update the selector.
Data Extraction: Getting Values Out
Once you have an element, you pull data from it in two ways:
- element.text.strip() — gets the visible text content (e.g., "£51.77")
- element["attribute_name"] — gets an HTML attribute value (e.g., title_tag["title"] gets the title="..." attribute)
The star rating trick in the script — reading the second CSS class off the element — is a good example of creative extraction. Data is not always in the text. Sometimes it's encoded in classes, data attributes, or URL parameters.
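Here is a runnable sketch of both extraction styles, using the card HTML from Prompt 2 (trimmed to the relevant tags):

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/..." title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")
article = soup.select_one("article.product_pod")

# .text gives the visible (truncated) text; the attribute holds the full title
title = article.select_one("h3 a")["title"]

# The rating lives in the element's second CSS class, not in any text node
classes = article.select_one("p.star-rating").get("class", [])
rating_word = next(c for c in classes if c != "star-rating")

print(title, "/", rating_word)  # A Light in the Attic / Three
```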
Saving to CSV
Python's built-in csv module handles writing to CSV without any extra install. DictWriter is the cleanest approach: you define the column names (fieldnames), then pass a list of dictionaries — one per row — and it writes them out in order. Open the result in Excel or Google Sheets and you have a spreadsheet ready for analysis.
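A self-contained sketch of the DictWriter pattern, writing to an in-memory buffer instead of a file so there is nothing to clean up:

```python
import csv
import io

books = [
    {"title": "A Light in the Attic", "price": 51.77, "rating": 3},
    {"title": "Tipping the Velvet", "price": 53.74, "rating": 1},
]

buf = io.StringIO()  # stands in for open("books.csv", "w", newline="")
writer = csv.DictWriter(buf, fieldnames=["title", "price", "rating"])
writer.writeheader()     # first row: the column names
writer.writerows(books)  # one row per dict, values in fieldname order

print(buf.getvalue().strip())
```

The first printed line is the header `title,price,rating`, followed by one row per book.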
Level Up: Adding Pagination
The script above already handles pagination — the get_next_page_url() function finds the "next" button on each page and follows it until there are no more pages. Here is the prompt to add pagination to any scraper that doesn't already have it:
Pagination Prompt
My scraper currently only gets page 1. The site has multiple pages.
Here is the HTML for the pagination at the bottom of the page:
[paste the HTML of the pagination section here]
Add a get_next_page_url() function that:
- Finds the "next" button link
- Returns the full absolute URL of the next page
- Returns None when there is no next page
Then update main() to loop through all pages until get_next_page_url returns None.
Add a 1-second sleep between each page request.
The pattern is always the same: find the "next page" link in the HTML, construct its full URL, follow it, repeat. The while url: loop in the script handles this cleanly — when get_next_page_url() returns None, the loop exits naturally.
Limit Your Scope During Development
When testing a new scraper, add a page limit: if page_count >= 3: break. Scraping 3 pages is enough to verify the script works. Only remove the limit once you've confirmed the data looks right. Mistakes at scale mean hammering a server with hundreds of bad requests.
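Here is a sketch of that limit in a pagination loop. The page map is a made-up stand-in for the real site, so the loop's behavior is easy to see without any network traffic:

```python
# fake_pages simulates a 5-page site: each page maps to the next page's
# URL, and the last page maps to None (no "next" link).
fake_pages = {
    "page-1": "page-2",
    "page-2": "page-3",
    "page-3": "page-4",
    "page-4": "page-5",
    "page-5": None,
}

MAX_PAGES = 3  # development limit: remove once the data looks right

url = "page-1"
page_count = 0
visited = []
while url:
    visited.append(url)  # in the real scraper: fetch and parse here
    page_count += 1
    if page_count >= MAX_PAGES:
        break            # stop early while testing
    url = fake_pages[url]

print(visited)  # ['page-1', 'page-2', 'page-3']
```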
Level Up: Using Playwright for JavaScript-Heavy Sites
BeautifulSoup only sees the HTML that the server sends back in the initial response. Many modern sites — job boards, e-commerce platforms, real estate listings — load their data after the initial page load, using JavaScript. BeautifulSoup will fetch the page and see an empty shell.
The test: right-click the page in your browser and select "View Page Source" (not Inspect — View Source). If the data you want is not in that raw HTML, the site is JavaScript-rendered and you need Playwright.
Playwright controls a real browser (Chromium, Firefox, or WebKit) from Python. It loads the page like a regular user, waits for JavaScript to run, and then gives you the fully-rendered HTML to parse.
Install Playwright
pip install playwright
playwright install chromium
The Playwright Prompt
Playwright Conversion Prompt
My scraper uses requests + BeautifulSoup but the target site loads data with JavaScript,
so BeautifulSoup sees empty content.
Rewrite the fetch_page() function to use Playwright instead of requests.
Requirements:
- Use playwright.sync_api (synchronous, not async — I'm not using async/await)
- Launch Chromium in headless mode
- Navigate to the URL
- Wait for the main content to appear (selector: "[paste the selector for the main content here]")
- Return the page's HTML after JavaScript has rendered
- Close the browser when done
Keep parse_books(), save_to_csv(), and main() exactly as they are —
only replace fetch_page().
Playwright is slower than plain requests (it's launching a real browser) and uses more memory, so only reach for it when you genuinely need it. For most public data sites, requests + BeautifulSoup is faster and simpler.
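For reference, here is a sketch of what the converted fetch function might look like, assuming Playwright was installed with the commands above. The function name and the selector argument are placeholders, not part of the generated script:

```python
def fetch_page_js(url, wait_selector):
    """Sketch of a Playwright-based fetch_page() replacement.

    The import lives inside the function so the rest of the script
    still loads even if Playwright isn't installed yet.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # headless Chromium
        page = browser.new_page()
        page.goto(url, timeout=15000)
        # Wait until the JavaScript-rendered content actually exists
        page.wait_for_selector(wait_selector, timeout=15000)
        html = page.content()  # fully-rendered HTML
        browser.close()
        return html
```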
What AI Gets Wrong
AI-generated scrapers have reliable failure modes. Here are the ones that will bite you.
No Rate Limiting
If you ask AI for a scraper without mentioning rate limits, it will often generate code with no delays between requests. This can send hundreds of requests per second to a server — which looks like a denial-of-service attack, will likely get your IP banned, and is genuinely harmful to small sites running on shared hosting. Always include time.sleep(1) between requests. For respectful scraping of commercial sites, 2-3 seconds is safer.
No Error Handling
AI will frequently generate scrapers that assume every request succeeds and every selector matches. In practice, pages time out, servers return 429 (Too Many Requests), and HTML structure varies. Without try/except blocks and raise_for_status(), one failed request crashes the entire script and you lose all your data. Ask for error handling explicitly.
Fragile Selectors
AI will sometimes guess at CSS selectors without seeing the actual HTML. These guesses are often wrong or match the wrong elements. Always inspect the real HTML first (browser DevTools → right-click → Inspect) and paste the relevant section into your prompt. Do not trust AI to know what a site's HTML looks like — it doesn't.
Ignoring robots.txt
AI will never check robots.txt for you. That's your job. See the ethics section below.
Hardcoded Delays That Are Too Short
AI often suggests 0.5-second delays that feel fast in testing but cause issues at scale. Over 200 pages, a 500ms delay means two requests per second for more than a minute and a half — aggressive for a polite scraper. Start with 1 second minimum, and increase for sites where you see 429 responses.
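One way to handle 429s is exponential backoff: wait, then double the wait each time the server pushes back. This is a hypothetical helper, not part of the generated script; the `get` parameter is injectable so the logic can be tested without a live server:

```python
import time

def fetch_with_backoff(url, get=None, max_retries=3, base_delay=1.0):
    """Retry a request with exponential backoff on HTTP 429.

    `get` defaults to requests.get; pass a fake for testing.
    """
    if get is None:
        import requests
        get = requests.get
    delay = base_delay
    response = None
    for attempt in range(max_retries + 1):
        response = get(url, timeout=10)
        if response.status_code != 429:
            return response
        # 429 means "slow down": wait, then double the delay
        time.sleep(delay)
        delay *= 2
    return response  # still 429 after all retries; let the caller decide
```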
Ethics and Legality: Read This Section
Web scraping occupies a complicated legal and ethical space. The technology is neutral — the same code can power legitimate research or cause real harm. Here is what you need to know.
Check robots.txt First
Every well-behaved site publishes a robots.txt file at https://example.com/robots.txt. It specifies which parts of the site automated bots are allowed to access. Look for Disallow rules that apply to your target pages. Respecting robots.txt is not legally required everywhere, but ignoring it is bad practice and in some jurisdictions has been used as evidence of bad faith in litigation.
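Python's standard library can do this check for you. A sketch with a made-up robots.txt (the rules and URLs are illustrative, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt (hypothetical rules, for illustration only)
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /checkout/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) -> is this bot allowed to fetch this URL?
print(rp.can_fetch("BookScraper/1.0", "https://example.com/catalogue/page-1.html"))  # True
print(rp.can_fetch("BookScraper/1.0", "https://example.com/private/data.html"))      # False
```

In a real scraper you would call `rp.set_url("https://example.com/robots.txt")` and `rp.read()` instead of parsing a string.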
Read the Terms of Service
Most sites' Terms of Service have a clause about automated access or commercial use of their data. Violating the ToS can result in your account being banned, civil legal action, or — for large-scale scraping — prosecution under computer fraud laws. If you're scraping for personal use, learning, or non-commercial research, risk is generally low. If you're scraping to build a commercial product, get a lawyer's opinion.
When Scraping is Generally OK
- Publicly available data (no login required)
- Personal use, research, or journalism
- At a rate that doesn't impact the server
- When the site's robots.txt doesn't prohibit it
- When no official API exists for the data you need
When Scraping Creates Risk
- Scraping behind a login (accessing private data)
- Scraping at high frequency (server impact)
- Republishing scraped copyrighted content commercially
- Circumventing technical access controls (CAPTCHA bypassing)
- Violating explicit ToS prohibitions for commercial use
Always Prefer an API
Before writing a scraper, spend 10 minutes looking for an official API. Many sites that look unscrapable have public APIs — Reddit, Twitter/X, GitHub, Wikipedia, Google Maps, Yelp, and hundreds of others. APIs are faster, more reliable, legal by design, and don't break when the site redesigns. Scraping is the last resort, not the first tool.
The Golden Rule of Scraping
Scrape only what you need, at a speed the server can handle, from sites that don't prohibit it. If in doubt, email the site owner and ask for data access. Many will say yes — especially for research or non-commercial use.
What to Learn Next
You have a working scraper. Here are the natural next steps depending on where you want to go:
- What Is Python? — understand the language your scraper runs in, so you can modify it confidently.
- What Is Web Scraping? — the concepts behind scraping: HTTP, the DOM, HTML parsing, and when it makes sense.
- What Is an API? — learn how to get data the official way, which is almost always better than scraping.
- What Is pip? — understand the Python package manager you used to install requests and BeautifulSoup.
- What Is HTML? — deeper understanding of HTML structure makes your CSS selectors much stronger.
- What Is SQL? — when CSV files aren't enough, store your scraped data in a real database you can query.
- What Is Deployment? — run your scraper on a schedule from a server instead of your laptop.
From here, the most useful upgrades to your scraper are: storing results in a database instead of CSV (SQLite is the easiest starting point — ask AI to "rewrite save_to_csv to use SQLite instead"), scheduling it to run daily (ask about cron on Mac/Linux or Task Scheduler on Windows), and adding a Slack or email notification when new data appears.
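As a taste of the first upgrade, here is a sketch of what that SQLite rewrite might look like, using Python's built-in sqlite3 module (the table name and schema are assumptions):

```python
import sqlite3

def save_to_sqlite(books, db_path="books.db"):
    """Sketch of a SQLite replacement for save_to_csv()."""
    conn = sqlite3.connect(db_path)
    # Create the table on first run; later runs append to it
    conn.execute(
        """CREATE TABLE IF NOT EXISTS books (
               title TEXT,
               price REAL,
               rating INTEGER
           )"""
    )
    # Named placeholders pull values straight from each book dict
    conn.executemany(
        "INSERT INTO books (title, price, rating) "
        "VALUES (:title, :price, :rating)",
        books,
    )
    conn.commit()
    conn.close()
```

Once the data is in SQLite you can query it from the terminal, e.g. `sqlite3 books.db "SELECT title FROM books WHERE rating >= 4"`.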
Next Step
Run the books.toscrape.com scraper first — it's free, legal, and built for practice. Once you have 1,000 rows in a CSV, open it in Google Sheets and sort by rating descending. You just built a book recommendation engine from scratch. Then adapt the selectors for a site you actually care about.
FAQ
Do I need to know Python to build a scraper with AI?
Not deeply. AI will generate the full script for you. But you should be able to read Python well enough to understand what the code does before you run it — especially for scraping, where a bug can hammer a server with thousands of requests. Understanding the basics of Python syntax and how to install packages with pip will get you far.
Is web scraping legal?
It depends on what you scrape and how. Scraping publicly available data for personal use, research, or journalism is generally accepted. Scraping in violation of a site's Terms of Service, at a rate that disrupts the server, or for commercial use of copyrighted content can create legal risk. Always check robots.txt, read the ToS, add polite rate limits, and when in doubt, look for an official API instead.
Why do scrapers stop working?
Scrapers depend on the HTML structure of the target page. When a site redesigns its layout, renames CSS classes, or restructures its DOM, your CSS selectors stop matching and you get empty results or errors. This is called selector rot. The fix is to re-inspect the page and update your selectors. AI can help: paste the new HTML and ask it to update the selectors.
When should I use Playwright instead of BeautifulSoup?
Use Playwright when the data you want is loaded by JavaScript after the initial page load. If you view the page source (Ctrl+U) and the data isn't there — but it is visible in the browser — the site is using JavaScript rendering. BeautifulSoup only sees the raw HTML; Playwright controls a real browser and waits for JavaScript to run before extracting the data.
What is the difference between an API and a scraper?
An API is an official, structured way for a website to give you its data. A scraper extracts data by parsing the HTML a browser would render. APIs are more reliable, faster, and usually legal by design. Scraping is a workaround for when no API exists. Always check for an API first — many sites that look unscrapable have a public API that gives you exactly what you need.