
Web Scraping with Python: What Actually Works in 2026

A practical guide to scraping web data with requests and BeautifulSoup — including the anti-bot measures you'll encounter and how to handle them ethically.

8 min read · February 15, 2026 · By FreeToolKit Team · Free to read

Web scraping gets a bad reputation, most of it undeserved. Scraping public data for research, price tracking, or automation is a legitimate use case with a long history. Here's how to do it correctly.

Before Scraping: Check These First

  1. robots.txt: example.com/robots.txt lists the paths the site explicitly disallows crawlers from fetching. Disallowed paths are off-limits by convention.
  2. Public API: many sites provide one (Twitter/X, Reddit, GitHub). An API is always preferable: structured data, no parsing needed, officially supported.
  3. Terms of Service: search for 'scraping', 'crawling', and 'automated access'. Commercial use of scraped data often requires permission.
  4. Rate limits: if you must scrape, how fast is too fast? Mimicking human browsing speed (a few seconds between requests) is a reasonable baseline.
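Python's standard library can check robots.txt rules for you. Here's a minimal sketch; the robots.txt body, bot name, and URLs are made up for illustration, and in practice you'd point the parser at the live file with set_url() and read():

```python
from urllib import robotparser

# Example robots.txt body (normally fetched from the site itself)
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given path is allowed for our (hypothetical) bot
print(rp.can_fetch('research-bot', 'https://example.com/articles'))   # allowed
print(rp.can_fetch('research-bot', 'https://example.com/private/x'))  # disallowed
print(rp.crawl_delay('research-bot'))  # requested delay between hits
```

Honoring crawl_delay when it's present is an easy way to stay within the site's own stated limits.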

The Basic Pattern with requests + BeautifulSoup

scraper.py

import requests
from bs4 import BeautifulSoup
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; research-bot/1.0)'
}

def scrape_page(url):
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raises on 4xx/5xx
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find elements by CSS selector
    titles = soup.select('h2.article-title')
    return [t.get_text(strip=True) for t in titles]

urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
    data = scrape_page(url)
    print(data)
    time.sleep(2)  # be polite

Finding the Right CSS Selectors

Open your browser's DevTools: right-click the element you want and choose 'Inspect' to see its HTML. Right-click the element in the DevTools panel → 'Copy' → 'Copy selector' for an auto-generated CSS selector. These auto-selectors are often overly specific (nth-child(3) of nth-child(2)...); look at the actual element for cleaner selectors based on class names or IDs.
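To see the difference, here's a small sketch against a made-up HTML snippet: the position-based selector DevTools tends to generate breaks as soon as the page reorders its elements, while a class-based one keeps working.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, like what DevTools might show you
html = """
<div id="main">
  <article><h2 class="article-title">First post</h2></article>
  <article><h2 class="article-title">Second post</h2></article>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Auto-generated style: brittle, tied to element position
brittle = soup.select('#main > article:nth-child(2) > h2')

# Hand-written style: based on the class name, survives reordering
clean = soup.select('h2.article-title')

print([t.get_text(strip=True) for t in clean])
```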

When You Need a Real Browser

JavaScript-rendered content requires Playwright (modern choice) or Selenium. Playwright example:

playwright_scrape.py

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com')
    page.wait_for_selector('.content')
    
    data = page.eval_on_selector_all('.product-title',
        'elements => elements.map(e => e.textContent)')
    print(data)
    browser.close()

Frequently Asked Questions

Is web scraping legal?
It depends on what you scrape, what you do with the data, and who you ask. Scraping publicly accessible information for personal use, research, or journalism is generally accepted. Violating terms of service, scraping behind a login without permission, republishing scraped content, or using scraping to enable copyright infringement are all legally or ethically problematic. The hiQ v. LinkedIn ruling (2022) established that scraping publicly accessible data doesn't violate the CFAA (Computer Fraud and Abuse Act) in the US, but ToS violations are separate civil matters. When in doubt: check the site's ToS, check robots.txt, and if building a commercial product, get legal advice.
Why does my scraper return empty results even though the data is on the page?
Most commonly: the content is loaded by JavaScript after the initial page load. requests fetches the raw HTML before JavaScript runs, so dynamic content isn't there. Solutions: use Selenium or Playwright to drive a real browser that executes JavaScript; find the underlying API the page's JavaScript calls (check Network tab in dev tools) and call it directly — often easier and faster than browser automation; look for a public API that provides the same data.
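The "call the underlying API directly" approach can look like the sketch below. The endpoint URL here is entirely hypothetical; in practice you'd copy the real one from the Network tab while the page loads.

```python
import requests

# Hypothetical endpoint the page's JavaScript might call
API_URL = 'https://example.com/api/products?page=1'

def fetch_json(url):
    resp = requests.get(url, headers={'Accept': 'application/json'}, timeout=10)
    resp.raise_for_status()
    return resp.json()  # already structured: no HTML parsing needed
```

Because the response is JSON, there's no selector maintenance at all, and the request is far cheaper than driving a headless browser.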
How do I avoid getting blocked while scraping?
Respect robots.txt (it signals what the site doesn't want scraped). Add delays between requests: time.sleep(random.uniform(1, 3)) randomizes the interval. Rotate User-Agent strings to look like different browsers. Use session objects to persist cookies. Scrape during off-peak hours so your requests add less load to the site. Don't make 100 requests per second; that's a denial of service, not scraping. Ethical scraping looks like a slow human, not a firehose.
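Those habits combine into a small helper like this sketch; the User-Agent strings are placeholders you'd replace with real browser values.

```python
import random
import time
import requests

# Placeholder User-Agent strings; substitute real browser values
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0',
]

# A Session persists cookies across requests, like a real browser
session = requests.Session()

def polite_get(url):
    time.sleep(random.uniform(1, 3))  # randomized, human-ish pacing
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return session.get(url, headers=headers, timeout=10)
```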
What's the difference between requests and Scrapy?
requests + BeautifulSoup is the right starting point — simple, synchronous, easy to understand, enough for most one-off scraping tasks. Scrapy is a full scraping framework — handles crawling multiple pages, following links, storing results, concurrent requests, middleware for retries and proxies. Scrapy's learning curve is steeper but it's the right tool for large-scale projects that need to scrape thousands of pages or run as a scheduled job. Start with requests; graduate to Scrapy when you outgrow it.

FreeToolKit Team

We build free browser-based tools and write practical guides without the fluff.

Tags:

python, web-scraping, developer, automation