Playwright is a powerful browser automation tool from Microsoft, used for testing, scraping, or automating web interactions. But sometimes, you don’t want to load everything, especially when you're scraping content or speeding up test execution.
Unnecessary resources like JavaScript, stylesheets, images, videos, and even ads can:
- Slow down page loading
- Consume extra bandwidth
- Add noise to your scraping data
✅ Syntax to block CSS files in Playwright
await page.route('**/*.css', (route) => {
  // abort every request that matches the pattern
  route.abort();
});
✅ Syntax to block JS files in Playwright
await context.route('**/*.js', (route) => route.abort());
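When several asset types should be blocked at once, one route per extension gets repetitive. route() also accepts a predicate function that receives the request URL as a URL object, so the decision can live in a small helper. The sketch below is one possible approach; the helper name hasBlockedExtension and the extension list are illustrative assumptions, not part of Playwright.

```javascript
// Illustrative list of extensions to block; adjust it to the site being scraped.
const BLOCKED_EXTENSIONS = new Set(['css', 'js', 'png', 'jpg', 'woff2']);

// Returns true when the last path segment of the URL (ignoring the query
// string and fragment) ends in one of the blocked extensions.
function hasBlockedExtension(url) {
  const { pathname } = new URL(url);
  const lastSegment = pathname.split('/').pop();
  if (!lastSegment.includes('.')) return false;
  const extension = lastSegment.split('.').pop().toLowerCase();
  return BLOCKED_EXTENSIONS.has(extension);
}

// Usage with Playwright (assumes an existing `page`):
// await page.route((url) => hasBlockedExtension(url.href), (route) => route.abort());
```

Checking the parsed path rather than the raw string means a cache-busting suffix such as ?v=3 does not hide the extension.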
✅ Block Requests by Domain
await page.route('**/*', (route) => {
  // block all traffic from the offending domain
  if (route.request().url().includes('www.yahoo.com')) {
    return route.abort();
  }
  // allow all other traffic through
  route.continue();
});
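A substring check like includes('www.yahoo.com') also matches URLs that only mention the domain somewhere else, for example in a query string. A stricter variant compares the parsed hostname instead. The following is a minimal sketch under that assumption; isBlockedHost is a hypothetical helper name.

```javascript
// Returns true when the URL's hostname equals one of the blocked domains
// or is a subdomain of one of them.
function isBlockedHost(url, blockedDomains) {
  const { hostname } = new URL(url);
  return blockedDomains.some(
    (domain) => hostname === domain || hostname.endsWith(`.${domain}`)
  );
}

// Usage with Playwright (assumes an existing `page`):
// await page.route('**/*', (route) =>
//   isBlockedHost(route.request().url(), ['yahoo.com'])
//     ? route.abort()
//     : route.continue()
// );
```

Listing the bare domain (yahoo.com) covers www.yahoo.com and other subdomains while leaving unrelated hosts untouched.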
✅ Block Requests by Content Type
await page.route('**/*', (route) => {
  if (route.request().resourceType() === 'image') {
    return route.abort();
  }
  route.continue();
});
✅ Block Requests by Arbitrary Logic
await page.route('**/*', (route) => {
  const req = route.request();
  // block by method
  if (req.method() === 'DELETE') {
    return route.abort();
  }
  // block by header (headers() is synchronous and returns lower-cased header names)
  if (req.headers()['x-source']?.includes('dangerous')) {
    return route.abort();
  }
  // block by body
  if (req.postDataJSON()?.length >= 3) {
    return route.abort();
  }
  route.continue();
});
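As the number of conditions grows, it can help to keep them in a list of named rules, so that a log line can say which rule caused a block. This is only a sketch of one possible structure; BLOCK_RULES and firstMatchingRule are hypothetical names, and the rules simply mirror the method and header checks above.

```javascript
// Each rule has a name (for logging) and a test that receives the request.
const BLOCK_RULES = [
  { name: 'delete-method', test: (req) => req.method() === 'DELETE' },
  {
    name: 'dangerous-source',
    test: (req) => (req.headers()['x-source'] ?? '').includes('dangerous'),
  },
];

// Returns the name of the first rule that matches, or null if none does.
function firstMatchingRule(req, rules) {
  const match = rules.find((rule) => rule.test(req));
  return match ? match.name : null;
}

// Usage with Playwright (assumes an existing `page`):
// await page.route('**/*', (route) => {
//   const rule = firstMatchingRule(route.request(), BLOCK_RULES);
//   if (rule) {
//     console.log(`Blocked by ${rule}: ${route.request().url()}`);
//     return route.abort();
//   }
//   return route.continue();
// });
```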
Block Requests for a Single Page
This example blocks requests for a single page. Playwright's Page class provides the route() method for intercepting network traffic, which lets us control the requests made by that one page.
import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();

// watch traffic matching a pattern
await page.route('**/*.css', (route) => {
  // and abort the request
  route.abort();
});
Block Requests Across All Pages
To block requests on every page, set the route handlers on the Context object instead of the Page object. This applies to both route() and unroute(); the syntax is exactly the same.
import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext();

// watch the entire browser context
await context.route('**/*.js', (route) => route.abort());

// no JS loaded anywhere!
// ('/' is resolved against the context's baseURL, which must be configured)
const page1 = await context.newPage();
await page1.goto('/');
const page2 = await context.newPage();
await page2.goto('/');

// enable JS on future requests
await context.unroute('**/*.js');
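Because route() and unroute() come as a pair, a small wrapper can make sure the block is always lifted again, even if the code in between throws. The sketch below assumes only that the target (a Page or a Context) exposes route(pattern, handler) and unroute(pattern, handler); withBlockedRoute is a hypothetical helper name.

```javascript
// Blocks requests matching `pattern` on `target` while `action` runs,
// then removes that specific handler again, whether or not `action` throws.
async function withBlockedRoute(target, pattern, action) {
  const handler = (route) => route.abort();
  await target.route(pattern, handler);
  try {
    return await action();
  } finally {
    // Passing the same handler removes only the route registered above.
    await target.unroute(pattern, handler);
  }
}

// Usage with Playwright (assumes an existing `context` and `page`,
// and a caller-supplied URL in `targetUrl`):
// await withBlockedRoute(context, '**/*.js', () => page.goto(targetUrl));
```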
Blocking Resources
Web scraping using headless browsers is really bandwidth intensive. The browser is downloading all of the images, fonts and other expensive resources our web scraper doesn't care about. To optimize this we can configure our Playwright instance to block these unnecessary resources:
const { chromium } = require("playwright");

// Block requests by resource type (e.g., image, font)
const BLOCK_RESOURCE_TYPES = [
  "beacon",
  "csp_report",
  "font",
  "image",
  "imageset",
  "media",
  "object",
  "texttrack",
  // We can even block stylesheets and scripts, though it's not recommended:
  // 'stylesheet',
  // 'script',
  // 'xhr',
];

// Block popular third-party resources like tracking
const BLOCK_RESOURCE_NAMES = [
  "adzerk",
  "analytics",
  "cdn.api.twitter",
  "doubleclick",
  "exelator",
  "facebook",
  "fontawesome",
  // a bare "google" entry is left out: it would also abort the navigation to google.com below
  "google-analytics",
  "googletagmanager",
];

// Function to intercept and block requests
const interceptRoute = (route) => {
  const request = route.request();

  // Block by resource type
  if (BLOCK_RESOURCE_TYPES.includes(request.resourceType())) {
    console.log(
      `Blocking background resource: ${request.url()} (blocked type: ${request.resourceType()})`
    );
    return route.abort();
  }

  // Block by resource name (URL)
  if (BLOCK_RESOURCE_NAMES.some((key) => request.url().includes(key))) {
    console.log(
      `Blocking background resource: ${request.url()} (blocked name)`
    );
    return route.abort();
  }

  // Continue all other requests
  return route.continue();
};

(async () => {
  const browser = await chromium.launch({
    headless: false,
    // Enable devtools to see total resource usage
    devtools: true,
  });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
  });
  const page = await context.newPage();

  // Enable intercepting for all requests
  await page.route("**/*", interceptRoute);

  // Navigate to the target page and wait for the search box
  await page.goto("https://www.google.com");
  await page.waitForSelector('[name=q]');

  // Close the browser
  await browser.close();
})();
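To get a rough idea of how much the blocking actually saves without reading the console output line by line, the handler can keep a tally of blocked requests per resource type. This is a minimal sketch, not part of the script above; createCountingBlocker is a hypothetical name, and it only assumes the route object offers request(), abort() and continue() as Playwright's does.

```javascript
// Builds a route handler that aborts the given resource types and
// counts how many requests of each type were blocked.
function createCountingBlocker(blockedTypes) {
  const blocked = new Set(blockedTypes);
  const counts = {};
  const handler = (route) => {
    const type = route.request().resourceType();
    if (!blocked.has(type)) {
      return route.continue();
    }
    counts[type] = (counts[type] ?? 0) + 1;
    return route.abort();
  };
  return { handler, counts };
}

// Usage with Playwright (assumes an existing `page`):
// const { handler, counts } = createCountingBlocker(['image', 'font', 'media']);
// await page.route('**/*', handler);
// ... navigate and scrape ...
// console.log(counts);
```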
That is how to intercept and block these resources using Playwright's built-in request routing.