Tuesday, April 29, 2025

Block Scripts, Styles, Media in Playwright

 Playwright is a powerful browser automation tool from Microsoft, used for testing, scraping, or automating web interactions. But sometimes, you don’t want to load everything, especially when you're scraping content or speeding up test execution.

Unnecessary resources like JavaScript, stylesheets, images, videos, and even ads can:

  1. Slow down page loading
  2. Consume extra bandwidth
  3. Add noise to your scraping data

Syntax to block CSS files in Playwright

await page.route('**/*.css', (route) => {
  // abort every request whose URL matches the pattern
  route.abort();
});
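
One caveat: a glob such as `**/*.css` is matched against the whole URL, so a stylesheet requested as `main.css?v=3` may not match. `route()` also accepts a predicate that receives a `URL` object, so a possible workaround is to test only the path. This is a minimal sketch; the helper name `hasExtension` is hypothetical and not part of Playwright.

```javascript
// Hypothetical helper: true when the URL's path ends with the given extension,
// ignoring any query string or fragment.
function hasExtension(url, ext) {
  const { pathname } = typeof url === "string" ? new URL(url) : url;
  return pathname.toLowerCase().endsWith(ext.toLowerCase());
}

// Assumed usage with a Playwright page (not executed here):
//   await page.route((url) => hasExtension(url, ".css"), (route) => route.abort());
```

Because the predicate only looks at `pathname`, cache-busting query parameters no longer affect whether the request is blocked.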

Syntax to block JS files in Playwright

await context.route('**/*.js', (route) => route.abort());

Block Requests by Domain

await page.route('**/*', (route) => {
  // block all traffic from the offending domain
  if (route.request().url().includes('www.yahoo.com')) {
    return route.abort();
  }

  // allow all other traffic through
  route.continue();
});
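
Note that `includes()` checks the whole URL, so it would also abort a request to another site that merely mentions the domain in its path or query string. A stricter variant, sketched below, compares the parsed hostname instead. The helper names `isBlockedHost` and `blockHosts` are hypothetical, and treating subdomains as blocked is an assumption you may want to change.

```javascript
// Hypothetical helper: true when the request URL's host is one of the blocked
// domains or a subdomain of one.
function isBlockedHost(requestUrl, blockedDomains) {
  let hostname;
  try {
    hostname = new URL(requestUrl).hostname;
  } catch {
    return false; // not a parseable absolute URL; let it through
  }
  return blockedDomains.some(
    (domain) => hostname === domain || hostname.endsWith(`.${domain}`)
  );
}

// Route handler factory; `route` is expected to be a Playwright Route.
function blockHosts(blockedDomains) {
  return (route) =>
    isBlockedHost(route.request().url(), blockedDomains)
      ? route.abort()
      : route.continue();
}

// Assumed usage: await page.route("**/*", blockHosts(["yahoo.com"]));
```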


Block Requests by Content Type

await page.route('**/*', (route) => {
  if (route.request().resourceType() === 'image') {
    return route.abort();
  }

  route.continue();
});
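
A misspelled type (say, `'images'`) never matches, and nothing warns you about it. The sketch below builds a handler from a list of types and checks each one against the values that `request.resourceType()` is documented to return. The factory name `blockResourceTypes` is hypothetical, and failing fast with an exception is just one possible design.

```javascript
// Values that request.resourceType() is documented to return.
const KNOWN_RESOURCE_TYPES = new Set([
  "document", "stylesheet", "image", "media", "font", "script", "texttrack",
  "xhr", "fetch", "eventsource", "websocket", "manifest", "other",
]);

// Hypothetical factory: builds a route handler that aborts the given types.
// Throws early on an unknown type instead of silently never matching.
function blockResourceTypes(types) {
  const unknown = types.filter((t) => !KNOWN_RESOURCE_TYPES.has(t));
  if (unknown.length > 0) {
    throw new Error(`Unknown resource type(s): ${unknown.join(", ")}`);
  }
  const blocked = new Set(types);
  return (route) =>
    blocked.has(route.request().resourceType())
      ? route.abort()
      : route.continue();
}

// Assumed usage:
//   await page.route("**/*", blockResourceTypes(["image", "media", "font"]));
```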

Block Requests by Arbitrary Logic

await page.route('**/*', (route) => {
  const req = route.request();

  // block by method
  if (req.method() === 'DELETE') {
    return route.abort();
  }

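  // block by header (headers() is synchronous and returns lower-cased names;
  // allHeaders() returns a Promise and cannot be indexed directly)
  if (req.headers()['x-source']?.includes('dangerous')) {
    return route.abort();
  }

  // block by body (here: a JSON array or string with 3 or more items)
  if (req.postDataJSON()?.length >= 3) {
    return route.abort();
  }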

  route.continue();
});
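
When the list of conditions grows, it can help to keep each check as a small predicate and combine them. The following is a sketch only: `isDelete`, `hasDangerousSource`, and `blockWhen` are hypothetical names, and it reads headers through `request.headers()`, whose keys are lower-cased.

```javascript
// Each rule is a predicate over a Playwright Request
// (or any object exposing the same methods).
const isDelete = (req) => req.method() === "DELETE";

// request.headers() returns lower-cased header names, so look up "x-source".
const hasDangerousSource = (req) =>
  (req.headers()["x-source"] ?? "").includes("dangerous");

// Hypothetical combinator: abort when any rule matches, otherwise continue.
function blockWhen(...rules) {
  return (route) => {
    const req = route.request();
    return rules.some((rule) => rule(req)) ? route.abort() : route.continue();
  };
}

// Assumed usage:
//   await page.route("**/*", blockWhen(isDelete, hasDangerousSource));
```

Keeping the rules separate from the handler also makes them easy to test with plain stub objects, without launching a browser.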

Block Requests for a Single Page

To block requests for a single page, register the handler on the Page object. The page.route() method intercepts requests that match a pattern, and a handler registered this way affects only that page.

import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();

// watch traffic matching a pattern
await page.route('**/*.css', (route) => {
  // and abort the request
  route.abort();
});


Block Requests Across All Pages

In this case, rather than setting route handlers on the Page object, set them on the Context object. This applies to route() as well as unroute(), and the syntax is exactly the same.

import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();

// watch the entire browser context
await context.route('**/*.js', (route) => route.abort());

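// no JS is loaded on any page of this context
// (https://example.com is a placeholder; goto('/') only works with a baseURL)
const page1 = await context.newPage();
await page1.goto('https://example.com');
const page2 = await context.newPage();
await page2.goto('https://example.com');

// remove the handler so later requests can load JS again
await context.unroute('**/*.js');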
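
If blocking should only apply for part of a run, one option is to pair route() and unroute() in a small wrapper so the handler is always removed afterwards. This is a sketch under assumptions: `withRoute` is a hypothetical helper, and `target` can be either a Page or a BrowserContext, since both expose `route()` and `unroute()`.

```javascript
// Hypothetical helper: install a route on a page or context, run `fn`,
// then remove the same handler even if `fn` throws.
async function withRoute(target, pattern, handler, fn) {
  await target.route(pattern, handler);
  try {
    return await fn();
  } finally {
    await target.unroute(pattern, handler);
  }
}

// Assumed usage with a Playwright BrowserContext (placeholder URL):
//   await withRoute(context, "**/*.js", (route) => route.abort(), async () => {
//     const page = await context.newPage();
//     await page.goto("https://example.com");
//   });
```

Passing the handler to unroute() removes only that handler, which leaves any other routes registered for the same pattern in place.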

Blocking Resources

Web scraping with headless browsers is bandwidth-intensive: the browser downloads all of the images, fonts, and other expensive resources that our scraper doesn't care about. To optimize this, we can configure our Playwright instance to block these unnecessary resources:

const { chromium } = require("playwright");

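// Block requests by resource type (e.g., image, font)
const BLOCK_RESOURCE_TYPES = [
  // Values that Playwright's request.resourceType() can return:
  "font",
  "image",
  "media",
  "texttrack",
  // Not among the resource types Playwright documents; these entries
  // never match, so they are harmless:
  "beacon",
  "csp_report",
  "imageset",
  "object",
  // We can even block stylesheets and scripts, though it's not recommended:
  // 'stylesheet',
  // 'script',
  // 'xhr',
];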

// Block popular third-party resources like tracking.
// A bare "google" entry is left out: it would also abort the navigation
// to https://www.google.com further down.
const BLOCK_RESOURCE_NAMES = [
  "adzerk",
  "analytics",
  "cdn.api.twitter",
  "doubleclick",
  "exelator",
  "facebook",
  "fontawesome",
  "google-analytics",
  "googletagmanager",
];

// Function to intercept and block requests
const interceptRoute = (route) => {
  const request = route.request();

  // Block by resource type
  if (BLOCK_RESOURCE_TYPES.includes(request.resourceType())) {
    console.log(
      `Blocking background resource: ${request.url()} (blocked type: ${request.resourceType()})`
    );
    return route.abort();
  }

  // Block by resource name (URL)
  if (BLOCK_RESOURCE_NAMES.some((key) => request.url().includes(key))) {
    console.log(
      `Blocking background resource: ${request.url()} (blocked name)`
    );
    return route.abort();
  }

  // Continue all other requests
  return route.continue();
};

(async () => {
  const browser = await chromium.launch({
    headless: false,
    // Enable devtools to see total resource usage
    devtools: true,
  });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
  });
  const page = await context.newPage();

  // Enable intercepting for all requests
  await page.route("**/*", interceptRoute);

  // Navigate to the target page and wait for the search box
  await page.goto("https://www.google.com");
  await page.waitForSelector('[name=q]');

  // Close the browser
  await browser.close();
})();
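
The script above opens DevTools to inspect resource usage by eye. To get numbers out of a run instead, one possibility is to wrap the blocking rule in a handler that tallies decisions per resource type. This is a sketch, not part of Playwright: `createCountingBlocker` and the shape of `stats` are assumptions made for illustration.

```javascript
// Hypothetical wrapper: counts how many requests a blocking rule aborts or
// allows, grouped by resource type, so the savings can be inspected later.
function createCountingBlocker(shouldBlock) {
  const stats = { blocked: {}, allowed: {} };

  const handler = (route) => {
    const request = route.request();
    const block = shouldBlock(request);
    const bucket = block ? stats.blocked : stats.allowed;
    const type = request.resourceType();
    bucket[type] = (bucket[type] || 0) + 1;
    return block ? route.abort() : route.continue();
  };

  return { handler, stats };
}

// Assumed usage with a Playwright page:
//   const blocker = createCountingBlocker(
//     (req) => ["image", "media", "font"].includes(req.resourceType())
//   );
//   await page.route("**/*", blocker.handler);
//   await page.goto("https://example.com"); // placeholder URL
//   console.log(blocker.stats);
```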


That covers how to intercept and block these resources using Playwright's built-in request routing.
