This is part 2 of my series of posts on Python, Selenium, and Fargate.

  • Part 1 — Run a Python Selenium web scraper on AWS Fargate
  • Part 2 — Adding Browsermob Proxy to sniff traffic and gain more confidence that the website you’re trying to scrape has loaded (this post)
  • Part 3 — Exception handling strategies for when something inevitably crashes

While trying to get campsite availability for Torres App, I ran into issues with a site that was using AJAX requests. I didn’t have a good way of knowing when the site had finished loading, and using a sleep timer seemed very crude. Enter Browsermob Proxy, which allowed me to know definitively when the requests I cared about were finished.

I am going to start with the project from part 1 and add Browsermob Proxy support.

The code for this project can be found here.

Step 1 – Running locally

Prerequisites:

  • Make sure you have a Java Runtime Environment (JRE) installed on your machine. To check, open your terminal and type java -version. If you don’t have one, search for current installation instructions, as they change often.
  • Install browsermob-proxy by downloading the zip and placing it somewhere on your system. I put it in /usr/local/bin so my full path became /usr/local/bin/browsermob-proxy-2.1.4/bin/browsermob-proxy
  • To verify you installed it successfully, you should be able to start the proxy from the command line as described here
  • Install the latest commit of browsermob-proxy-py with pip install git+https://github.com/AutomatedTester/browsermob-proxy-py.git@48accb7f17e776397f98ece63eabcd755d606f3e#egg=browsermob-proxy
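
With the prerequisites in place, a quick smoke test from Python (my own snippet, using the install path from above) confirms that the bindings can start and stop the proxy:

from browsermobproxy import Server

# Point at wherever you unzipped browsermob-proxy
server = Server('/usr/local/bin/browsermob-proxy-2.1.4/bin/browsermob-proxy')
server.start()
proxy = server.create_proxy()
print('Proxy listening on port {}'.format(proxy.port))
server.stop()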

We’re going to use Browsermob Proxy to keep track of all the requests Selenium makes through the browser. I found it helpful to use the contextmanager decorator to manage the browser and proxy, making sure they’re properly disposed of no matter how the program exits. Let’s get started.
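
If you haven’t used contextlib.contextmanager before, here’s a toy example (mine, not part of the project) showing why it guarantees cleanup: by wrapping the yield in try/finally, the cleanup code runs even if the body of the with block raises.

import contextlib

@contextlib.contextmanager
def resource():
    print('acquired')
    try:
        yield 'handle'
    finally:
        # Runs whether the with block exits normally or with an exception
        print('released')

with resource() as handle:
    raise RuntimeError('boom')  # 'released' still prints before the traceback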

selenium_browsermob/settings.py

class Config(object):
    CHROME_PATH = '/Library/Application Support/Google/chromedriver76.0.3809.68'
    BROWSERMOB_PATH = '/usr/local/bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'


class Docker(Config):
    CHROME_PATH = '/usr/local/bin/chromedriver'

Building on the previous example, we just added a new variable, BROWSERMOB_PATH.

selenium_browsermob/main.py

import contextlib
from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
import argparse

from selenium_browsermob import settings

config = settings.Config

Adding a couple more imports.
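
One note on config = settings.Config: since argparse is imported, the environment name can be taken from the command line (the Docker run in Step 2 passes Docker for exactly this reason). A minimal sketch of how that selection could work; the exact wiring in the repo may differ:

def load_config():
    # Hypothetical helper: pick the Config subclass named on the command line,
    # e.g. `python selenium_browsermob/main.py Docker`; fall back to Config
    parser = argparse.ArgumentParser()
    parser.add_argument('env', nargs='?', default='Config')
    args = parser.parse_args()
    return getattr(settings, args.env)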

@contextlib.contextmanager
def browser_and_proxy():
    server = Server(config.BROWSERMOB_PATH)
    server.start()
    proxy = server.create_proxy()
    proxy.new_har(options={'captureContent': True})

Setting up the Browsermob proxy and, importantly, telling our HAR to capture content so that response bodies show up in the entries.
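
For reference, each entry in proxy.har['log']['entries'] is a dict following the HAR format. Trimmed down to the fields we’ll use later, one looks roughly like this:

{
    'request': {
        'method': 'GET',
        'url': 'https://www.airbnb.com/api/v2/explore_tabs...',
    },
    'response': {
        'status': 200,
        'content': {
            'mimeType': 'application/json',
            'text': '...',  # only populated because captureContent is on
        },
    },
}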

    # Set up Chrome
    option = webdriver.ChromeOptions()
    option.add_argument('--proxy-server=%s' % proxy.proxy)

    prefs = {"profile.managed_default_content_settings.images": 2}
    option.add_experimental_option("prefs", prefs)
    option.add_argument('--headless')
    option.add_argument('--no-sandbox')
    option.add_argument('--disable-gpu')

    capabilities = DesiredCapabilities.CHROME.copy()
    capabilities['acceptSslCerts'] = True
    capabilities['acceptInsecureCerts'] = True

    path = config.CHROME_PATH
    browser = webdriver.Chrome(options=option,
                               desired_capabilities=capabilities,
                               executable_path=path)

    try:
        yield browser, proxy
    finally:
        browser.quit()
        server.stop()

Setting up Chrome to use the proxy, disabling image loading to speed up page loads, and accepting insecure certificates so the proxy can intercept HTTPS traffic.

def demo():
    with browser_and_proxy() as (browser, proxy):
        browser.get('https://www.airbnb.com/s/Seattle--WA--United-States/homes')
        first_listing_page_1 = browser.find_element_by_xpath('//div[contains(@id,"listing")]')
        print('First Listing Page 1: {}'.format(first_listing_page_1.text))
        page_2 = browser.find_element_by_xpath('//li[@data-id="page-2"]')
        page_2.click()
        first_listing_page_2 = browser.find_element_by_xpath('//div[contains(@id,"listing")]')
        print('First Listing Page 2: {}'.format(first_listing_page_2.text))

Changing up the demo code to open AirBnB and try to print out the first listing on pages 1 and 2.

When I run this, the first listing on page 2 is either the same as on page 1, or I get a StaleElementReferenceException, which means the element is no longer attached to the DOM.

Clearly, we have an issue and we can fix this with the help of the browsermob proxy.
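
For context, the usual Selenium workaround is an explicit wait. For example (my snippet, not in the repo), after clicking page 2 we could wait for the old element to go stale:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the page-1 element to detach from the DOM
WebDriverWait(browser, 10).until(EC.staleness_of(first_listing_page_1))

But that only proves the DOM changed; it says nothing about whether the AJAX response has actually arrived, which is exactly what the proxy can tell us.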

Using the Network tab in Chrome’s developer tools, I figured out that the important AirBnB requests are those to https://www.airbnb.com/api/v2/explore_tabs.

Let’s add a function to help us debug.

def scan_har(proxy):
    for e in proxy.har['log']['entries']:
        if 'explore_tabs' in e['request']['url']:
            print('Request URL: {}'.format(e['request']['url']))
            print('Status: {}'.format(e['response']['status']))

And call it as part of the demo

def demo():
    with browser_and_proxy() as (browser, proxy):
        browser.get('https://www.airbnb.com/s/Seattle--WA--United-States/homes')
        first_listing_page_1 = browser.find_element_by_xpath('//div[contains(@id,"listing")]')
        scan_har(proxy)
        print('First Listing Page 1: {}'.format(first_listing_page_1.text))
        page_2 = browser.find_element_by_xpath('//li[@data-id="page-2"]')
        page_2.click()
        scan_har(proxy)
        first_listing_page_2 = browser.find_element_by_xpath('//div[contains(@id,"listing")]')
        print('First Listing Page 2: {}'.format(first_listing_page_2.text))

Here’s the output:

Request URL: https://www.airbnb.com/api/v2/explore_tabs...
Status: 200
First Listing Page 1: Cozy Private Loft Apt with Balcony
ENTIRE LOFT
Cozy Private Loft Apt with Balcony
2 guests · Studio · 1 bed · 1 bath
Wifi · Kitchen
202 reviews
202 · Superhost
Price:
$80/night

Request URL: https://www.airbnb.com/api/v2/explore_tabs...
Status: 200
Request URL: https://www.airbnb.com/api/v2/explore_tabs...
Status: 0
First Listing Page 2: Cozy Private Loft Apt with Balcony
202 reviews

The problem is the Status: 0 entry above: when we look for the listing on the second page, the request to AirBnB hasn’t had a chance to finish.

We can fix this by updating scan_har into a blocking version, scan_and_block_har, as follows, and changing the demo code to call it instead.

import time

def scan_and_block_har(proxy):
    all_requests_finished = False
    while not all_requests_finished:
        all_requests_finished = True
        for e in proxy.har['log']['entries']:
            if 'explore_tabs' in e['request']['url']:
                # A still-pending request shows up with status 0 and no body yet
                if e['response']['status'] != 200 or e['response']['content'].get('text') is None:
                    all_requests_finished = False
        if not all_requests_finished:
            time.sleep(0.1)  # poll politely instead of hammering the proxy

The function looks at every request with ‘explore_tabs’ in its URL and keeps checking until each one has a 200 status and a response body.
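
One caveat: as written, the loop will spin forever if one of those requests never completes. A variant with a timeout (my own addition, not part of the repo) might look like this:

import time

def scan_and_block_har_with_timeout(proxy, timeout=30):
    # Hypothetical variant of scan_and_block_har that gives up after `timeout` seconds
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pending = [
            e for e in proxy.har['log']['entries']
            if 'explore_tabs' in e['request']['url']
            and (e['response']['status'] != 200
                 or e['response']['content'].get('text') is None)
        ]
        if not pending:
            return
        time.sleep(0.1)
    raise TimeoutError('explore_tabs requests did not finish within {}s'.format(timeout))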

Now when we run the script the output is what we expect.

First Listing Page 1: Cozy Private Loft Apt with Balcony
ENTIRE LOFT
Cozy Private Loft Apt with Balcony
2 guests · Studio · 1 bed · 1 bath
Wifi · Kitchen · Washer
4.86
202 reviews
(202)
· Superhost
Price:
$80/night
First Listing Page 2: Inviting room, easy walk to lake and light rail.
PRIVATE ROOM IN HOUSE
Inviting room, easy walk to lake and light rail.
2 guests · 1 bedroom · 1 bed · 1 shared bath
Wifi
4.92
123 reviews
(123)
· Superhost
Price:
$39/night

Step 2 – Running locally in Docker

To get the above to run in Docker we need to add browsermob-proxy and java to our Dockerfile as follows

FROM python:3.6

# install google chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get -y update --fix-missing
RUN apt-get install -y google-chrome-stable

# install chromedriver
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# install browsermob proxy, needed for selenium_browsermob
RUN wget -O /tmp/browsermob-proxy.zip https://github.com/lightbody/browsermob-proxy/releases/download/browsermob-proxy-2.1.4/browsermob-proxy-2.1.4-bin.zip
RUN unzip /tmp/browsermob-proxy.zip -d /usr/local/bin/

# install java, needed for selenium_browsermob
RUN apt-get install -y openjdk-8-jre

# set display port to avoid crash
ENV DISPLAY=:99

# Install requirements first so this step is cached by Docker
COPY requirements.txt /home/selenium-aws-fargate-demo/requirements.txt
WORKDIR /home/selenium-aws-fargate-demo/
RUN pip install -r requirements.txt

# copy code
COPY ./ /home/selenium-aws-fargate-demo/
RUN python setup.py install

We can rebuild the Docker image and test that it works with the following commands

docker build -t selenium-aws-fargate-demo .
docker run -it selenium-aws-fargate-demo python selenium_browsermob/main.py Docker

You should see the program run successfully.

Step 3 – Running in Fargate

To run in AWS, you just need to push the new image to ECR and update your task with the new command, as described in my previous post.

Last modified: August 10, 2019
