This is part 2 of my series on Python, Selenium, and Fargate.
- Part 1 — Run a Python Selenium web scraper on AWS Fargate
- Part 2 — Add BrowserMob Proxy to sniff traffic and gain more confidence that the website you're trying to scrape has actually loaded (this post)
- Part 3 — Exception handling strategies for when something inevitably crashes
While trying to get campsite availability for Torres App, I ran into issues with a site that loads its content via AJAX requests. I didn't have a good way of knowing when the site had finished loading, and using a sleep timer seemed very crude. Enter BrowserMob Proxy: it let me know definitively when the requests I cared about had finished.
I am going to start with the project from part 1 and add BrowserMob Proxy support.
The code for this project can be found here
Step 1 – Running locally
Prerequisites:
- Make sure you have a Java Runtime Environment (JRE) installed on your machine. To check, open your terminal and run `java -version`. If it isn't installed, please search for current install directions, as they change often.
- Install browsermob-proxy by downloading the zip and placing it somewhere on your system. I put it in `/usr/local/bin`, so my full path became `/usr/local/bin/browsermob-proxy-2.1.4/bin/browsermob-proxy`.
- To verify you installed it successfully, you should be able to start the proxy from the command line as described here.
- Install the latest commit of browsermob-proxy-py with:

```
pip install git+https://github.com/AutomatedTester/browsermob-proxy-py.git@48accb7f17e776397f98ece63eabcd755d606f3e#egg=browsermob-proxy
```
We're going to use BrowserMob Proxy to keep track of all the requests Selenium makes through the browser. I found it helpful to use the contextmanager decorator to manage the browser and proxy, making sure they're properly disposed of no matter how the program exits. Let's get started.
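As a quick refresher on the pattern, here is a toy example (not the project code): `contextlib.contextmanager` turns a generator into a context manager whose `finally` block runs no matter how the `with` body exits, even on an exception.

```python
import contextlib

@contextlib.contextmanager
def managed_resource():
    resource = {'open': True}   # stand-in for a browser/proxy pair
    try:
        yield resource          # the with-body runs here
    finally:
        resource['open'] = False  # cleanup runs even if the body raises

with managed_resource() as r:
    assert r['open']
# after the block, cleanup has run and r['open'] is False
```

This is exactly why it suits a browser plus proxy: `browser.quit()` and `server.stop()` belong in that `finally` block.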
selenium_browsermob/settings.py
```python
class Config(object):
    CHROME_PATH = '/Library/Application Support/Google/chromedriver76.0.3809.68'
    BROWSERMOB_PATH = '/usr/local/bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'


class Docker(Config):
    CHROME_PATH = '/usr/local/bin/chromedriver'
```
Building on the previous example, we just added a new variable, `BROWSERMOB_PATH`.
selenium_browsermob/main.py
```python
import argparse
import contextlib

from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities

from selenium_browsermob import settings

config = settings.Config
```
Adding a couple more includes
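The `argparse` import hints at how the `Docker` argument passed on the command line selects a config class. That wiring isn't shown in this excerpt, so here is a guess at how it could look, with hypothetical stand-in classes for the settings module:

```python
import argparse

# Stand-ins for the real settings module (hypothetical, for illustration only)
class Config(object):
    CHROME_PATH = '/Library/Application Support/Google/chromedriver76.0.3809.68'

class Docker(Config):
    CHROME_PATH = '/usr/local/bin/chromedriver'

def pick_config(argv):
    """Map an optional positional argument ('Docker') to a config class."""
    parser = argparse.ArgumentParser()
    parser.add_argument('env', nargs='?', default='Config',
                        choices=['Config', 'Docker'])
    args = parser.parse_args(argv)
    return {'Config': Config, 'Docker': Docker}[args.env]
```

With this sketch, `pick_config(['Docker'])` returns the `Docker` class, which overrides only `CHROME_PATH`, and no argument falls back to the local `Config`.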
```python
@contextlib.contextmanager
def browser_and_proxy():
    server = Server(config.BROWSERMOB_PATH)
    server.start()
    proxy = server.create_proxy()
    proxy.new_har(options={'captureContent': True})
```
Here we start the BrowserMob server and proxy and, importantly, tell the HAR to capture response content.
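For context, `proxy.har` returns a dictionary in HAR (HTTP Archive) format. The exact fields vary by BrowserMob version, but the pieces we'll read later look roughly like this hand-written sample (not real capture output):

```python
# A hand-written sample of the HAR structure proxy.har returns.
# Real captures contain many more fields; only the ones we read are shown.
sample_har = {
    'log': {
        'entries': [
            {
                'request': {'url': 'https://www.airbnb.com/api/v2/explore_tabs?...'},
                'response': {
                    'status': 200,
                    # 'text' is present because we passed captureContent=True
                    'content': {'text': '{"explore_tabs": "..."}'},
                },
            },
        ]
    }
}

# Iterating entries is how we'll inspect traffic later on
for entry in sample_har['log']['entries']:
    print(entry['request']['url'], entry['response']['status'])
```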
```python
    # Set up Chrome
    option = webdriver.ChromeOptions()
    option.add_argument('--proxy-server=%s' % proxy.proxy)
    prefs = {"profile.managed_default_content_settings.images": 2}
    option.add_experimental_option("prefs", prefs)
    option.add_argument('--headless')
    option.add_argument('--no-sandbox')
    option.add_argument('--disable-gpu')

    capabilities = DesiredCapabilities.CHROME.copy()
    capabilities['acceptSslCerts'] = True
    capabilities['acceptInsecureCerts'] = True

    path = config.CHROME_PATH
    browser = webdriver.Chrome(options=option,
                               desired_capabilities=capabilities,
                               executable_path=path)

    try:
        yield browser, proxy
    finally:
        browser.quit()
        server.stop()
```
Here we configure Chrome to use the proxy, disable image loading to speed things up, and accept insecure certificates so the proxy can intercept HTTPS traffic.
```python
def demo():
    with browser_and_proxy() as (browser, proxy):
        browser.get('https://www.airbnb.com/s/Seattle--WA--United-States/homes')

        first_listing_page_1 = browser.find_element_by_xpath('//div[contains(@id,"listing")]')
        print('First Listing Page 1: {}'.format(first_listing_page_1.text))

        page_2 = browser.find_element_by_xpath('//li[@data-id="page-2"]')
        page_2.click()

        first_listing_page_2 = browser.find_element_by_xpath('//div[contains(@id,"listing")]')
        print('First Listing Page 2: {}'.format(first_listing_page_2.text))
```
Here we change the demo code to open Airbnb and try to print the first listing on pages 1 and 2.
When I run this, the first listing on page 2 is either the same as on page 1, or I get a StaleElementReferenceException, which means the element has changed out from under us.
Clearly we have an issue, and we can fix it with the help of BrowserMob Proxy.
By using the network traffic monitor in Chrome, I figured out that the important Airbnb requests are those to https://www.airbnb.com/api/v2/explore_tabs.
Let’s add a function to help us debug.
```python
def scan_har(proxy):
    for e in proxy.har['log']['entries']:
        if 'explore_tabs' in e['request']['url']:
            print('Request URL: {}'.format(e['request']['url']))
            print('Status: {}'.format(e['response']['status']))
```
And call it as part of the demo:
```python
def demo():
    with browser_and_proxy() as (browser, proxy):
        browser.get('https://www.airbnb.com/s/Seattle--WA--United-States/homes')

        first_listing_page_1 = browser.find_element_by_xpath('//div[contains(@id,"listing")]')
        scan_har(proxy)
        print('First Listing Page 1: {}'.format(first_listing_page_1.text))

        page_2 = browser.find_element_by_xpath('//li[@data-id="page-2"]')
        page_2.click()

        scan_har(proxy)
        first_listing_page_2 = browser.find_element_by_xpath('//div[contains(@id,"listing")]')
        print('First Listing Page 2: {}'.format(first_listing_page_2.text))
```
Here’s the output:
```
Request URL: https://www.airbnb.com/api/v2/explore_tabs...
Status: 200
First Listing Page 1: Cozy Private Loft Apt with Balcony ENTIRE LOFT Cozy Private Loft Apt with Balcony 2 guests · Studio · 1 bed · 1 bath Wifi · Kitchen 202 reviews 202 · Superhost Price: $80/night
Request URL: https://www.airbnb.com/api/v2/explore_tabs...
Status: 200
Request URL: https://www.airbnb.com/api/v2/explore_tabs...
Status: 0
First Listing Page 2: Cozy Private Loft Apt with Balcony 202 reviews
```
The problem is that last explore_tabs entry with status 0: the request fired by clicking to page 2 hasn't had a chance to complete before we look for the listing.
We can fix this by turning scan_har into scan_and_block_har, as follows, and calling it from the demo code instead.
```python
def scan_and_block_har(proxy):
    all_requests_finished = False
    while all_requests_finished is False:
        all_requests_finished = True
        for e in proxy.har['log']['entries']:
            if 'explore_tabs' in e['request']['url']:
                if e['response']['status'] != 200 or e['response']['content'].get('text') is None:
                    all_requests_finished = False
    return
```
The function looks at every request whose URL contains 'explore_tabs' and loops until each one has a 200 response with captured content.
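One caveat with blocking like this: if a request never completes (say, the site returns an error), the loop spins forever, and every read of `proxy.har` is a round trip to the proxy. Here is a variant of my own (not from the original project) that adds a timeout and a pause between polls:

```python
import time

def scan_and_block_har_with_timeout(proxy, timeout=30.0, poll_interval=0.5):
    """Block until every explore_tabs request has a 200 response with a body,
    or raise TimeoutError after roughly `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        pending = [
            e for e in proxy.har['log']['entries']
            if 'explore_tabs' in e['request']['url']
            and (e['response']['status'] != 200
                 or e['response']['content'].get('text') is None)
        ]
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError('explore_tabs requests did not finish in time')
        # each proxy.har read is an HTTP call to the proxy, so pause between polls
        time.sleep(poll_interval)
```

Swapping this in for scan_and_block_har in the demo turns a silent hang into a clear failure you can catch and retry.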
Now when we run the script, the output is what we expect:
```
First Listing Page 1: Cozy Private Loft Apt with Balcony ENTIRE LOFT Cozy Private Loft Apt with Balcony 2 guests · Studio · 1 bed · 1 bath Wifi · Kitchen · Washer 4.86 202 reviews (202) · Superhost Price: $80/night
First Listing Page 2: Inviting room, easy walk to lake and light rail. PRIVATE ROOM IN HOUSE Inviting room, easy walk to lake and light rail. 2 guests · 1 bedroom · 1 bed · 1 shared bath Wifi 4.92 123 reviews (123) · Superhost Price: $39/night
```
Step 2 – Running locally in Docker
To get the above running in Docker, we need to add browsermob-proxy and Java to our Dockerfile as follows:
```
FROM python:3.6

# install google chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get -y update --fix-missing
RUN apt-get install -y google-chrome-stable

# install chromedriver
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# install browsermob proxy, needed for selenium_browsermob
RUN wget -O /tmp/browsermob-proxy.zip https://github.com/lightbody/browsermob-proxy/releases/download/browsermob-proxy-2.1.4/browsermob-proxy-2.1.4-bin.zip
RUN unzip /tmp/browsermob-proxy.zip -d /usr/local/bin/

# install java, needed for selenium_browsermob
RUN apt-get install -y openjdk-8-jre

# set display port to avoid crash
ENV DISPLAY=:99

# Install requirements first so this step is cached by Docker
COPY ./requirements.txt /home/selenium-aws-fargate-demo/requirements.txt
WORKDIR /home/selenium-aws-fargate-demo/
RUN pip install -r requirements.txt

# copy code
COPY ./ /home/selenium-aws-fargate-demo/
RUN python setup.py install
```
We can rebuild the Docker image and test that it works with the following commands:

```
docker build -t selenium-aws-fargate-demo .
docker run -it selenium-aws-fargate-demo python selenium_browsermob/main.py Docker
```
You should see the program run successfully.
Step 3 – Running in Fargate
To run in AWS, you just need to push the new image to ECR and update your task with the new command, as described in my previous post.