This is part 3 of my series of posts on Python, Selenium, and Fargate.

  • Part 1 — Run a Python Selenium web scraper on AWS Fargate
  • Part 2 — Adding Browsermob Proxy to sniff traffic and have more confidence in whether the website you’re trying to scrape has loaded
  • Part 3 — exception handling strategies for when something inevitably crashes (this post)

The scrape job that powers Torres App takes about 4 hours to run. Early on, it would often fall victim to random Selenium exceptions such as:

  • NoSuchElementException
  • TimeoutException
  • WebDriverException

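To make the failure modes concrete, here's roughly how they surface; the URL, the element id, and the driver setup below are placeholders, not from the real scraper:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.set_page_load_timeout(30)   # a page that stalls past this raises TimeoutException
driver.get("https://example.com")
try:
    results = driver.find_element_by_id("results")
except NoSuchElementException:
    # the page loaded, but the element never appeared; WebDriverException
    # is what you see when the browser process itself dies mid-call
    pass
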
Browsers crash and the internet sometimes doesn’t work, so I needed to make my scraper resilient to these failures. Here’s what I did, building on the previous post in the series.

The code for this example is here.

selenium_exceptions/main.py

import contextlib
from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException
import argparse
import traceback
import time

from selenium_exceptions import settings

config = settings.Config

First, I updated the import list to include the Selenium exceptions I wanted to catch.

def retry(func, *args):
    retries = 10
    while True:
        try:
            return func(*args)
        except (NoSuchElementException, TimeoutException, WebDriverException):
            retries -= 1
            if retries <= 0:
                # out of retries: re-raise so the failure isn't swallowed
                raise
            print("Retries left {}, continuing on {}".format(retries, traceback.format_exc()))
            time.sleep(5)

I added a retry function that catches these exceptions, logs the traceback, waits five seconds, and calls the scraping function again, re-raising once the retries are exhausted.

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('config', type=str, nargs='?', help='the config class')
    args = parser.parse_args()
    # only override the default settings.Config assigned above when a
    # config class name is actually passed on the command line
    if args.config:
        config = getattr(settings, args.config)
    # demo is the scraping function from the earlier posts in this
    # series; its definition is omitted from this excerpt
    retry(demo)

Finally, I changed the main call to go through the retry function.
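
With that in place, the job runs the same way as before; assuming the package layout from the earlier posts, something like:

python -m selenium_exceptions.main Config

Any of the three exceptions now costs at most a retry and a five-second pause instead of the whole four-hour run.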

Obviously, this example is trivial, but the same pattern can be combined with an object that maintains some scrape state, so the browser can be restarted part way through the job after a crash and pick up where it left off.
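
Here’s a minimal sketch of that idea, reusing the imports from main.py above; the ScrapeJob class and the URL list are hypothetical, and the real job would also wire the Browsermob proxy from part 2 into start_browser. The key detail is that position lives outside the browser, so when retry() calls run() again after a crash, the job resumes where it stopped:

class ScrapeJob:
    """Holds scrape progress so the browser can be restarted mid-job."""

    def __init__(self, urls):
        self.urls = urls
        self.position = 0   # index of the next URL to scrape
        self.driver = None

    def start_browser(self):
        # throw away any dead browser and start a fresh one
        if self.driver is not None:
            with contextlib.suppress(WebDriverException):
                self.driver.quit()
        self.driver = webdriver.Chrome()

    def run(self):
        self.start_browser()
        while self.position < len(self.urls):
            self.driver.get(self.urls[self.position])
            # ... extract data from the page here ...
            self.position += 1  # only advance after a successful page

job = ScrapeJob(["https://example.com/1", "https://example.com/2"])
retry(job.run)  # after a crash, retry() reruns run(), which rebuilds
                # the browser and resumes from self.position

Because run() only increments position after a page succeeds, a crash three hours in costs one page load on the next attempt, not the whole run.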

Last modified: August 10, 2019
