Homework 7 - Robots

Due by 11:59pm on 2023-12-06.

Objectives

  • Learn about web scraping etiquette
  • Learn about robots.txt
  • Create the RequestGuard module to enforce web scraping etiquette

Starter Files

Download hw07.zip. Inside the archive, you will find the test files for this homework.

Introduction

When accessing files on the World Wide Web via a program, there are etiquette rules to follow. One of these is to respect the wishes of website owners by not accessing files specified in the site's robots.txt file. This file lists the paths and files on a domain that a "robot" or automated script should not access because they are meant only for human visitors.

For Project 4, it is absolutely critical that your program respects the robots.txt file and limits its page loads to the initially specified domain. In addition to just being proper web etiquette, if your program wanders off the specified site and starts traversing the entire internet it could have negative repercussions for you, your fellow students, and the University. This has happened in the past and BYU has been blocked from accessing certain important websites. Please keep this in mind and don't point your program at major websites. Limit it to small sites you control or the ones we give you to test with. After the semester is over, and if you are accessing the web from a non-BYU connection, you can do whatever you want.

This homework focuses on building the functionality you'll need in your project to read and respect the robots.txt file for a given domain. You will build a small module that your full web crawler in Project 4 will use.

Part 1 - Getting Started

Task 1 - Installing libraries

For this homework, you'll need the requests library installed. If you've done Lab 21, you will already have it installed, but if not, you need to install it now.

To do so, open a terminal window where you normally run your python commands and install the package using pip:

pip install requests

This should install the requests library. You can test that it was installed correctly by opening up a python interpreter and running the following command:

import requests

If that returns an error, something didn't work. If it just returns the interpreter prompt, you are good to go.

Task 2 - Create the RequestGuard module

The code for this project will be contained in a module called RequestGuard. Start by creating a RequestGuard.py file. Code in this file will be using the requests library so go ahead and import that at the top of your file.

You should be writing tests to verify your code is doing what it should. You can either add the if __name__ == "__main__": line to your module and write tests below it, or create pytest or doctest tests in the file along with your code.
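For reference, here is a minimal sketch of how the file might start. The class body is filled in during Part 2, and the re import is only an assumption based on the regular-expression suggestions later in this write-up; nothing here is required verbatim by the autograder.

import re
import requests


class RequestGuard:
    # Methods are added in Part 2.
    pass


if __name__ == "__main__":
    # Informal tests can go here, for example:
    # guard = RequestGuard("https://cs111.byu.edu")
    # print(guard.forbidden)
    pass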

Part 2 - Handling robots.txt

With everything set up, let's start coding.

Because of the importance of this functionality, we'll walk you through this fairly carefully. In the final project, this will also be the first thing the autograder will test, and it will test it explicitly. If it isn't working properly, the autograder will not test the rest of the project and you will not receive any points on the project.

Task 1 - Obtain and parse robots.txt

Because this is so important, we're going to write this as a RequestGuard class that can be tested completely independently of the project. In your RequestGuard.py file, create a RequestGuard class. Use the same capitalization, as the autograder creates objects of this class. This class will be fairly simple, and only needs four methods: __init__(), can_follow_link(), make_get_request(), and parse_robots(). You may add additional helper methods as appropriate.

First we will write the parse_robots() method. It will take no parameters besides self. It will read the domain instance variable initialized in the __init__() method, which we will write later. It will return a list of the paths to exclude as given in the robots.txt file, which it will request from the domain. Later the can_follow_link() method will refer to this list.

To get the data from the robots.txt file, we need to download the file, read each line, and parse it to add the prohibited paths to the list we will return. By the web protocol, this file will be located at the top level of the domain and so will be found at

<domain>/robots.txt

Depending on the situation, your RequestGuard could be given a simple domain (such as https://cs111.byu.edu) or a url that also has a path on the end (such as https://cs111.byu.edu/hw/hw07/). Because of this, you must make your RequestGuard strip the path (such as /hw/hw07/) off of any url passed in so that it can properly access robots.txt at <domain>/robots.txt (in this case https://cs111.byu.edu/robots.txt).
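As one possible sketch, a regular expression can keep just the scheme and host and drop everything after them. The helper name get_domain here is only for illustration and is not required by the autograder; you could just as easily do this directly inside __init__().

import re

def get_domain(url):
    # Keep "https://host" (or "http://host") and drop any path that follows.
    match = re.match(r"(https?://[^/]+)", url)
    return match.group(1) if match else url

# get_domain("https://cs111.byu.edu/hw/hw07/") -> "https://cs111.byu.edu"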

Using the requests library, get the robots.txt file. These have the form of:

User-agent: *
Disallow: /data
Disallow: /images/jpg
Disallow: /Projects/Project4/Project4.md
...

The User-agent tag specifies which user agents the following rules apply to. Often it is *, which means all agents. For this homework you will ignore this line and assume that all the following rules apply to your program.

The Disallow lines are the items that your program should ignore. They are all paths relative to the top of the domain, so /data means that anything in <domain>/data should not be downloaded by your program. You need to extract these entries from the file and create a list of the disallowed items. The first item, /data, is an example of a top-level directory. /images/jpg means ignore anything in the <domain>/images/jpg subdirectory, but <domain>/images and <domain>/images/png are fine to access. The final line in the example simply disallows a specific file.

Again, this is an excellent place to use regular expressions to look for the "Disallow" text and capture the contents that follow it.

Your parse_robots() method should return a list of the paths to avoid. Given the robots.txt file in the example above, parse_robots() should return this list:

["/data", "/images/jpg", "Projects/Project4/Project4.md"]

Task 2 - Limiting where your program goes

The __init__() method should take a url as its input parameter, and store the domain of the url as an instance variable. Domains are of the form https://cs111.byu.edu while a given url such as https://cs111.byu.edu/hw/hw07/ might be passed in. In such a case, you must strip the path (/hw/hw07/) off of the domain before storing it. The constructor should also call the parse_robots() method from Task 1, and store the returned list of forbidden paths in an instance variable called self.forbidden. By calling parse_robots() in the constructor, we ensure that any RequestGuard objects our code encounters later will already be aware of the forbidden paths before we call can_follow_link().

Note: The pytest tests rely on the variable where you store the forbidden paths being called self.forbidden so that they can properly test your RequestGuard class.
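A sketch of the constructor, assuming the same regular-expression approach to stripping the path shown earlier; any equivalent way of reducing the url to its domain is fine.

def __init__(self, url):
    # Store only the scheme and host, then fetch and parse robots.txt right away.
    match = re.match(r"(https?://[^/]+)", url)
    self.domain = match.group(1) if match else url
    self.forbidden = self.parse_robots()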

Now we will write the can_follow_link() method. This method takes a URL as input and does the following checks:

  1. Does the link start with the stored domain? - If it doesn't, return False. Otherwise proceed to the next check.
  2. Does the link contain any of the paths in the forbidden list produced by parse_robots()? - If it does, return False; otherwise return True. (A sketch putting both checks together appears after the examples below.)

This is a good opportunity to use regular expressions to match the domain and the individual paths in the robots.txt file.

Note that for the second condition, the paths have to immediately follow the domain. So if the domain is "https://cs111.byu.edu" and an item in the path list is "/data", then a URL that begins "https://cs111.byu.edu/data" would match condition 2 and should return False, but a URL like "https://cs111.byu.edu/Projects/Project1/data" would not match because the "/data" part of the URL does not immediately follow the domain.

Examples: If the domain provided is "https://cs111.byu.edu" and the forbidden paths list is ['/data', '/images', '/lectures'] then you should get the following return values for the specified inputs:

  • "https://byu.edu" - False - doesn't match the domain
  • "https://cs111.byu.edu/HW/HW01" - True - matches the domain and doesn't match anything in the path list
  • "https://cs111.byu.edu/images/logo.png" - False - matches the domain but also matches /images
  • "https://cs111.byu.edu/data/spectra1.txt" - False - matches the domain but also matches /data
  • "https://cs111.byu.edu/Projects/Project4/images/cat.jpg" - True - matches the domain and doesn't match any of the paths. It contains /images but not immediately following the domain.

Task 3 - Wrapping the requests.get method

Lastly, we must write the make_get_request() method. This method is very simple. It checks if the link it is passed can be followed, and if it can, returns whatever requests.get() would return if passed the same arguments. If the link cannot be followed, it returns None.

Because requests.get() can sometimes take arguments other than just the url, we must add extra optional parameters to our make_get_request() method to support that. We can do this by giving a parameter a default value directly in the function definition, like this:

def myfunc(arg1, arg2=None):
    innerfunc(arg1, arg2)

In the above example, whenever we call myfunc, we only ever need to pass in arg1, and arg2 will simply default to None. However, if we do pass in a value for arg2, it will override the default None value and be passed on to innerfunc. For your make_get_request method, it will look something like this:

def make_get_request(self, url, use_stream=False):
    # DO YOUR CHECKS HERE
    return requests.get(url, stream=use_stream)
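As a quick informal test (for example, under your if __name__ == "__main__": block), the finished class might be exercised like this. The URLs are illustrative, and the expected results depend on what the domain's robots.txt actually disallows.

guard = RequestGuard("https://cs111.byu.edu/hw/hw07/")
print(guard.forbidden)                                                    # paths parsed from robots.txt
print(guard.can_follow_link("https://cs111.byu.edu/data/spectra1.txt"))   # False if /data is forbidden
response = guard.make_get_request("https://cs111.byu.edu/HW/HW01")        # None if the link is forbidden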

Turn in your work

You'll submit your RequestGuard.py file on Canvas via Gradescope, where it will be checked by the autograder. Make sure that you haven't "hard coded" anything specific to the test data we gave you. We do not guarantee that all scenarios are tested by the test data that we have provided.

© 2023 Brigham Young University, All Rights Reserved