Lab 20 - URL Tools

Due by 11:59pm on 2023-11-09.

Starter Files

Write your code for this lab in the empty lab20.py file provided in lab20.zip.

Topics

URL Structure

[Figure: the structure of a URL]

We've looked at URLs a decent amount up to this point, but now we are going to give concrete names (based on urllib) to the different parts you might find in a given url:

  • scheme : The scheme of a url is the very beginning part, right before the www. piece. It is nearly always https or http for sites you will visit.
  • netloc : This is the classic domain that we have been discussing. This is the cs111.byu.edu part or the www.google.com part.
  • path : The path of a url is everything that comes after the domain. For the current url you are looking at, this would be /lab/lab20. Note that this path could also end in a file format such as /lectures/Stephens/Lecture30-Hyperlinks.pptx. In this case, your browser will often prompt you to download the file rather than redirect to a new webpage.
  • query : Sometimes websites store extra information in the url using a ?. For example, on YouTube if you click on a video, you will notice the url ends in /watch?v=<some characters>. We won't use url queries in this class, but they are what follows that ?.
  • fragment : A fragment is the #<header name> part that is sometimes added to the end of a URL. It stores your location within a single webpage. Try clicking one of the links on the left side of this webpage and you'll see the associated fragment added to your url. This type of link will come up later in this class.
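To see all of these parts at once, here is a short sketch that parses a made-up YouTube-style URL (the video id and fragment name are invented for illustration) with urllib.parse.urlparse, which is covered in more detail below:

```python
from urllib.parse import urlparse

# Hypothetical URL containing a scheme, netloc, path, query, and fragment
parsed = urlparse('https://www.youtube.com/watch?v=abc123#description')
print(parsed.scheme)    # 'https'
print(parsed.netloc)    # 'www.youtube.com'
print(parsed.path)      # '/watch'
print(parsed.query)     # 'v=abc123'
print(parsed.fragment)  # 'description'
```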

Urllib Library

When writing programs and bots to access the internet, an important part of the process is url parsing. While in a perfect world your program would always be given a domain (such as https://cs111.byu.edu) from the start, if your program is following a link from a different webpage this might not be the case. Urllib exists to solve this problem.

Urllib has most of the link processing we need built in by default. The library is split up into four modules:

  • urllib.request : Used for fetching and reading web links. We won't use this module since we already have the requests library.
  • urllib.parse : Used for parsing urls. This is the main part of urllib that we will use.
  • urllib.error : Used for managing any errors thrown by urllib.request.
  • urllib.robotparser : Used for determining which parts of a website you are (and aren't) permitted to access, based on its robots.txt file. This topic should have been covered somewhat in lecture already.
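As a quick illustration of the last module, here is a sketch using urllib.robotparser. The robots.txt rules here are invented for the example; a real program would call rp.read() to fetch the site's actual robots.txt over the network, but this sketch parses the rules directly so it runs offline:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# rp.set_url('https://cs111.byu.edu/robots.txt'); rp.read()  # network version
# Offline version: feed hypothetical robots.txt lines straight to the parser
rp.parse(['User-agent: *', 'Disallow: /private/'])

print(rp.can_fetch('*', 'https://cs111.byu.edu/lab/lab20/'))  # True
print(rp.can_fetch('*', 'https://cs111.byu.edu/private/x'))   # False
```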

Importing Urllib

To import urllib, you must import each module you want to use individually. For example, to use functions from urllib.parse you could use either of the two import statements below:

import urllib.parse
# OR
from urllib import parse

or to import just the function(s) you want:

from urllib.parse import urlparse, urljoin

Urllib.parse

As mentioned above, urllib.parse is the module we can use to parse our urls into parts which are easier to work with.

Although it has quite a few more functions which can be found by reading the docs, the relevant functions for this class are as follows:

  • urllib.parse.urlparse(<url>) : Given a full url, parses the url into its associated parts as discussed above, returning a ParseResult object. The parts can then be accessed by calling the associated attributes of the ParseResult object that is returned. (ex. scheme = urlparse(<url>).scheme)
  • urllib.parse.urlsplit(<url>) : Functions like urlparse, except that it returns a tuple-like SplitResult of the form (scheme, network location, path, query, fragment) rather than a ParseResult object.
  • urllib.parse.urljoin(<url>, <url to join>) : Given a base URL as well as a second URL, joins the two, taking parts from the base URL as needed and replacing other parts with the second URL. urljoin examples:

urljoin('https://cs111.byu.edu', '/lab/lab20/') -> 'https://cs111.byu.edu/lab/lab20/'
urljoin('https://cs111.byu.edu/lab/', 'lab20/assets/') -> 'https://cs111.byu.edu/lab/lab20/assets/'
urljoin('https://cs111.byu.edu/proj/proj4/', 'https://cs111.byu.edu/lab/lab20/') -> 'https://cs111.byu.edu/lab/lab20/'

Note: Notice the difference between the first two examples: a leading '/' makes the second URL a path from the root of the domain, while without it the second URL is resolved relative to the current directory.
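Since urlsplit returns a tuple-like result, its pieces can also be unpacked in one line. A minimal sketch, using this lab's own URL:

```python
from urllib.parse import urlsplit

# SplitResult is a tuple subclass, so it unpacks into its five parts
scheme, netloc, path, query, fragment = urlsplit('https://cs111.byu.edu/lab/lab20/#topics')
print(scheme)    # 'https'
print(netloc)    # 'cs111.byu.edu'
print(path)      # '/lab/lab20/'
print(fragment)  # 'topics'
```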

Given a url such as https://cs111.byu.edu/lab/lab20/, the following code shows how to use urllib.parse.urlparse to get the separate pieces:

>>> from urllib.parse import urlparse
>>> parsed = urlparse('https://cs111.byu.edu/lab/lab20/')
>>> parsed.scheme
'https'
>>> parsed.netloc
'cs111.byu.edu'
>>> parsed.path
'/lab/lab20/'
>>> parsed.fragment
''

Required Questions

Q1: Get Domain

Write a function called get_domain. Given a url as a function parameter, it should return the full domain of the url, or an empty string if there is no full domain in the url. It should also return an empty string if the scheme of the url is not valid (we'll consider http and https valid).

Note: You may need to use some string manipulation in addition to urllib to achieve this functionality.

Example output:

>>> get_domain('https://cs111.byu.edu/lab/lab20/')
'https://cs111.byu.edu'
>>> get_domain('http://en.wikipedia.org/w/index.php')
'http://en.wikipedia.org'
>>> get_domain('proj/proj4/')
'' # an empty string
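One possible sketch (not the official solution, and your approach may differ): parse the url, validate the scheme, and rebuild the domain from the scheme and netloc attributes:

```python
from urllib.parse import urlparse

def get_domain(url):
    """Return '<scheme>://<netloc>' for http/https urls, else ''."""
    parsed = urlparse(url)
    # Reject missing domains and anything that isn't http or https
    if parsed.scheme not in ('http', 'https') or not parsed.netloc:
        return ''
    return parsed.scheme + '://' + parsed.netloc
```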

Q2: Combine Paths

Write a function called combine_paths. Given a url and a path to another page on the same website, it should return the full url to the other page. You can expect that your function will only be given valid URLs and paths.

Example output:

>>> combine_paths('https://cs111.byu.edu/lab/lab15/', '/lab/lab20/')
'https://cs111.byu.edu/lab/lab20/'
>>> combine_paths('https://cs111.byu.edu/hw/hw03/#part-2', '/articles/about/')
'https://cs111.byu.edu/articles/about/'
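One possible sketch (my own illustration, not the official solution): since the path always starts with '/', it can be attached directly to the domain recovered with urlparse:

```python
from urllib.parse import urlparse

def combine_paths(url, path):
    """Join an absolute path onto the domain of url."""
    parsed = urlparse(url)
    # The path replaces everything after the domain, including any fragment
    return parsed.scheme + '://' + parsed.netloc + path
```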

Q3: Combine URLs

Write a function called combine_urls. Given two urls (a base url and a url to join), it should return the full URL of the webpage pointed to by the url to join. Try using a built-in urllib function to help with this task.

Example output:

>>> combine_urls('https://cs111.byu.edu/lab/lab15/', '/lab/lab20/')
'https://cs111.byu.edu/lab/lab20/'
>>> combine_urls('https://cs111.byu.edu/lab/lab08', 'lab20/')
'https://cs111.byu.edu/lab/lab20/'
>>> combine_urls('https://cs111.byu.edu/hw/hw05/', 'https://www.wikipedia.org')
'https://www.wikipedia.org'
>>> combine_urls('https://cs111.byu.edu/lab/lab20/assets/page1.html', 'page2.html')
'https://cs111.byu.edu/lab/lab20/assets/page2.html'
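One possible sketch (an illustration, not the official solution): urljoin already handles every case shown above, so the function can simply delegate to it:

```python
from urllib.parse import urljoin

def combine_urls(base_url, url_to_join):
    """Resolve url_to_join against base_url, absolute or relative."""
    return urljoin(base_url, url_to_join)
```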

Q4: Print Pages

Write a function called print_pages. Given a url, a list of paths and pages, and an output file name, it should visit each of those paths and pages and write the contents of ALL of the pages to the same output file. The start of each page should be written on its own line. Each new path/page should be combined with the full url of the previous page visited (e.g. the first path/page is combined with the url passed into the function, the second path/page is combined with the full url of the first page, and so on).

Paths vs Pages: If the item in the list that you are processing starts with a forward slash (/lab/lab20/assets/page1.html), it is a path, and should be appended to the domain directly and visited. If the item does not begin with a forward slash (page2.html), it is a page in the same "folder" as the previous file processed (https://cs111.byu.edu/lab/lab20/assets/page2.html).

Example output:

>>> print_pages('https://cs111.byu.edu', ['/lab/lab20/assets/page1.html', 'page2.html'], 'pages.output.txt') # no print output because you are just writing to the output file
>>> print_pages('https://cs111.byu.edu/proj/proj4/', ['/lab/lab20/assets/page1.html', 'page2.html'], 'pages.output.txt')

Both of the above function calls should write the same thing to the output file:

you found page1! good job.
this line is from page2.html

Hint: You will need to use requests as well for this part of the assignment.
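One possible sketch of the URL-chaining logic (the helper name resolve_pages is my own; the lab only requires print_pages). urljoin naturally implements the paths-vs-pages rule, since a leading '/' replaces the whole path while a bare page name replaces only the last segment:

```python
from urllib.parse import urljoin

def resolve_pages(url, items):
    """Compute each item's full url, each relative to the previous one."""
    full_urls = []
    current = url
    for item in items:
        current = urljoin(current, item)  # '/...' -> from domain, else same folder
        full_urls.append(current)
    return full_urls

def print_pages(url, items, output_filename):
    import requests  # assumed installed, as used elsewhere in the course
    with open(output_filename, 'w') as f:
        for full_url in resolve_pages(url, items):
            text = requests.get(full_url).text
            # Make sure the next page starts on its own line
            f.write(text if text.endswith('\n') else text + '\n')
```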

Submit

Submit the lab20.py file to Gradescope through the window on the Canvas assignment page.

© 2024 Brigham Young University, All Rights Reserved