Lab 19 - Requests and Beautiful Soup Library

Due by 11:59pm on 2023-04-02.

Starter Files

Download lab19.zip. Inside the archive, you will find starter files for the questions in this lab.

Topics

Requests Library

Installing the Requests Library

To install the Requests library, type one of the following into the terminal:

pip install requests

python3 -m pip install requests

To check that you installed the library correctly, type the following in the Python interpreter (type python3 into the terminal to start it) and make sure it does not raise an error:

>>> import requests

To uninstall any library, type pip uninstall library_name or python3 -m pip uninstall library_name

Requests Review

When asking for a web page from a server on the internet, your device, or client, can send several different types of requests. In this lab, we will only be looking at GET requests. The Requests library allows us to send a request and use the information in the response in our code. To do this, use the get function. Import the requests library and write this code,

response_object = requests.get("URL")

where "URL" would be a string containing an actual URL.

The response_object has several attributes we can use. For example, we have the following attributes:

url - Returns the URL of the web page requested
text - Returns the web page's content (its HTML) as a string
content - Returns the web page's content as bytes
status_code - Returns a code representing the status of retrieving the web page. For more information, refer to the Wikipedia article on HTTP status codes. The most common status code is 200, which means that everything went OK when retrieving the web page.
headers - Returns a dictionary with additional information the server sent back with the response.
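
The attributes above can be sketched without a network connection. The following is only an illustration: summarize_response is a hypothetical helper (not part of the Requests library), and stand_in is a made-up object that merely has the same attribute names as a real response, with example.com as a placeholder URL.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect a few attributes of a response-like object into a plain dict."""
    return {
        "url": response.url,
        "succeeded": response.status_code == 200,  # 200 means everything went OK
        "content_type": response.headers.get("Content-Type"),
        "characters": len(response.text),    # .text is a string
        "bytes": len(response.content),      # .content is bytes
    }


# Assumption: a stand-in object so the sketch runs offline.
page = "<p>café</p>"
stand_in = SimpleNamespace(
    url="https://example.com/",
    status_code=200,
    headers={"Content-Type": "text/html; charset=utf-8"},
    text=page,
    content=page.encode("utf-8"),
)

summary = summarize_response(stand_in)
print(summary)
```

The character count and the byte count differ here because the letter é takes two bytes in UTF-8. With the real library, the same helper could be given the result of requests.get("https://cs111.byu.edu").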

Try out the following code:

import requests

response_object = requests.get("https://cs111.byu.edu")
print(response_object.url)
print(response_object.status_code)
print(response_object.headers)

Who is the mascot mentioned in the headers? What was the message to posterity?

Refer to w3schools for more information on the attributes: https://www.w3schools.com/python/ref_requests_response.asp

Beautiful Soup Library

Installing the Beautiful Soup Library

To install the Beautiful Soup library, type one of the following into the terminal:

pip install beautifulsoup4

python3 -m pip install beautifulsoup4

To check that you installed the library correctly, type the following in the Python interpreter (type python3 into the terminal to start it) and make sure it does not raise an error:

>>> import bs4

To uninstall any library, type pip uninstall library_name or python3 -m pip uninstall library_name

Beautiful Soup Review

One way to access a website's information is to retrieve its HTML and parse it. The Beautiful Soup library lets us parse HTML fairly easily. To start, your code should look similar to the following:

import bs4, requests

r = requests.get("URL")
soup_object = bs4.BeautifulSoup(r.content, features="html.parser") # `features` prevents a warning that is unimportant for this class

Once we have a beautiful soup object, we have access to several useful methods.

.prettify() returns a string of all the web page's HTML nicely indented and formatted.
- Add the following to the code above, provide a valid URL, and demo it:
```
print(soup_object.prettify()) 
```
  Compare it to the actual HTML contents of the web page given by the URL.
.find_all('tag') takes in as an argument a string containing the tag and returns a list of tag objects.
- For example, if we wanted to find all the paragraph (<p>) tags in a website, we could do the following:
```
>>> list_of_tags = soup_object.find_all('p')
>>> print(list_of_tags)
[<p>Computer Science is amazing!</p>, <p>I want to become a CS Major!</p>]
```
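
Both methods can also be tried without a network connection by parsing a string directly. This is a minimal sketch; the HTML string is made up for illustration, and get_text() is used to pull out just the text inside each tag.

```python
from bs4 import BeautifulSoup

# Assumption: a small made-up page, so no request is needed.
html = (
    "<html><body><h1>Why CS?</h1>"
    "<p>Computer Science is amazing!</p>"
    "<p>I want to become a CS Major!</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, features="html.parser")

# .prettify() returns one string with each tag on its own indented line.
pretty = soup.prettify()
print(pretty)

# .find_all('p') returns a list of tag objects, one per <p> tag.
paragraphs = soup.find_all('p')
paragraph_texts = [p.get_text() for p in paragraphs]
print(paragraph_texts)
```

Printing a tag object shows its full HTML, including the opening and closing tags, while get_text() gives only the text between them.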

Tags also have several useful methods and attributes:

.get('attr')
- In some cases, we want to access a tag's attributes. Given a tag object, we can use its .get() method to access its attribute's content. For example, if we wanted to access an image tag's width attribute, we can do the following:
```
>>> img_tag = soup_object.find('img') # .find returns one tag instead of a list of tags
>>> width = img_tag.get('width') 
>>> print(width)
111
```
- If the attribute does not exist, the method will return None.
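
The None case is easy to hit when only some tags carry an attribute. A small sketch, assuming a made-up HTML string with two image tags, only one of which has a width:

```python
from bs4 import BeautifulSoup

# Assumption: made-up HTML; the second image has no width attribute.
html = '<img src="logo.png" width="111"><img src="banner.png">'
soup = BeautifulSoup(html, features="html.parser")

images = soup.find_all('img')

# .get returns the attribute's value as a string, or None if it is missing.
widths = [img.get('width') for img in images]
print(widths)

# A second argument to .get is used as the fallback instead of None.
widths_with_default = [img.get('width', 'unknown') for img in images]
print(widths_with_default)

# .find also returns None when no matching tag exists at all.
missing_tag = soup.find('table')
print(missing_tag)
```

Checking for None before using the result avoids errors when a tag or attribute is absent.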

.attrs

.attrs is a dictionary. The keys of .attrs are the attribute names and the values are strings containing each attribute's associated information. For example, with the tag <p style="font-size: 12px; font-color: blue;" id="this_id">Hello, World</p>:

>>> from bs4 import BeautifulSoup
>>> data = '<p style="font-size: 12px; font-color: blue;" id="this_id">Hello, World</p>'
>>> soup = BeautifulSoup(data, features="html.parser")
>>> tag = soup.find('p')
>>> print(tag.attrs)
{'style': 'font-size: 12px; font-color: blue;', 'id': 'this_id'}
>>> print(tag.attrs['style'])
font-size: 12px; font-color: blue;
>>> print(tag.get('style'))
font-size: 12px; font-color: blue;
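
Because .attrs is an ordinary dictionary, the usual dictionary operations apply. The sketch below uses a made-up link tag to show looping over the attributes and checking whether one is present:

```python
from bs4 import BeautifulSoup

# Assumption: a made-up tag for illustration.
data = '<a href="/staff" id="staff_link">Staff</a>'
soup = BeautifulSoup(data, features="html.parser")
tag = soup.find('a')

# Loop over attribute names and values like any other dictionary.
for name, value in tag.attrs.items():
    print(name, '->', value)

attribute_names = list(tag.attrs.keys())
has_id = 'id' in tag.attrs        # True: the tag has an id
has_title = 'title' in tag.attrs  # False: the tag has no title
```

Indexing with tag.attrs['title'] would raise a KeyError here, whereas tag.get('title') would return None.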

Required Questions

Q1: Downloading a Web Page

Write a function called download that has the parameters url and output_filename. The function should get the HTML text from the url and write it to the file named output_filename.

When getting the text, use .text rather than .content.

To test, get the HTML contents from CS111's pair programming article and write them to a file called lab19_test.txt. Compare your file to the HTML content on the webpage.

def download(url, output_filename):
    "*** YOUR CODE HERE ***"

Test your code:

python3 -m pytest test_lab19.py::test_download

Q2: Prettify

Write a function called make_pretty which takes in a url and an output_filename. The function should save the results of calling .prettify() on the web page given by the url to the output_filename.

def make_pretty(url, output_filename):
    "*** YOUR CODE HERE ***"

Test your code:

python3 -m pytest test_lab19.py::test_make_pretty

Q3: Finding Paragraphs

Write a function called find_paragraphs which takes in a url and an output_filename. The function should find all paragraph tags (<p>) on the webpage given by the url and write them to the output_filename.

def find_paragraphs(url, output_filename):
    "*** YOUR CODE HERE ***"

Test your code:

python3 -m pytest test_lab19.py::test_find_paragraphs

Q4: Finding Links

Write a function called find_links which takes in a url and an output_filename. The function should find all the hrefs in the web page and write them to the provided output_filename. Each link should be on its own line (1 link per line).

def find_links(url, output_filename):
    "*** YOUR CODE HERE ***"

Test your code:

python3 -m pytest test_lab19.py::test_find_links

Note: If you see something like /staff or #syllabus-course-policies in your output file when testing it yourself, these are valid links. The CS111 website was designed with links like that.
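
For the curious, links like these are relative: a browser combines them with the address of the page they appear on. The sketch below shows that combination using urljoin from Python's standard library. It is only an illustration of why such links are valid, and it assumes https://cs111.byu.edu/ as the page address; the find_links output should keep the hrefs as they appear in the page.

```python
from urllib.parse import urljoin

# Assumption: the page these hrefs were found on.
base_url = "https://cs111.byu.edu/"
hrefs = ["/staff", "#syllabus-course-policies", "https://www.w3schools.com/"]

# urljoin resolves each href against the page's address;
# hrefs that are already complete URLs are left unchanged.
absolute = [urljoin(base_url, href) for href in hrefs]
for link in absolute:
    print(link)
```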

Submit

Submit the lab19.py file on Canvas to Gradescope in the window on the assignment page.