Homework 1 - College Admissions Algorithms

This assignment was originally developed for CSCI1200 and INFO1201 at University of Colorado Boulder (with instructors Felix Muzny and Jason Zeitz), with the help of teaching assistants Mikhaila Friske and Jessie Smith, and ethics researchers Casey Fiesler and Natalie Garrett. Please feel free to use and adapt this assignment however you like. You can email casey.fiesler@colorado.edu with any questions.

Objectives

PLEASE READ ALL INSTRUCTIONS CAREFULLY BEFORE STARTING

Verify you have the basic prerequisite knowledge for the class
Practice file I/O
Understand how to iterate through a list
Practice implementing functions that return values
Explore algorithmic decision-making and representations of people in data
Understand outliers and data exploration

Introduction

As you know, the college admissions process draws on many types of data from prospective students to make decisions. With the number of applicants increasing, colleges may begin relying on algorithms to select which applications should receive more intensive human review. An algorithm could use quantitative data, such as GPA and SAT score, to provide initial recommendations. In fact, there is more data available than ever. Many colleges even track data about prospective student engagement (e.g., whether they open emails, visit the college website, or engage on social media). This creates a “demonstrated interest” value.

Based on a recent survey of college admissions officers, we know some of the weights that humans tend to give to these different types of data. Your task will be to create a program that iterates through a list of data points and provides a recommendation for which prospective students are likely to be the best candidates for admission.

Prospective student data is organized in the admission_algorithms_dataset.csv file such that the data for each student is on one line, with the values separated by commas. Examples of student data might be:

Student,SAT,GPA,Interest,High School Quality,Sem1,Sem2,Sem3,Sem4
Abbess Horror ,1300,3.61,10,7,95,86,91,94
Adele Hawthorne ,1400,3.67,0,9,97,83,85,86
Adelicia von Krupp ,900,4,5,2,88,92,83,72

The data includes (in order):

Student: a unique identifier for each student (their name)
SAT score: value between 0 and 1600
GPA: value between 0 and 5
Interest: value between 0 and 10 (from very low interest to very high interest)
High School Quality: value between 0 and 10 (from very low-quality to very high-quality high school)
Semester 1: average grade for semester 1
Semester 2: average grade for semester 2
Semester 3: average grade for semester 3
Semester 4: average grade for semester 4

Getting Started

To get started on this homework you should download hw01.zip and extract the contents into a new directory. The contents of the file are:

admissions.py - this is the file you will edit and turn in for grading
admission_algorithms_dataset.csv - a data file you can use for testing your code
Key files for testing your code - Each of these files contains the correct output for the homework when processing the admission_algorithms_dataset.csv file. Each key file has the same name as the expected output file, with key_ prepended. Thus, the key_student_scores.csv file should exactly match your student_scores.csv file.
- key_student_scores.csv
- key_chosen_students.csv
- key_outliers.csv
- key_chosen_improved.csv
- key_better_improved.csv
- key_composite_chosen.csv

Important: The homework will draw upon topics mentioned in Lab01 through Lab03 and their associated lectures. If needed, review the content mentioned. Additionally, make sure to close all the files you have opened; not doing so will break the autograder.

Part 1 - Getting set up with our data

First, we need to make sure that we can appropriately read in the data line by line, parsing each line into a list and converting each element in the list to the appropriate type. Your entire program should be contained in the admissions.py file.

Task 1

Read in your data set in the main() function, looping through its contents line by line. Make use of the str.split(delimiter) function to break individual lines into a list of elements. Make sure that you've done this by printing your list after using the split() function. You'll delete this print statement later but make sure to double check this before moving on! Once you have each line in a list, save the student's name in a variable, then delete the name from your list.
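A minimal sketch of the splitting step, assuming a hard-coded sample line (copied from the dataset example above) instead of a line read from the file. The variable names are illustrative only:

```python
# Hypothetical sample line; in your program this comes from the file loop.
line = "Abbess Horror ,1300,3.61,10,7,95,86,91,94\n"

# strip() removes the trailing newline; split(",") breaks the line on commas.
fields = line.strip().split(",")

# pop(0) returns the first element (the name) and deletes it from the list.
name = fields.pop(0)

print(name)    # the student's name, exactly as it appears in the file
print(fields)  # the remaining values, still strings at this point
```

Note that the name is kept exactly as it appears in the file; only the line as a whole is stripped.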

Task 2

Once you have a list of strings for each line, you will write a function convert_row_type() that takes one list of elements (representing the data for one student) as a parameter and converts it so that all numbers (currently represented as strings) are converted to floats. Make sure not to lose any information when you do this conversion! Implement this as a pure list function, i.e., return a new list and do not modify the list passed in.

Example:

Input: ["1300","3.61","10","7","95","86","91","94"]
Return value: [1300.0,3.61,10.0,7.0,95.0,86.0,91.0,94.0]
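One possible shape for such a pure function, shown as a sketch rather than the required implementation. It builds a new list and leaves its argument untouched:

```python
def convert_row_type(row):
    """Return a new list with every string in row converted to a float."""
    return [float(value) for value in row]


original = ["1300", "3.61", "10", "7", "95", "86", "91", "94"]
converted = convert_row_type(original)

print(converted)
print(original)  # still a list of strings, because the function is pure
```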

Task 3

In main, once you've called convert_row_type() on the list representing one row, call the provided check_row_type(). If this function returns False, print out an error message. Ensure that none of the rows in your data return False when passed to this function.

Task 4

Separate your data. Use list slicing to separate your list (which should contain 8 numbers at this point) into two lists: one that contains the student's SAT, GPA, Interest, and High School Quality scores, and one that contains their 4 semester grades. You'll do Parts 2 - 4 with the first list of 4 numbers and Part 5 with the list of grades.
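As a quick illustration of list slicing on a converted row (the variable names here are assumptions, not names required by the assignment):

```python
row = [1300.0, 3.61, 10.0, 7.0, 95.0, 86.0, 91.0, 94.0]

# row[:4] takes indexes 0-3; row[4:] takes index 4 through the end.
student_info = row[:4]
semester_grades = row[4:]

print(student_info)
print(semester_grades)
```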

Part 2 - Prospective Student Score (20 points)

Task 1

Write a function calculate_score() that takes a list as a parameter and calculates an overall score for each student based on the following weights:

30% SAT
40% GPA
10% demonstrated interest
20% strength of curriculum (the High School Quality value)

The list parameter will contain all of the relevant information about the student. The return value is the student’s calculated score.

To make this work, you will need to normalize both GPA and SAT so that they are also on a 0 to 10 scale. To do this, multiply the GPA by 2, and divide the SAT score by 160.

Example:

Input: [1300.0,3.61,10.0,7.0] - which represents a student with a 1300 SAT score, a 3.61 GPA, 10 out of 10 for interest and 7 out of 10 for high school quality

Output: ((1300 / 160) * 0.3) + ((3.61 * 2) * 0.4) + (10 * 0.1) + (7 * 0.2) = 7.73 out of 10

To match the autograder, perform the calculations in the above order and do not round.
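The formula above can be sketched as follows. This is one way to write it, assuming the list holds SAT, GPA, interest, and high school quality in that order:

```python
def calculate_score(info):
    """Return the weighted score for one student, unrounded."""
    sat, gpa, interest, quality = info
    # Same term order as the worked example: SAT, GPA, interest, quality.
    return ((sat / 160) * 0.3) + ((gpa * 2) * 0.4) + (interest * 0.1) + (quality * 0.2)


score = calculate_score([1300.0, 3.61, 10.0, 7.0])
print(f"{score:.2f}")
```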

If you are curious why you must do the calculations in the above order: computers cannot represent most floating-point numbers exactly, so the result of an operation can differ slightly from the exact answer, and that difference can depend on the order of the operations. This is called floating-point imprecision, if you want to research it. Here is one example in Python:
>>> 0.1 + 0.2
0.30000000000000004
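A second small illustration, using numbers chosen only for demonstration, shows that the grouping of additions can change the result:

```python
import math

# The same three numbers, added with different grouping.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)              # the two results are not identical
print(math.isclose(left, right))  # but they agree to within a tiny tolerance
```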

Task 2

In your main() function, modify your loop that reads in and converts your data to call the calculate_score() function for each line (row) of data (after you've converted it). Then, write the student's name and their calculated score to a new file called student_scores.csv such that each row contains a student’s name and their score, separated by a comma.

Example:

Abbess Horror ,1300,3.61,10,7,95,86,91,94
Adele Hawthorne ,1400,3.67,0,9,97,83,85,86
Adelicia von Krupp ,900,4,5,2,88,92,83,72

lines written to file:

Abbess Horror ,7.73
Adele Hawthorne ,7.36
Adelicia von Krupp ,5.79

To match the autograder, write the scores out to only two decimal places. This can be done with a formatted string:

name = "Dr. Stephens"
score = 7.1
print(f"{name},{score:.2f}")

This will print "Dr. Stephens,7.10"

You start the formatted string with an f, then enclose the variables in curly braces {}. The :.2f tells the formatter to print the variable as a floating-point number with 2 digits after the decimal point (.). You can use these formatted strings anywhere you would use a string, e.g., in print(), write(), and assignment statements.
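A sketch of using a formatted string with write(). It assumes two hard-coded (name, score) pairs and writes into a temporary directory so that it does not overwrite your real output; in your program you would open student_scores.csv directly:

```python
import os
import tempfile

# Hypothetical rows; in your program these come from the reading loop.
rows = [("Abbess Horror ", 7.7255), ("Adele Hawthorne ", 7.361)]

# Temporary location used only for this demonstration.
out_path = os.path.join(tempfile.mkdtemp(), "student_scores.csv")

out_file = open(out_path, "w")
for name, score in rows:
    # write() does not add a newline, so include \n explicitly.
    out_file.write(f"{name},{score:.2f}\n")
out_file.close()  # always close files you open

# Read the file back to check what was written.
check_file = open(out_path)
written = check_file.read()
check_file.close()
print(written)
```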

Task 3

Write the names for all students who have a score of 6 or higher to a file called chosen_students.csv. You should do this in your main() function, where you have access to the returned calculated score for each student and their student name.

Example:

Abbess Horror ,1300,3.61,10,7,95,86,91,94
Adele Hawthorne ,1400,3.67,0,9,97,83,85,86
Adelicia von Krupp ,900,4,5,2,88,92,83,72

lines written to file:

Abbess Horror
Adele Hawthorne

Important: Make sure to close all the files you opened in your program. Not doing so will break the autograder.

Before continuing, check that the files you wrote (chosen_students.csv and student_scores.csv) match the key files given under the test_files directory. A file difference checker, which you can find by searching online, can help.
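If you prefer to compare files from Python, the standard library's filecmp module is one option. The sketch below uses a hypothetical helper, files_match(), and two temporary stand-in files so that it runs on its own; with your real files you would pass your output path and the matching key file path:

```python
import filecmp
import os
import tempfile


def files_match(path_a, path_b):
    """Return True when the two files have identical contents."""
    # shallow=False forces a comparison of the actual file contents.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Stand-in files created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
mine = os.path.join(tmp_dir, "student_scores.csv")
key = os.path.join(tmp_dir, "key_student_scores.csv")
for path in (mine, key):
    demo_file = open(path, "w")
    demo_file.write("Abbess Horror ,7.73\n")
    demo_file.close()

result = files_match(mine, key)
print(result)
```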

Part 3 - Looking for Outliers (10 points)

Consider ways that this algorithm might systematically miss certain kinds of edge cases. For example, what if a student has a 0 for demonstrated interest because they don’t use social media or have access to a home computer? What if a student has a very high GPA but their SAT score is low enough to bring their score down; could this mean that they had a single bad test taking day?

Task 1

Write a function is_outlier() that can check for certain kinds of outliers. It should check that:

the demonstrated interest score is 0, or
the normalized GPA is more than 2 points higher than the normalized SAT score.

If either of these conditions is true, it should return True (because this student is an outlier); otherwise, the function returns False.
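The two checks could be sketched like this, assuming the function receives the same four-element list as calculate_score() and uses the same normalization (GPA times 2, SAT divided by 160):

```python
def is_outlier(info):
    """Return True if the student matches either outlier condition."""
    sat, gpa, interest, quality = info
    normalized_sat = sat / 160
    normalized_gpa = gpa * 2
    # Condition 1: no demonstrated interest.
    # Condition 2: normalized GPA exceeds normalized SAT by more than 2.
    return interest == 0 or (normalized_gpa - normalized_sat) > 2


print(is_outlier([1300.0, 3.61, 10.0, 7.0]))
print(is_outlier([1400.0, 3.67, 0.0, 9.0]))
print(is_outlier([900.0, 4.0, 5.0, 2.0]))
```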

Task 2

Call is_outlier() for each student from your main() function and write the name of each student who is an outlier to a file called outliers.csv, one name per line.

Task 3

Combine the work that you've done now to create an improved list of students to admit to your school. Write students' names, one per line, to the file chosen_improved.csv if they either have a score of 6 or greater OR if they are an outlier and their score is 5 or greater. Make sure to take advantage of the work that you’ve already done by calling your functions from previous problems to help you out!

Part 4 - Slightly Improved Algorithm (5 points)

Task 1

Create a calculate_score_improved() function that calculates a student score, checks if it is an outlier, and returns True if the student has a score of 6 or higher OR was flagged as an outlier; otherwise, return False. Make sure to take advantage of the work that you've already done by calling your functions from previous problems to help you out!
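A sketch of how the earlier functions might be reused here. The two helper definitions are repeated only so that the example runs on its own; in admissions.py you would call the versions you already wrote:

```python
def calculate_score(info):
    sat, gpa, interest, quality = info
    return ((sat / 160) * 0.3) + ((gpa * 2) * 0.4) + (interest * 0.1) + (quality * 0.2)


def is_outlier(info):
    sat, gpa, interest, quality = info
    return interest == 0 or ((gpa * 2) - (sat / 160)) > 2


def calculate_score_improved(info):
    """Return True for a score of 6 or higher, or for any outlier."""
    return calculate_score(info) >= 6 or is_outlier(info)


print(calculate_score_improved([1300.0, 3.61, 10.0, 7.0]))
print(calculate_score_improved([900.0, 4.0, 5.0, 2.0]))
print(calculate_score_improved([900.0, 2.5, 5.0, 2.0]))
```

The last input is a made-up student used to show the False case: the score is below 6 and neither outlier condition holds.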

Task 2

Call calculate_score_improved() from your main() and output each student’s information (name, SAT, GPA, interest score, and high school quality) to a new file called better_improved.csv if calculate_score_improved() returned True for them.

Part 5 - GPA Checker (15 points)

A single GPA score is not a full picture of a student’s academic performance, as it may have improved over time or included outlier courses or semesters. A more context-sensitive algorithm could consider a student’s entire transcript and check, for example, for a single class score that is more than two letter grades (20 points) lower than all other scores. For this task, you will use the second half of the data for each student in the provided file (the four semester grades).

Task 1

Write a function grade_outlier() that takes in a list of grades (of any length) and returns True if one single number is more than 20 points lower than all other numbers; otherwise, False.

Example:

Input: [99, 94, 87, 89, 56, 78, 89]

Hint: Sort the list from lowest to highest, and check for the difference between the two lowest grades.

78 - 56 = 22; 22 > 20

Output: True
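Following the hint, one possible sketch is shown below. Returning False for lists with fewer than two grades is an assumption; the assignment does not say what should happen in that case:

```python
def grade_outlier(grades):
    """Return True if the lowest grade is more than 20 below the next lowest."""
    if len(grades) < 2:
        return False  # assumption: nothing to compare against
    # sorted() returns a new list, so the caller's list is not modified.
    ordered = sorted(grades)
    return ordered[1] - ordered[0] > 20


print(grade_outlier([99, 94, 87, 89, 56, 78, 89]))
print(grade_outlier([95.0, 86.0, 91.0, 94.0]))
```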

Next, consider the data that we have: a list of grades for each student, one grade per semester for four semesters.

Make sure that your grade_outlier() function works by calling it for every row in the second dataset. Print out an informative message about which students have a single grade outlier. You'll delete this later but it's a great way of testing your function!

Finally, consider the importance of an algorithm being able to flag students who might have a lower overall GPA but have shown improvement over time.

Task 2

Create a function grade_improvement() that returns True if the average score of each semester is higher than or equal to each previous semester and False otherwise.

Hint: investigate how the == operator works between two lists and think about using Python's sorted() function.
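The hint can be read as follows: a list whose values never decrease is already in sorted order, so it equals its own sorted copy. A minimal sketch under that reading:

```python
def grade_improvement(grades):
    """Return True if each grade is >= the one before it."""
    # == compares two lists element by element, in order.
    return grades == sorted(grades)


print(grade_improvement([80.0, 85.0, 85.0, 90.0]))
print(grade_improvement([95.0, 86.0, 91.0, 94.0]))
```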

Task 3

Using what you've just learned about grades, combine the conditions from the previous problems with grade_outlier() and grade_improvement() to choose a student if they either have a score of 6 or greater, or have a score of 5 or greater and at least one of the following is true:

is_outlier() returns True
grade_outlier() returns True
or grade_improvement() returns True

Write the students who fit this description to composite_chosen.csv, one name per line.
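The combined condition can be sketched as a single boolean expression. The helper composite_chosen() and its parameters are hypothetical; it takes results you have already computed for one student rather than calling your functions itself:

```python
def composite_chosen(score, outlier, has_grade_outlier, improved):
    """Return True if the student should be written to composite_chosen.csv."""
    # Parentheses make the grouping explicit: the three flags only
    # matter when the score is at least 5.
    return score >= 6 or (score >= 5 and (outlier or has_grade_outlier or improved))


print(composite_chosen(7.7255, False, False, False))
print(composite_chosen(5.7875, True, False, False))
print(composite_chosen(5.5, False, False, False))
print(composite_chosen(4.9, True, True, True))
```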

Important: Make sure to close all the files you opened in your program. Not doing so will break the autograder.

Turn in your work

You'll submit your admissions.py file on Canvas via Gradescope, where it will be checked by the autograder. We will be testing your program using a file with the same format but different data! Make sure that you haven't "hard coded" anything specific to your data. We do not guarantee that all scenarios are tested by the code that we have provided you.