Project Proposal - Predicting Student Loan Debt
- teamlosdatos
- Mar 10, 2018
Updated: Mar 21, 2018

1. Project Description
Is college education the best investment after high school? How much will that investment cost? If student loan debt is inevitable, how much can I expect to pay given my demographic? We hope to answer these questions and more throughout this project by performing a detailed analysis on a dataset collected and published by Adam Looney at Brookings, a nonprofit public policy organization based in Washington, DC.
Issues with student loans and financial aid touch nearly every college applicant across the nation, and tuition has risen sharply over the last 50 years. In this project, we hope to predict several key indicators, such as Overall Outstanding Principal Balance and Overall Repayment Rate, for the schools in the dataset. Additionally, we hope to predict student loan debt from indicators such as the parents' adjusted gross income (AGI), or an independent student's AGI, at the time of applying to college.
The beneficiaries of this project will be any college applicant looking for student loan information about a given school. The project will also give applicants an idea of what to expect regarding student loans at an institution, given their financial makeup.
2. Project Goals
The primary objectives of this project are to classify each school's repayment risk, determine whether the provided attribute data can predict the likelihood of student loan repayment and, if such a predictive capability exists, create a predictive model. Each of these tasks is expanded upon in the following sections.
2.1 Classification of Repayment Risk
The following data will be used to determine the repayment risk of each institution.
Group 1: Overall Outstanding Principal Balance
Group 2: Overall Repayment Rate
Group 3: % Increased Balance Borrowers
Group 4: Defaulted Balance
Linear regression analysis can be used to classify each school’s risk level. If we find the line of best fit for each group, the sign of its slope will indicate potential risk according to Table 1.
Table 1: Slope Risk Assessment
Group      Risky slope    Not risky slope
Group 1    Positive       Negative
Group 2    Negative       Positive
Group 3    Positive       Negative
Group 4    Positive       Negative
From here, schools can be classified as risky or not risky. Empirical analysis will be used to determine the best method of classification.
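The slope rule in Table 1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the x-axis is assumed to be time (cohort years), which the proposal does not specify, and the group keys, helper names, and numbers are all made up rather than taken from the dataset.

```python
import numpy as np

# Sign of the slope that marks a school as "risky" for each group (Table 1).
# The dictionary keys are hypothetical names for Groups 1-4.
RISKY_SLOPE_SIGN = {
    "outstanding_balance": 1,     # Group 1: positive slope is risky
    "repayment_rate": -1,         # Group 2: negative slope is risky
    "pct_increased_balance": 1,   # Group 3: positive slope is risky
    "defaulted_balance": 1,       # Group 4: positive slope is risky
}


def fit_slope(x, y):
    """Return the slope of the least-squares line of best fit through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centering x leaves the slope unchanged and keeps the fit well conditioned.
    slope, _intercept = np.polyfit(x - x.mean(), y, deg=1)
    return float(slope)


def is_risky(group, x, y):
    """Apply the Table 1 rule: risky when the slope has the group's risky sign.

    A slope of exactly zero is treated as not risky here; Table 1 does not
    cover that case, so this is an assumption.
    """
    return fit_slope(x, y) * RISKY_SLOPE_SIGN[group] > 0


# Made-up values for one hypothetical school (assumed x-axis: cohort year).
years = np.array([2009, 2010, 2011, 2012, 2013])
balance = np.array([1.0, 1.2, 1.5, 1.9, 2.4])           # rising balance
repayment = np.array([0.62, 0.64, 0.67, 0.69, 0.72])    # rising repayment rate

print(is_risky("outstanding_balance", years, balance))
print(is_risky("repayment_rate", years, repayment))
```

In this toy case the rising balance is flagged as risky under Group 1, while the rising repayment rate is not risky under Group 2. How the four per-group flags combine into a single label per school is left open, as the proposal leaves that choice to empirical analysis.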
2.2 Predictive Capability Measurement
After the schools have been classified, the predictive capability of the attributes will need to be measured. We will do this by correlating each attribute of the data with the risk classification. If a meaningful correlation, positive or negative, exists between an attribute and the risk classification, then that attribute can be used to predict a student's probability of repaying a loan given certain circumstances.
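One way to measure this is to compute the Pearson correlation between each attribute and a 0/1 risk label, which is the point-biserial correlation. The sketch below uses pandas. The column names, the values, and the 0.5 cutoff are all hypothetical and chosen only for illustration; they are not the dataset's actual fields or a recommended threshold.

```python
import pandas as pd

# Made-up per-school rows; column names are hypothetical, not the dataset's.
schools = pd.DataFrame({
    "median_parent_agi": [35_000, 42_000, 88_000, 95_000, 51_000, 120_000],
    "pct_pell":          [0.71, 0.65, 0.22, 0.18, 0.55, 0.10],
    "median_age":        [27, 24, 21, 22, 25, 21],
    "risky":             [1, 1, 0, 0, 1, 0],  # 1 = risky, 0 = not risky
})


def rank_attributes(df, label="risky", threshold=0.5):
    """Correlate every attribute with the risk label and keep the strong ones.

    Returns a Series of correlations whose absolute value is at least
    `threshold`, ordered from strongest to weakest.
    """
    corr = df.drop(columns=label).corrwith(df[label])
    strong = corr[corr.abs() >= threshold]
    order = strong.abs().sort_values(ascending=False).index
    return strong.reindex(order)


print(rank_attributes(schools))
```

A negative value means the attribute tends to be lower at risky schools, and a positive value means it tends to be higher. Correlation only captures linear association, so a weak value here does not rule out a nonlinear relationship.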
2.3 Predictive Model Creation
Finally, if enough correlation exists, a predictive model can be created to predict a student's probability of repaying a loan given dependency status (dependent or independent), AGI, age, and need for financial aid, or some mix of these. This predictive model will use supervised learning and will be built with a Decision Tree Classifier.
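A minimal sketch of such a model with scikit-learn's DecisionTreeClassifier follows. The borrowers are synthetic, the feature encoding is hypothetical, and the labelling rule (repayment tied to an AGI cutoff) is invented purely so the tree has a pattern to learn. It says nothing about what the real data will show.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic borrowers (NOT real data); the feature encoding is hypothetical.
is_dependent = rng.integers(0, 2, size=n)    # 1 = dependent, 0 = independent
agi = rng.uniform(10_000, 150_000, size=n)   # adjusted gross income
age = rng.integers(18, 45, size=n)
needs_aid = rng.integers(0, 2, size=n)       # 1 = needs financial aid
X = np.column_stack([is_dependent, agi, age, needs_aid])

# Toy labelling rule, invented for this example: 1 = repays, 0 = does not.
y = (agi > 60_000).astype(int)

# Hold out a quarter of the rows to estimate accuracy on unseen borrowers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A shallow tree keeps the learned rules easy to inspect.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)

# predict_proba columns follow model.classes_, so column 1 is P(repays).
applicant = np.array([[1, 85_000, 19, 1]])
p_repay = model.predict_proba(applicant)[0, 1]

print(f"held-out accuracy: {accuracy:.2f}, P(repay) = {p_repay:.2f}")
```

The predicted probability is the share of training samples of each class in the leaf the applicant falls into, so shallow trees give coarse probabilities. Depth and other settings would need tuning against the real data.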
3. Dataset Description
The dataset to be used in this project is a series of data tables containing higher education student financial data produced by Federal Student Aid (FSA), an office of the U.S. Department of Education. The data covers populations of undergraduate, graduate and parent borrowers and their respective loan amounts borrowed and repaid. These are tabulated by institution of origin, meaning all the data, including repayment rates, are aggregated per institution. In addition to final balance, loan and repayment information, other example attributes of the data include the following: ethnic class, percentage of completion, counts of independent and dependent borrowers, Adjusted Gross Income (AGI) for independent, dependent and parent borrowers, borrowers with Pell grants, median ages of borrowers, etc.
4. Tools
The following is a list of tools that will be used throughout the project. Tools or packages may be changed or added depending on project needs.
Anaconda Navigator
Jupyter Notebook
Python programming language and the following packages:
NumPy, pandas, Matplotlib, SciPy, scikit-learn
5. Literature Review
5.1 A Risk Sharing Proposal for Student Loans
The policy proposal titled "A Risk Sharing Proposal for Student Loans" by Tiffany Chou, Adam Looney and Tara Watson sets out "to introduce new and effective policy options" that improve economic opportunity for students and promote long-term prosperity. The proposal acknowledges the difficulty students have in repaying their loans at specific institutions. It proposes "an institutional accountability system" to align the incentives of institutions with those of students and taxpayers. The paper suggests various "risk-sharing" methods to address poor loan performance in a mutually beneficial way, supporting both students and the institutions that serve low-income students while also appropriately reimbursing the federal loan programs.
The following is a link to the article:
http://www.hamiltonproject.org/assets/files/risk_sharing_proposal_student_loans_pp.pdf
5.2 Is High Student Loan Debt Always a Problem?
This policy brief was written by Constantine Yannelis and Adam Looney in 2016 in conjunction with the Stanford Institute for Economic Policy Research. The brief summarizes key findings from the dataset. The primary finding is that students with high loan balances also tend to earn more, so the loan balance alone will not give an accurate picture of a student’s financial health. Contrary to our initial hypothesis, students with low loan balances (less than $10,000) default on their loans five times as often as their high-balance counterparts. This makes sense, as high-balance borrowers typically attended very selective institutions and likely went on to earn a graduate degree (e.g., PhD, JD, MBA).
“Labor market outcomes, like unemployment or low earnings, provide a more direct measure of economic hardship” (Looney & Yannelis, 2016). The brief suggests students should spend more time researching the employment opportunities and potential income in their field of study. A $50,000 investment may be worth it if you have the potential to make $110,000 upon graduation.
The following is a link to the article:
http://siepr.stanford.edu/sites/default/files/publications/PolicyBrief-July16.pdf
5.3 A Crisis in Student Loans? How Changes in the Characteristics of Borrowers and in the Institutions They Attended Contributed to Rising Loan Defaults
This article, written by Adam Looney and Constantine Yannelis in 2015, “examines the rise in student loan delinquency and default” [A Crisis in Student Loans...] by analyzing data gathered from the U.S. Department of Education. The data draws on earnings records from tax filings and describes federal student borrowing habits. The article finds that the increase in student loan defaults is a consequence of more borrowers attending for-profit schools, non-selective schools, and community colleges. Of these institutions, for-profit and non-selective schools are primarily responsible for the increase.
The following is a link to the article:
https://www.brookings.edu/wp-content/uploads/2015/09/LooneyTextFall15BPEA.pdf