Midway Report - Student Loan Prediction and Repayment Risk

teamlosdatos
Mar 10, 2018

Updated: May 1, 2018

What are the potential risks associated with student debt and what are the best methods of mitigation?

1. INTRODUCTION

Is college education the best investment after high school? How much will that investment

cost? If student loan debt is inevitable, how much can I expect to pay given the institution I

attend? We hope to answer these questions and more throughout this project by

performing a detailed analysis on a dataset used by Adam Looney at Brookings, a nonprofit

public policy organization based in Washington, DC.

Issues with student loans and financial aid touch nearly every college applicant across the nation, and tuition has risen steeply over the last 50 years. In this project we aim to predict several key indicators, such as Overall Outstanding Principal Balance and Overall Repayment Rate, for the schools provided in the dataset. Additionally, we hope to predict student loan debt given indicators like the parent’s adjusted gross income (AGI) or an independent student’s AGI when applying to college.

The beneficiaries of this project will be any college applicant looking to find information

about student loan data given a school. This project will also give applicants an idea of

what may be expected, given their financial makeup, regarding student loans at an

institution.

1.1 Project goals

The primary objectives of this project are to...

• Classify the risk of borrowing student loans at various post-secondary institutions

• Determine how well the provided attribute data can be used to reason about and predict the likelihood of student loan repayment

• Create a model to predict the loan repayment risk associated with attending a

particular school

Each of these tasks will be expanded upon in the following sections.

2. RELATED WORK

The following sections summarize related work by others.

2.1 A Risk Sharing Proposal for Student Loans

The policy proposal "A Risk Sharing Proposal for Student Loans" by Tiffany Chou, Adam Looney and Tara Watson sets out "to introduce new and effective policy options" that improve economic opportunity for students and promote long-term prosperity. It acknowledges that students at specific institutions have difficulty repaying their loans, and proposes "an institutional accountability system" to align the incentives of institutions with those of students and taxpayers. The paper suggests various "risk-sharing" methods to analyze and overcome poor loan performance in a mutually beneficial way, supporting both students and the institutions that serve low-income students while appropriately reimbursing the federal loan programs.

The following is a link to the article:

http://www.hamiltonproject.org/assets/files/risk_sharing_proposal_student_loans_pp.pdf

2.2 Is High Student Loan Debt Always a Problem?

This policy brief was written by Constantine Yannelis and Adam Looney in 2016 in conjunction with the Stanford Institute for Economic Policy Research. It summarizes key findings from the dataset. The primary finding is that students with high loan balances also tend to earn more, so the loan balance alone does not give an accurate picture of a student’s financial health. Contrary to our initial hypothesis, students with low loan balances (less than $10,000) default on their loans five times as often as their high-balance counterparts. This makes sense, as high-balance borrowers typically attended very selective institutions and likely went on to earn a graduate degree (e.g., PhD, JD, MBA).

“Labor market outcomes, like unemployment or low earnings, provide a more direct

measure of economic hardship” (Looney, A., & Yannelis, C. (2016)). This briefing suggests

students should spend more time researching their field of study’s opportunities for

employment and potential income in that field. A $50,000 investment may be worth it if you

have the potential to make $110,000 upon graduation.

The following is a link to the article:

http://siepr.stanford.edu/sites/default/files/publications/PolicyBrief-July16.pdf

2.3 A Crisis in Student Loans? How Changes in the Characteristics of Borrowers and in

the Institutions They Attended Contributed to Rising Loan Defaults

This article, written by Adam Looney and Constantine Yannelis in 2015, “examines the rise in student loan delinquency and default” [A Crisis in Student Loans...] by analyzing data gathered from the U.S. Department of Education. The data draws on earnings records derived from tax records and describes federal student borrowing habits. The research found that the increase in student loan defaults is a consequence of borrowers attending for-profit schools, non-selective schools, and community colleges. Of these institutions, for-profit and non-selective schools are primarily responsible for the increase.

The following is a link to the article:

https://www.brookings.edu/wp-content/uploads/2015/09/LooneyTextFall15BPEA.pdf

3. DATA SET AND FEATURES

The data set used in this project is a series of data tables containing higher education student financial data produced by Federal Student Aid (FSA), an office of the U.S. Department of Education. The data covers populations of undergraduate, graduate and parent borrowers, as well as the amounts they borrowed and repaid. The tables are organized by institution of origin, meaning that all the data, including repayment rates, is aggregated per institution. Other attributes of the data set include ethnic class, percentage of completion, counts of independent and dependent borrowers, Adjusted Gross Income (AGI) for independent, dependent and parent borrowers, borrowers with a Pell grant, and median ages of borrowers.

3.1 Tools

The following is a list of tools that will be used throughout the project. Tools or packages

may be changed or added depending on project needs.

• Excel

• Anaconda

• Jupyter Notebook

• Python Programming Language and the following packages:

o NumPy, pandas, Matplotlib, SciPy, scikit-learn, Orange

4. METHODS AND MODELS

4.1 Classification of Repayment Risk

The following data will be used to determine the repayment risk for each institution.

• Group 1: Overall Outstanding Principal Balance

• Group 2: Overall Repayment Rate

• Group 3: % Increased Balance Borrowers

• Group 4: Defaulted Balance

Group 1 consists of the following data items, and describes the outstanding principal

balance of each school as of September of the respective year. If an institution has a history

of less risky lending habits, it is expected that these values should decrease over time.

• Overall Outstanding Principal Balance 5 YR Cohort (FY 2014)

• Overall Outstanding Principal Balance 4 YR Cohort (FY 2013)

• Overall Outstanding Principal Balance 3 YR Cohort (FY 2012)

• Overall Outstanding Principal Balance 2 YR Cohort (FY 2011)

• Overall Outstanding Principal Balance 1 YR Cohort (FY 2010)

Group 2 consists of the following data items, and describes the share of aggregate balance

entering repayment repaid by cohort of the respective year. If an institution has a history of

less risky lending habits, it is expected that this value should increase over time.

• Overall Repayment Rate 5 YR Cohort (FY 2014)

• Overall Repayment Rate 4 YR Cohort (FY 2013)

• Overall Repayment Rate 3 YR Cohort (FY 2012)

• Overall Repayment Rate 2 YR Cohort (FY 2011)

• Overall Repayment Rate 1 YR Cohort (FY 2010)

Group 3 consists of the following data items, and describes the share of borrowers whose

current principal balance exceeds original principal balance. If an institution has a history of

less risky lending habits, it is expected that these values should decrease over time.

• % Increased Balance Borrowers 2013-2014

• % Increased Balance Borrowers 2012-2013

• % Increased Balance Borrowers 2011-2012

• % Increased Balance Borrowers 2010-2011

• % Increased Balance Borrowers 2009-2010

Group 4 consists of the following data items, and describes the balance of loans currently in

default. If an institution has a history of less risky lending habits, it is expected that these

values should decrease over time.

• Defaulted Balance 2013-14

• Defaulted Balance 2012-13

• Defaulted Balance 2011-12

• Defaulted Balance 2010-11

• Defaulted Balance 2009-10

Linear regression can be used to classify each school’s risk level. For each group, we find the line of best fit where X is the year and Y is the school’s value for that group in that year. The slope of each line then indicates potential risk according to Table 1.

Table 1: Slope Risk Assessment

From here, schools can be classified as risky or not risky. Empirical analysis will be used to

determine the best method of classification, as well as the different levels of risk that will be

used for final classification.
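As a minimal sketch of the slope calculation described above, the snippet below fits a least-squares line to one school's yearly values with NumPy. The function name `trend_slope` and the sample repayment rates are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np


def trend_slope(years, values):
    """Return the slope of the least-squares line through (year, value) points."""
    # Shift years so they start at 0; this leaves the slope unchanged
    # and keeps the fit numerically well conditioned.
    x = np.asarray(years, dtype=float) - years[0]
    y = np.asarray(values, dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)


# Hypothetical Overall Repayment Rate values for one school, FY 2010 to FY 2014.
years = [2010, 2011, 2012, 2013, 2014]
repayment_rate = [0.10, 0.12, 0.15, 0.17, 0.20]

slope = trend_slope(years, repayment_rate)
print(f"Repayment rate changes by {slope:.3f} per year")
```

The same function can be applied to each of the four groups for every school, giving four slopes per institution to feed into the risk assessment.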

4.2 Predictive Reasoning Capability Measurement

After the schools have been classified, the predictive reasoning capability of the attributes will need to be measured. Whether the cause of a prediction can be derived will be determined by correlating each attribute of the data with the risk classification. If a positive or negative correlation exists between an attribute and the risk classification, then that attribute can be used to reason about a student's probability of repaying a loan given certain circumstances. For this dataset, the circumstances are defined by the attributes of the school.
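One way to measure this is to compute the Pearson correlation between each attribute and a numeric risk score. The sketch below uses pandas for that; the column names, the sample values, and the use of a numeric `risk_score` column are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-school data: two attributes and a numeric risk score.
schools = pd.DataFrame(
    {
        "median_parent_agi": [40_000, 55_000, 70_000, 85_000, 100_000],
        "pct_pell": [0.60, 0.50, 0.45, 0.30, 0.20],
        "risk_score": [4, 6, 7, 9, 10],
    }
)

# Pearson correlation (r-value) of every attribute with the risk score.
attributes = schools.drop(columns="risk_score")
r_values = attributes.corrwith(schools["risk_score"])
print(r_values.sort_values(ascending=False))
```

Attributes whose r-value is close to 0 carry little linear information about the risk score, while values closer to 1 or -1 suggest the attribute may help explain a prediction.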

4.3 Predictive Model Creation

Finally, a predictive model can be created to predict a student's probability of repaying a loan given attendance at a particular post-secondary school. The model will use supervised learning, specifically a Decision Tree Classifier.
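The following is only a rough sketch of what such a model might look like with scikit-learn's `DecisionTreeClassifier`. The two features, the risk labels, and the tiny training set are invented for illustration, and the final feature set and tree settings for the project have not been decided.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per school: [median parent AGI, share of borrowers with a Pell Grant].
X = [
    [30_000, 0.70],
    [35_000, 0.65],
    [40_000, 0.60],
    [90_000, 0.20],
    [95_000, 0.15],
    [100_000, 0.10],
]
# Hypothetical borrower risk labels for the same schools.
y = ["high", "high", "high", "low", "low", "low"]

# A shallow tree keeps the learned rules easy to read.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Predict the risk class of a school that was not in the training data.
prediction = model.predict([[38_000, 0.62]])[0]
print(prediction)
```

In practice the data would be split into training and test sets so that the model's accuracy can be estimated on schools it has not seen.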

5. RESULTS AND DISCUSSION

The following sections detail the results of each task defined in the Methods section.

5.1 Classification of Repayment Risk

5.1.1 Data Preprocessing

Before classification could be performed, the data set had to be cleaned. All values that were declared as above or below the required threshold values of the attribute were set to 0 or 1, respectively.

Additionally, Group 1 (Overall Outstanding Principal Balance) and Group 4 (Defaulted Balance) were averaged to reduce the bias introduced by differences in borrower population size across universities.

5.1.2 Borrower Risk Point Assignment Calculation

The slope for each Group over time was then calculated in Excel. The quartiles for each

Group are in the following tables.

Table 2: Average Outstanding Principal Balance Slope Quartiles (Group 1)

Table 3: Repayment Rate Slope Quartiles (Group 2)

Table 4: % Increased Balance Borrowers Slope Quartiles (Group 3)

Table 5: Defaulted Balance Slope Quartiles (Group 4)

Each quartile range was assigned a point value as shown in Table 6.

Table 6: Quartile Point Assignment
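The quartile-to-points step could be reproduced in Python roughly as follows. This is a sketch under assumptions: the slopes are made up, and the point values 1 to 4 (lowest slope quartile first) are placeholders rather than the values actually used in Table 6.

```python
import pandas as pd

# Hypothetical slopes for eight schools within a single Group.
slopes = pd.Series(
    [-0.04, -0.02, -0.01, 0.00, 0.01, 0.02, 0.03, 0.05],
    index=[f"school_{i}" for i in range(8)],
)

# Assumed points per quartile, lowest slope quartile first.
quartile_points = [1, 2, 3, 4]

# qcut splits the slopes into four equal-sized bins and labels each bin with its points.
points = pd.qcut(slopes, q=4, labels=quartile_points).astype(int)
print(points)
```

For a Group where a larger slope indicates lower risk, such as the repayment rate, the label order would presumably be reversed.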

The Borrower Risk Assignment for each school was then calculated with the following

equation.

5.1.3 Borrower Risk Point Assignment Results

The distribution of the Borrower Risk Assignment is shown in Figure 1.

Figure 1: Borrower Risk Assignment Distribution

Quartiles of the Borrower Risk Assignment were used to assign the final classification of

Borrower Risk for each school, as shown in Table 7.

Table 7: Borrower Classification

5.2 Predictive Reasoning Capability Measurements

Using the repayment risk calculated above, the next step is to use linear regression to

visualize the relationship between risk and the following attributes:

• % Completions Any School

• % Completions Same School

• % Independent Borrower Count

• % Dependent Borrower Count

• Median Independent Student AGI

• % Independent Borrowers with AGI < $30K

• Median Dependent Parent AGI

• % Dependent Borrowers with AGI < $30K

• % Borrowers with a Pell Grant

• Median Age of Dependent Borrowers at Maturity

• Median Age of Independent Borrowers at Maturity

• Mean Balance

• Median Balance

5.2.1 Correlation Analysis

After completing the classification of repayment risk, a correlation analysis was completed

on the attributes listed above in the dataset. The attributes with moderate positive

correlation were:

• Median Dependent Parent AGI

o R-Value = 0.39

• % Borrowers without a Pell Grant

o R-Value = 0.47

Figure 2 and Figure 3 are scatter plots with trendlines of the calculated risk vs Median

Dependent Parent AGI and % Borrowers without a Pell Grant respectively.

Figure 2: Risk Factor vs Median Dependent Parent AGI

Figure 3: Risk Factor vs % Borrowers without a Pell Grant

Attributes with a very weak positive correlation were:

• % Dependent Borrower Count

o R-Value = 0.21

• Mean Balance

o R-Value = 0.22

Figure 4 and Figure 5 are scatter plots with trendlines of the calculated risk vs % Dependent

Borrower Count and Mean Balance respectively.

Figure 4: Risk Factor vs % Dependent Borrower Count

Figure 5: Risk Factor vs Mean Balance

Attributes showing a very weak negative correlation include:

• % Dependent Borrowers with AGI < $30k

o R-Value = -0.25

Figure 6 is a scatter plot with a trendline of the calculated risk vs % Dependent Borrowers

with AGI < $30k.

Figure 6: Risk Factor vs % Dependent Borrowers with AGI < $30k

5.2.2 Association Analysis

In addition to correlation analysis, association analysis was performed to determine the relationship between the risk classification and the non-numerical attributes in the data set. The non-numerical attributes describing the institutions include eligibility for financial aid, certification, type and control, and the newly calculated borrowing risk classification.

The Orange Python package was used to perform this analysis. The rules were generated from the data using a minimum support of 0.1 and a minimum confidence of 0.2. Using Orange’s capabilities, the lift measure was also computed for each of the rules generated. Lift was the predominant measure in this portion of the project, since it measures how dependent (correlated) the events in a rule are. The association rules were also sorted for clear visualization. The following is the resulting table of association rules.

Table 8: Association Rules for Borrowing Risk
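To make the three rule measures concrete, the sketch below computes support, confidence and lift for a single rule in plain Python. It does not use Orange, and the helper name `rule_metrics` and the sample schools are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    antecedent, consequent = set(antecedent), set(consequent)
    # Count the transactions that contain each side of the rule, and both sides together.
    n_a = sum(antecedent <= t for t in transactions)
    n_c = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n             # P(A and C)
    confidence = n_both / n_a        # P(C | A)
    lift = confidence / (n_c / n)    # P(C | A) / P(C)
    return support, confidence, lift


# Hypothetical schools, each described by a set of categorical items.
schools = [
    {"control=for-profit", "risk=high"},
    {"control=for-profit", "risk=high"},
    {"control=for-profit", "risk=low"},
    {"control=public", "risk=low"},
    {"control=public", "risk=low"},
    {"control=public", "risk=high"},
    {"control=private", "risk=low"},
    {"control=private", "risk=low"},
]

support, confidence, lift = rule_metrics(schools, {"control=for-profit"}, {"risk=high"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two sides of the rule occur together more often than they would if independent, a lift below 1 means less often, and a lift near 1 suggests no association.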

Based on the initial conditions, the type and control attribute was not found to be part of any strong association rule. In other words, in this initial assessment, type and control will not be a deciding attribute in our prediction model. The remaining attributes, including eligibility for aid and certification, were found to be part of strong association rules, with lifts that indicate either negative or positive correlation depending on the rule. The next step in this portion of the project will be to carefully analyze each of the association rules to see what further details can be extracted.

5.3 Predictive Model Creation

The predictive model has not yet been created. The initial steps for its creation are being examined for further refinement.

6. CONCLUSION

The conclusion will be completed as soon as the overall project reaches a considerable level

of maturity and completion.
