The Machine Learning Lifecycle


April 27, 2020

I am super excited to discuss my most recent project. I have been fortunate enough to complete a Python course for my MSF degree at the University of Utah. While I have experience in Python, I thought the course would be a nice way to apply my skills in a finance context, something I hadn't had the chance to do in other courses or personal pursuits.

For this course, we were assigned a final project revolving around credit risk modeling. For those who may not be familiar, let's walk through the basics. Credit risk modeling, roughly, is the practice of predicting the outcome of a particular loan. Loans require an individual to have credit, hence the 'credit' in credit risk modeling. The practice involves evaluating variables related to the individual, the loan, and the loan issuer in an attempt to predict whether a loan will "Charge-Off" (default) or be "Paid-in-Full". This is a rather common way to teach students classification prediction and algorithms like decision tree classifiers, random forest classifiers, and logistic regression. Thanks to my personal study habits, I had already learned and implemented such models in other coursework, so the project itself was a repetition of prior material, which led me to want to challenge myself.

A little background is warranted before we move on. At the beginning of 2020, I interviewed for a quantitative developer role at a local bank. That role would have had me building models on this exact subject and implementing them across the organization by the end of the year. Well, the position wasn't offered to me. So, for the most non-spiteful and entirely self-motivated of reasons, I decided to knock this credit risk modeling project out of the park. You know, just to show that I can do everything the position requested and probably more.

The following materials cover the entire machine learning lifecycle. I took raw data, performed initial data analysis, iteratively developed an adequate model, and then moved the model into a production environment. This entire process was intensive and won't be described in full here; if you're interested in the whole story, check out my project report. What I will discuss is mainly an overview of my model and the production implementation, since that was the biggest challenge for me.

To begin, an overview of the dataset is warranted. The data was provided by my professor, Dr. Brent Albrecht, and comes split into two sets: a train set and a test set with 10,000 and 1,000 loan records respectively. I split the train set further, which resulted in a "Train" and a "Validation" set. This split allowed me to train models and test their performance on the validation set (a sketch of the split follows the list below). This is an essential step in data science for two reasons.

  • 1) Evaluating models on a validation set allows for a form of cross-validation: using "labeled" data to gauge how well a model has been trained. If I were to build models and judge them straight against the test set, I would be shooting in the dark; I wouldn't be able to truly see how my model is performing.

  • 2) Tuning against the test data would be like looking into the future. I don't truly know how the test set loans perform; they are unlabeled. Without labels on those loans, I would only be fitting a model around some data, not building a model that does well at explaining the data.
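As a concrete illustration of that split, here's a minimal sketch using scikit-learn. The file name and the `loan_status` label column are hypothetical stand-ins for the course dataset's actual names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the labeled loan records (hypothetical file name for illustration).
loans = pd.read_csv("loans_train.csv")

# Separate the features from the label we want to predict
# (the label column name is an assumption).
X = loans.drop(columns=["loan_status"])
y = loans["loan_status"]  # e.g. "Charged Off" vs. "Fully Paid"

# Hold out 20% of the labeled records as a validation set so model
# performance can be measured on loans the model never saw during training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

Stratifying on the label keeps the charge-off rate roughly equal across the two splits, which matters when one class is much rarer than the other.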

After cleaning and splitting my data, I created five (5) models based on two different variable sets. Among my iterations I trained logistic regression, decision tree, and random forest classifiers, ultimately settling on a random forest classifier as my final model. Again, if you want to see the meat of the project, head over to my project report.
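To give a feel for those iterations, here's a hedged sketch of fitting the three model families and comparing them on the validation set. It reuses the `X_train`/`X_val` split from the sketch above, assumes the features have already been encoded as numbers, and uses illustrative hyperparameters rather than my actual settings (those are in the project report).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# The three model families tried during iteration (illustrative settings).
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Fit each candidate on the training split and score it on the held-out
# validation split, which none of the models saw during fitting.
for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    print(f"{name}: validation accuracy = {accuracy_score(y_val, preds):.3f}")
```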

After an immense development process, I had a model ready to use. For the course project, I had to run my trained model on the true test set. After doing this, I wanted to go a step further. I had a model, that's great, but now what? In all my time learning data science, I've seldom been taught how to deploy a model so that an actual user can benefit from it. Well, being the savvy Django developer I am, it was a no-brainer to create a web application to house my model.

So, that's what I did. I created an entire web application, hosted it on Heroku, built a UI that lets a user select variables related to a loan, and wrote a function that spits out a probability for the user's inputs. To be honest, it's pretty badass. I've never been able to make something quite like this final product and I'm enthralled. Essentially, the app takes user input, feeds it to the pre-trained model I mentioned before, and offers up its opinion on the default probability of the loan. What's even more awesome is that the inputs users provide are saved, along with the predicted probabilities, to a PostgreSQL database. This would let me monitor the prediction quality of the app over time if I could continuously track the outcomes of the loans entered into the application. Check out the app using the link below.
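To give a feel for the wiring, here's a minimal sketch of how a Django view like this can work. Every name in it (the pickle path, the form fields, the `Prediction` model) is a hypothetical stand-in rather than my actual code; the real app is at the link below.

```python
import pickle

from django.shortcuts import render

from .models import Prediction  # hypothetical Django model backed by PostgreSQL

# Load the pre-trained random forest once at import time (hypothetical path).
with open("credit_model.pkl", "rb") as f:
    MODEL = pickle.load(f)


def predict_view(request):
    if request.method == "POST":
        # Collect the loan variables the user selected in the UI
        # (field names are assumptions for illustration).
        features = [[
            float(request.POST["loan_amount"]),
            float(request.POST["interest_rate"]),
            float(request.POST["annual_income"]),
        ]]
        # Probability assigned to the default class; the class label
        # "Charged Off" is an assumption about the training data.
        classes = list(MODEL.classes_)
        prob_default = MODEL.predict_proba(features)[0][classes.index("Charged Off")]

        # Persist the inputs alongside the prediction so model quality
        # can be monitored over time.
        Prediction.objects.create(
            loan_amount=features[0][0],
            interest_rate=features[0][1],
            annual_income=features[0][2],
            prob_default=prob_default,
        )
        return render(request, "result.html", {"prob_default": prob_default})
    return render(request, "predict.html")
```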

Here's the link to the web app.

Overall, this was an awesome task for me. I completed the entire development lifecycle of a machine learning model. That carries a lot of weight for me since I'm aiming to land a Machine Learning Engineer role come graduation in December of 2020. Yes, that was a plant for any potential recruiter who stumbles upon this post; sue me. Anyway, this is a super light-hearted discussion of something I am quite passionate about. I am proud to showcase it to you and hope you enjoyed reading about, and using, my credit risk application.

Cheers!

P.S. I started my own LLC! Check out inferencearchitects.com. We are data scientists and software engineers who help small and medium-sized businesses better manage their data, at a lower cost than hiring full-time developers. Let us know if you want to get a better handle on your data; it's what we love to do.