Exploring LendingTree Data with Python/SQL

One of the biggest hurdles I’ve found when teaching myself any sort of SQL/Python/Data Wrangling skills is the problem of finding usable, real life data to work with. Data that I can actually attempt to answer questions with.

There is no better way to learn than to be given a data set, in it’s raw form and then try to get the data into a form where you can do something fun with it.

Enter LendingTree open source data. Man, when I found this it was like data heaven. https://www.lendingclub.com/info/download-data.action

Here you will be given the option of downloading csv files of data from LendingTree, with a bunch of data about the loans given on that platform.

The first step was to download all 11 files over the different time spans. This is the real world right?

I didn’t want to sign in to get the full file or whatever, i just wanted the data, free, now. Bam, now i have 11 files full of data.

Next i just opened the smallest file and decided which columns I actually wanted to mess around with, what looked interesting, and I junked the rest (deleted). The original files are very wide with a bunch of junk, i figured
this would make things run a little faster.

In the end I got 1,765,426 rows all about loans done by LendingTree between 2007 and 2017.

I then just created a rough staging table in SQL Server to hold the columns that I wanted to hold the data from the csv files.

I always use large NVARCHAR columns to avoid any truncation or other data related problems. We can update data types later by altering columns. Plus I did some other obvious cleanup in SQL.

CREATE TABLE [dbo].[LendingTree](
	[Loan_Amount] [nvarchar](500) NULL,
	[Funded_Amount] [nvarchar](500) NULL,
	[Funded_Amount_Inv] [nvarchar](500) NULL,
	[Term] [nvarchar](500) NULL,
	[Interest_Rate] [nvarchar](500) NULL,
	[Installment] [nvarchar](500) NULL,
	[Grade] [nvarchar](500) NULL,
	[Home_Ownership] [nvarchar](500) NULL,
	[Annual_Income] [nvarchar](500) NULL,
	[Verification_Status] [nvarchar](500) NULL,
	[Issue_Date] [DATE] NULL,
	[Loan_Status] [nvarchar](500) NULL,
	[Address_State] [nvarchar](500) NULL,
	[Open_Accounts] [nvarchar](500) NULL,
	[Application_Type] [nvarchar](500) NULL
)
 
DELETE FROM [Configuration].[dbo].[LendingTree] WHERE Issue_Date = '1900-01-01'
ALTER TABLE [Configuration].[dbo].[LendingTree] ALTER COLUMN Loan_Amount BIGINT
ALTER TABLE [Configuration].[dbo].[LendingTree] ALTER COLUMN Funded_Amount BIGINT
UPDATE [Configuration].[dbo].[LendingTree]  SET Interest_Rate = REPLACE(Interest_Rate,'%','')
ALTER TABLE [Configuration].[dbo].[LendingTree] ALTER COLUMN Interest_Rate DECIMAL(18,2)
ALTER TABLE [Configuration].[dbo].[LendingTree] ALTER COLUMN Installment DECIMAL(18,2)
UPDATE [Configuration].[dbo].[LendingTree] SET Annual_Income = 0 WHERE Annual_Income = ''
ALTER TABLE [Configuration].[dbo].[LendingTree] ALTER COLUMN Annual_Income DECIMAL(18,2)
ALTER TABLE [Configuration].[dbo].[LendingTree] ALTER COLUMN Open_Accounts INT
 
CREATE NONCLUSTERED INDEX IDX_LendingTree ON dbo.LendingTree (Issue_Date,Loan_Amount) INCLUDE(Funded_Amount, Interest_Rate,Installment,Home_Ownership,Annual_Income,Address_State,Open_Accounts)

HaHa. Done with the easy part, now it’s on to the problem of actually getting the data from my csv files into SQL Server.

This is another reason I love Python, not only is it fun to use, easy to learn, but there is a million ways to write the same thing, and usually it is short and sweet. I’m no Python expert. Here is my code to import the files.

import os
import csv
import pyodbc
 
#list files in directory
def listFiles():
	files = os.listdir('C:\\Users\\Daniel.Beach\\Desktop\\LendingCLubData')
	return files
 
#read a row in a file
def fileToRow(fullfile):
	with open(fullfile, 'r') as f:
		reader = csv.reader(f)
		next(reader) #skip header
		for row in reader:
			yield row
 
#take a row and put it in SQL Server
def rowToSQL(row):
	cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=YourServer;DATABASE=YourDatabase;UID=UserName;PWD=Passwpord') #connect
	cursor = cnxn.cursor() #open cursor
	for column in row:
		la = column[0]
		fa = column[1]
		fai = column[2]
		term = column[3]
		ir = column[4]
		i = column[5]
		g = column[6]
		ho = column[7]
		ai = column[8]
		vs = column[9]
		id = column[10]
		ls = column[11]
		ads = column[12]
		oa = column[13]
		at = column[14]
		query = cursor.execute("INSERT INTO dbo.LendingTree VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",la,fa,fai,term,ir,i,g,ho,ai,vs,id,ls,ads,oa,at)
	cnxn.commit()
	print('file done')
 
def main():
	fileList = []
	l = listFiles()
	for f in l:
		directory = 'C:\\Users\\Daniel.Beach\\Desktop\\LendingCLubData'
		fullfile = os.path.join(directory,f)
		row = fileToRow(fullfile)
		rowToSQL(row)
	cnxn.close()
 
 
if __name__ == '__main__':
	main()

Now that the 1,765,426 rows are in my SQL Server, it’s quite easy to do some very simple analysis.

I choose to use Google’s open source Charts to display some of the queries.
https://developers.google.com/chart/ Try it sometime, very easy, fun, and free.

--Here are the SQL Code Snipits. 
--Loan Amounts by Year
SELECT SUM(Loan_Amount) Total_Loan_Amount,YEAR(Issue_Date) AS YEAR
FROM Configuration.dbo.LendingTree
GROUP BY YEAR(Issue_Date)
 
 
--Loan Amounts by Year and Month
SELECT SUM(Loan_Amount) Total_Loan_Amount,YEAR(Issue_Date) AS YEAR,MONTH(Issue_Date) AS MONTH
FROM Configuration.dbo.LendingTree
GROUP BY YEAR(Issue_Date),MONTH(Issue_Date)
ORDER BY YEAR(Issue_Date), MONTH(Issue_Date) ASC
 
--Running Sum Year over Year by Month
SELECT YEAR,MONTH,SUM(Total_Loan_Amount) OVER (PARTITION BY YEAR ORDER BY YEAR,MONTH ASC ROWS UNBOUNDED PRECEDING) AS RunningTotalbyYear
FROM (
	SELECT SUM(Loan_Amount) AS Total_Loan_Amount,YEAR(Issue_Date) AS YEAR,MONTH(Issue_Date) AS MONTH
	FROM Configuration.dbo.LendingTree
	GROUP BY YEAR(Issue_Date), MONTH(Issue_Date)
	) base
ORDER BY YEAR,MONTH
 
--Avg Loan Size
SELECT AVG(Loan_Amount) Avg_Loan_Amount,YEAR(Issue_Date) YEAR
FROM [Configuration].[dbo].[LendingTree]
GROUP BY YEAR(Issue_Date)
 
--Pivot Yr over Yr for Javascript Data Table Consumption
SELECT MONTH,ISNULL([2007],0)[2017],[2008],[2009],[2010],[2011],[2012],[2013],[2014],[2015],[2016],[2017]
FROM (
	SELECT MONTH(Issue_Date) MONTH,YEAR(Issue_Date) YEAR,SUM(Loan_Amount)Loan_Amount
	FROM Configuration.dbo.LendingTree
	GROUP BY MONTH(Issue_Date),YEAR(Issue_Date)
	) pvt
PIVOT (		
		SUM(Loan_Amount) 
		FOR YEAR IN ([2007],[2008],[2009],[2010],[2011],[2012],[2013],[2014],[2015],[2016],[2017])
	  ) AS p
 
 
--Rent vs Own
SELECT Home_Ownership, COUNT(*) NUMBER
FROM Configuration.dbo.LendingTree
GROUP BY Home_Ownership
ORDER BY COUNT(*) DESC
 
--Average Income vs Home Ownership
SELECT Home_Ownership,AVG(Annual_Income) Average_Income
FROM [Configuration].[dbo].[LendingTree]
GROUP BY Home_Ownership
ORDER BY Average_Income DESC
 
--States with most Total Loans Amounts.
SELECT TOP 10 Address_State,SUM(Loan_Amount) Loan_Amount
FROM [Configuration].[dbo].[LendingTree]
GROUP BY Address_State
ORDER BY Loan_Amount DESC
 
--States with least Total Loans Amounts.
SELECT TOP 10 Address_State,SUM(Loan_Amount) Loan_Amount
FROM [Configuration].[dbo].[LendingTree]
GROUP BY Address_State
ORDER BY Loan_Amount ASC

Next time we are going to try something fun, like try to predict Home Ownership based on a bunch of the other inputs like Annual Income, Interest Rate, Loan Grade, Loan Amount etc.

Exploring LendingTree Data with Python/SQL

Interesting links

Pages

Categories

Archive