Data Engineering Basics. Python and Data. RDBMS vs Pandas.
Hmmm… What to use… What to use? When I want to explore data quickly and with the least amount of pain, the first problem I face is where to start. There are a million approaches, and I'm usually thinking long-term: ease of maintenance, the surrounding platform, etc., etc.
Source code can be found on my GitHub page: IowaCornYields.
But you have to start somewhere, right? The first major hurdle: do I need an RDBMS (Relational Database Management System) or not? RDBMSs (saying this makes me think of ROUSs (Rodents Of Unusual Size)) have a lot of advantages, like the speed and ease of basic SQL. But they come with some serious headaches up front. What headaches? I basically have two projects now…
- Just get data
- Get data into the RDBMS so I can play with it.
So what about just using some sweet Python library like Pandas? Then I can get data and play with it in one breath.
Let’s answer a simple question using each approach and see which we like better. The first thing I did was go to https://quickstats.nass.usda.gov/, a great place to get free, interesting data in CSV format. Lots of agriculture data. I downloaded Iowa corn yields by county in a few clicks. The file can be found on my GitHub page. It’s simple: Year, County Name, Yield in bushels per acre, and some other random stuff.
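Before doing anything clever, it’s worth a two-second peek at the header row so you know exactly which columns you’re dealing with, and where they sit. A minimal sketch, assuming you saved the file to the same path I use below:

```python
import csv

# Print each column name along with its position in the header row
with open('C://Iowa Corn Yields.csv', 'r') as f:
    header = next(csv.reader(f))
for i, name in enumerate(header):
    print(i, name)
```

Those index positions will matter later when we go the RDBMS route.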
Question: Last year (2017), what were the top 10 counties in Iowa with the highest corn yields?
Let’s find out!
Option 1: Pandas (pip install pandas if you don’t have it)
```python
import pandas as pd

# Read the CSV straight into a DataFrame.
df = pd.read_csv('C://Iowa Corn Yields.csv')

# Sum the yield (Value) by year and county.
total = df.groupby(['Year', 'County'])[['Value']].sum()

# Filter to 2017, sort descending, and take the top 10.
print(total.query('Year == 2017').sort_values('Value', ascending=False).head(10))
```
| County | Value (bu/acre) |
| --- | --- |
| MARSHAL | 222.4 |
| CEDAR | 222.3 |
| CRAWFORD | 221.9 |
| BUCHANAN | 221.3 |
| SIOUX | 220.7 |
| TAMA | 220.6 |
| WASHINGTON | 220.4 |
| SCOTT | 220.0 |
| LYON | 218.9 |
Dude! How easy was that?? pd.read_csv() <- pretty sure it doesn’t get much easier than that.
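And if you want to skip the “other random stuff” on the way in, read_csv can trim the file down for you. A quick sketch; I’m assuming here that the header literally names the columns Year, County, and Value (peek at your file first):

```python
import pandas as pd

# Only load the three columns we actually care about (names assumed from the header)
df = pd.read_csv('C://Iowa Corn Yields.csv', usecols=['Year', 'County', 'Value'])
print(df.head())
```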
Anyone who knows anything about SQL will recognize the groupby() statement in pandas. It does exactly what it says: I want to SUM() up bushels per acre by year, by county.
Also, an easy way to filter in pandas is to use .query() on a DataFrame, which is how I filtered down to 2017. Of course, .sort_values() is obvious enough, and .head(10) just gives me the top 10.
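If .query() feels odd (it takes its expression as a string), plain boolean masking gets you the same answer. A minimal sketch, same file and the same aggregation as above:

```python
import pandas as pd

# Same sum as before, but reset_index() flattens Year/County back into columns
df = pd.read_csv('C://Iowa Corn Yields.csv')
total = df.groupby(['Year', 'County'])[['Value']].sum().reset_index()

# Boolean mask instead of .query(): keep 2017 rows, sort, take the top 10
top10 = total[total['Year'] == 2017].sort_values('Value', ascending=False).head(10)
print(top10)
```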
Option 2: RDBMS.
First things first, I need to create the SQL table to hold the data I want.
```sql
CREATE TABLE dbo.CornData (
    Year INT,
    County NVARCHAR(150),
    Value DECIMAL(10,1)
)
```
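If you’d rather not open a SQL client just for that one statement, the same DDL can go through pyodbc too. A minimal sketch, using the same placeholder connection string as the script below:

```python
import pyodbc

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=YOURSERVER;DATABASE=YOURDATABASE;UID=YOURUSERID;PWD=YOURPASSWORD')
cursor = cnxn.cursor()

# Create the table only if it isn't already there
cursor.execute("""
IF OBJECT_ID('dbo.CornData', 'U') IS NULL
    CREATE TABLE dbo.CornData (Year INT, County NVARCHAR(150), Value DECIMAL(10,1))
""")
cnxn.commit()
cnxn.close()
```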
Then away we go with the Python…
```python
import csv
import pyodbc

def readinfile(path):
    # Read the whole CSV into a list of rows.
    with open(path, 'r') as f:
        reader = csv.reader(f)
        return list(reader)

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=YOURSERVER;DATABASE=YOURDATABASE;UID=YOURUSERID;PWD=YOURPASSWORD')
cursor = cnxn.cursor()

corn_data = readinfile('C://Users//Daniel.Beach//AppData//Local//Programs//Python//Python35-32//Iowa Corn Yields.csv')

# Skip the header row, pick out the columns we need, and insert row by row.
for line in corn_data[1:]:
    year = line[0]
    county = line[8]
    value = line[18]
    cursor.execute("INSERT INTO dbo.CornData (Year, County, Value) VALUES (?, ?, ?)",
                   year, county, value)
cnxn.commit()

# Grab the top 10 counties by total bushels for 2017, with plain SQL.
cursor.execute("SELECT TOP 10 County, SUM(Value) AS Bushels "
               "FROM dbo.CornData WHERE Year = 2017 "
               "GROUP BY County ORDER BY Bushels DESC")
for row in cursor.fetchall():
    print(row)

cnxn.close()
```
Jezzz. That wasn’t very fun. I had to import two modules, make sure Python had access to the RDBMS, create the SQL table, read the data into a list, figure out which columns held the data I needed, loop through the list and write it to the RDBMS, then query the RDBMS with a SQL statement and print the results back out.
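In fairness, the row-by-row INSERT above is about the slowest way to do it; pyodbc’s executemany() lets you hand over the whole list in one call. A minimal sketch, same placeholder connection and the same column positions as before (swap in your own path):

```python
import csv
import pyodbc

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=YOURSERVER;DATABASE=YOURDATABASE;UID=YOURUSERID;PWD=YOURPASSWORD')
cursor = cnxn.cursor()

with open('C://Iowa Corn Yields.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = [(line[0], line[8], line[18]) for line in reader]

# One batched call instead of one execute() per row
cursor.executemany("INSERT INTO dbo.CornData (Year, County, Value) VALUES (?, ?, ?)", rows)
cnxn.commit()
cnxn.close()
```

Still a whole lot more ceremony than pd.read_csv(), though.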
Which one would you rather do??