, , ,

Database/SQL Fundamentals for Data Engineers

I’ve meet my fair share of snooty people who poo poo SQL and databases as second class hand-me-downs. I still remember talking to an academic computer science grad who was explaining to me how he refused to teach database classes, he was just too good for that. Whatever. Apparently refusing to accept how 90% of companies are able to operate as data driven businesses just isn’t important to some people. There is probably nothing more important in the tool belt of a data engineer than being above average at SQL and databases. Tuning queries, writing queries, indexing, designing data warehouses. I’m sure there are some Hadoop data engineers who skipped this step of RDBMS world, but that is not the normal path of a data engineer. Let’s dive into the fundamentals of SQL and databases.

Does the type of database matter?

Someone might ask…. shouldn’t we worry about Postgres, MySQL, SqlServer… doesn’t that matter? Not really. Being good at the fundamentals of SQL really has little to do with brand name slapped on top. Remember, we aren’t talking about being a DBA here… we are talking about being a data engineer. We’ve aren’t setting up automatic failover clusters.

The DBA’s ears perk up when I say the type of RDBMS doesn’t matter.

I think of it like working on your car, of course there are differences and you might have to check the manual a few times. But once you figure out how to change the oil, the brakes, air-filter, the light bulbs… you can work on almost any car ( just not a Tesla… aka an Oracle database … puke).

All the angry DBA’s right now can just send me emails that I will promptly delete. No, the type of RDBMS doesn’t matter to learn SQL/database fundamentals.

The fundamentals of SQL/Databases.

  • OLTP vs. OLAP.
  • Table design/layout.
  • Data warehousing concepts.
  • Understanding indexing basics.
  • How to write fast/tune queries.
  • Where to look for common problems.

OLTP vs. OLAP

I won’t spend much time here, but understanding OLTP vs OLAP is pretty important. There are plenty of long winded explanations about this, but it really isn’t that hard.

  • OLTP = highly transactional in nature.
  • OLAP = analytical in nature…. think data warehouse.

Many times OLTP tables will be wide with lots of columns, designed to handle high volumes of inserts usually. Think maybe an online orders coming into a system, financial or banking transactions hitting a table. OLAP on the other hand is designed for aggregating and summarizing data, think analytics and data warehousing.

Table design/layout

I’m going to mostly talk about table design and layout that data engineers will mostly likely run into. If you start reading a lot about database and table design you will probably run into people talking about normal forms… “you should design this in the third normal form.” I do suggest you read up on this concept if you plan on working in and around classical RDBMS data warehouse environments. I believe table designs are closely related to the OLTP and OLAP thoughts list above. It really isn’t that hard to avoid bad table design, most of it is common sense. Here are some rules I’ve used in the past.

  • You have to know the queries first (Try to understand data access patterns, even at a high level before trying to design table layouts).
  • Break the data up into logical units (customer data, orders, products, blah blah).
  • Don’t put everything in one table (this use case is rare).
  • When you’re designing a table think about how it relates to other tables (…. how will it join to other data).
  • Simple is usually better than fancy ( and means faster queries in the end most likely.)
  • Understand data types ( if you haven’t though through the data type of each value, then you aren’t ready to design the table yet).

Let’s take a simple example. Hobbits…. everyone loves hobbits.

who doesn’t like hobbits?
CREATE TABLE middle_earth.Hobbits (hobbit_id INT NOT NULL,
                                   first_name VARCHAR(150),
                                   last_name VARCHAR(150),
                                   age INT DEFAULT 1,
                                   favorite_food VARCHAR(250)
                                  );
CREATE TABLE middle_earth.Quests (quest_id INT NOT NULL,
                                  quest_name VARCHAR(150),
                                  location VARCHAR(200),
                                  hobbit_id INT,
                                  treasure_value BIGINT
                                  );
CREATE TABLE middle_earth.Hobbit_Companions (companion_id INT NOT NULL,
                                             first_name VARCHAR(150),
                                             last_name VARCHAR(150),
                                             power_level INT DEFAULT 10
                                            );
CREATE TABLE middle_earth.Adventures (hobbit_id INT NOT NULL,
                                      companion_id INT NOT NULL,
                                      quest_id INT NOT NULL
                                     );

Hobbits… so I made a hobbit table. Hobbits always go on quests… so that is an obvious one. And of course a hobbit will never leave his hobbit hole without a companion, bingo. Finally they will set off on an adventure together, last table. Just apply that simple logic to database table design and layout. Think about them as pieces of puzzle and consider how they relate to each-other. So which hobbits have been going on adventures with whom?

SELECT h.first_name, h.last_name, h.first_name, h.last_name, q.quest_name
FROM Hobbits h
INNER JOIN Adventures a ON h.hobbit_id = a.hobbit_id
INNER JOIN Hobbit_Companions c ON c.companion_id = a.companion_id
INNER JOIN Quests q ON q.quest_id = a.quest_id AND q.hobbit_id = a.hobbit_id

You get the idea. Table design and layout is half art and half science.

Data warehousing concepts.

Closely related to the last topic but a beast of its own. Many of the same concepts still apply when designing classic RDBMS data warehouses. But that are a few topics that will show up a lot and you had better read up on.

  • Facts and Dimensions (think of facts as values that summarized a lot…. sales for example).
  • It’s all about aggregating values.
  • Think about everything we talked about in the previous list.
  • You should definitely know the aggregate queries that will be running.
  • Think about fact tables (that will be aggregated) as the hub in a spoke of a wheel. All the spindles coming off dimensions.
  • Star schema is your friend.

Think about it this way….

CREATE TABLE middle_earth.Adventures (hobbit_id INT NOT NULL,
                                      quest_id INT NOT NULL,
                                      date_quest_completed DATE,
                                      amount_of_treasure BIGINT DEFAULT 0,
                                     );

A good fact table in a data warehouse should consist of some nondescript id’s, maybe a date, and the values you would want to aggregate on. Say like the amount of treasure a hobbit was able to compile over different time intervals. In the above examples the Hobbit table is a good example of a dimension, a table that has important information that better describes and extends the data, but may not be critical for the calculation and aggregation of values. So…

SELECT hobbit_id, MONTH(date_quest_completed) as mnth, YEAR(date_quest_completed) as yr SUM(amount_of_treasure) as treasure,
FROM middle_earth.Adventures
GROUP BY hobbit_id, MONTH(date_quest_completed), YEAR(date_quest_completed)

So we can easily formulate how much treasure our hobbits are gathering over time. Remember simple = better = faster, when it comes to database queries. And data warehouses tend to get large… so designing your tables correctly becomes important.

Understanding indexing basics

Yes, this is what can make a break your database design, OLAP or OLTP. The best tables in the world can quickly become useless with no or incorrect indexing. And honestly, the basics don’t care about what kind of RDBMS you are running… SqlServer, Postgres, MySQL… doesn’t matter.

  • Every table should have indexes ( you might be surprised at this novel idea ).
  • Table join keys should be the primary and most important index (how two tables link together and relate to each other).
  • Columns that show up in GROUP BY and WHERE clauses should be in indexes.
  • If you have to put 10 indexes on your table go back and read the section on table design and layout, because you failed ( or google Data Mart).
  • Think about what makes each table unique. This is a critical topic in database design.
  • Honestly, doing the above will put you way ahead of most people.

Simple examples from above. If you go back and look at the hobbit tables we did in the table design/layout section above it should be very obvious where to place indexes.

  • hobbit_id , quest_id, and companion_id are pretty obvious choices right? They make records in certain tables unique.
  • fist_name and last_name are extremely common fields that we know our end users will want all the time.

What about the data warehouse table Adventures ?

  • Again the ids are going to be extremely important.
  • But also the column amount_of_treasure is being aggregated and date_quest_completed is being used in the GROUP BY. Doesn’t it make sense these would be in a index together?

I really think learning the very basic and obvious approach to database indexing will get you 80% of the way there. Of course the topic is complicated and the examples and use cases can take years of learning to fine tune. Lots to learn but stick with the basics and they will serve you well.

How to write fast/tune queries

your queries gotta be fast.

If you write SQL you will run into slow queries. There are can be two causes of this…

  • poorly written queries
  • underlying database problems.

Firstly, let’s talk about poorly written queries… again, regardless of the database system. A key concept is that databases where designed to have you work on data in sets, groups of data, not row by row.

  • Never do SELECT * , the more data you are asking for the more time your query will take.
  • If your going to do a subquery… don’t put it in another SELECT statement …requiring it to be run for every row.
  • Always have a WHERE clause. Since when do you need to run a query without a WHERE clause?
  • Add more things to the WHERE clause 🙂
  • Stop using VIEWS, they suck and always will.
  • If you have complex data that requires periphery information in the end…. don’t pull that data in until the end. Aggregate first, pull in the details later.
  • Stop using udf‘s, they suck as well.
  • Do you have 7+ table joins in your first query before you’ve even started to subquery? Think again.
  • Understand the data types and the data types of the table joins.

I could go on but you get the point again I’m sure. Just like in good table design, good query design and tuning starts with the simple approach. Think about the very basic things you need to accomplish first, do that and then add the detail later. Try to use SQL that applies to the entire dataset, like GROUP BY, WHERE instead of using a udf or a SELECT in the SELECT statement.

Always try to pair the data down as much as possible before getting fancy.

Where to look for common problems

I usually consider this the last resort. If everything else above is done properly you shouldn’t have to dig farther down very often. But it will happen. And the problems are always side effects of what was not done above.

  • Check indexes first. Are you joining on some column with no index?
  • Are the index stats being updated?
  • Look for bad SQL. Do the joins and subqueries look reasonable?
  • Inspect the complex parts of the query. If something is complex it’s probably like that for a reason. Find out why.
  • Learn how to read query execution plans and stats. Some SQL query problems can only be solved by getting way more into the weeds than you want too. Yourself a year later will thank yourself that is doing it now.

Conclusion

SQL has always been near and dear to my heart, being on data warehousing teams is how I cut my teeth on working in IT. After years of working around data warehousing, business intelligence, watching software engineer’s struggle with database design and queries… while at the same time being the best programmers…. its become obvious to me that the simple things are what matters.

There are always cases where even the DBA’s have problems, that’s just life. But, most of the time sticking to the fundamentals will get you most of the way there.