Quick Guide to Data Engineering on AWS
If there is an 800lb gorilla in the room, it's AWS. I remember those first days, all those years ago, when I first started messing around with AWS; it was hard to know what I didn't know. Sure, I could cruise the products list on the AWS website, but it wasn't all that helpful. Different groups use different technology.
The DevOps team uses one set of tools, Production Engineering uses another. What tools will most Data Engineers use on AWS? There has to be an 80/20 rule, right? I mean sure, some poor soul is using Amazon SageMaker, but come on. I think there are a few general categories of tools on AWS that fit into some “buckets.”
I would suggest at least learning the basics of one or two tools in each bucket.
- Big Data both Batch and Streaming.
- Storage – Databases, SQL and NoSQL, and cloud storage.
- Data Warehousing and Analytics / BI.
- Small-Medium Data Processing
- Pipeline Orchestration and Dependency management.
Big Data both Batch and Streaming.
If you’re going to do Big Data on AWS, then you’re probably going to want to learn both EMR (Spark) and Kinesis (Streaming). Both batch and streaming Big Data are going to happen as you progress in your career as a Data Engineer. Might as well jump on the bandwagon. These AWS tools have been around forever, and there is a plethora of learning resources available.
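To make the batch side concrete, here is a minimal sketch of the kind of PySpark job you might submit to an EMR cluster with spark-submit. The bucket paths and column names are made up for illustration; the point is the read-aggregate-write shape of a typical Spark job.

```python
# Minimal EMR-style batch job sketch; bucket paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-orders-rollup").getOrCreate()

# Read raw Parquet from S3, roll it up, and write it back partitioned by date.
orders = spark.read.parquet("s3://my-raw-bucket/orders/")

daily = orders.groupBy("order_date", "region").agg(
    F.sum("amount").alias("total_amount"),
    F.count("*").alias("order_count"),
)

daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://my-curated-bucket/daily_orders/"
)

spark.stop()
```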
Storage – Databases, SQL and NoSQL, and cloud storage.
The classic RDS instances, MySQL, Postgres, SQL Server: old tools wrapped up for you and delivered with a bow. If you’re very lucky you will never run into DynamoDB, but it will probably happen, and it is capable of making you cry. Good luck.
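For a feel of why DynamoDB is a different animal than your RDS Postgres, here is a rough boto3 sketch of the key-based access pattern it is built around. The table name and attributes are hypothetical.

```python
# Hypothetical table and attributes; the point is key-based access, not SQL.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_events")  # assumes partition key user_id, sort key event_ts

# Writes are single items addressed by the partition/sort key.
table.put_item(
    Item={"user_id": "u-123", "event_ts": "2024-01-15T09:30:00Z", "event_type": "login"}
)

# Reads go through the key as well; there are no ad hoc JOINs here.
resp = table.query(KeyConditionExpression=Key("user_id").eq("u-123"))
for item in resp["Items"]:
    print(item["event_ts"], item["event_type"])
```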
Data Warehousing and Analytics / BI.
I pity those who have to use Redshift, but it’s been around a while and is used extensively, and MPP data modeling and design is a good tool to have in the belt. Athena is a beast, even if it is a little finicky, and great for some Business Intelligence use cases. QuickSight, do yourself a favor and move on to something else.
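If you want to poke at Athena programmatically instead of through the console, something like this boto3 sketch is the usual shape: kick off a query, poll until it finishes, pull the results. The database, table, and result bucket are placeholders.

```python
# Database, table, and result bucket are placeholders for the sketch.
import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT region, count(*) AS orders FROM daily_orders GROUP BY region",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Athena is asynchronous, so poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```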
Small-Medium Data Processing
This is probably where most folks spend their time and career. Lambdas, Fargate containers, EC2 instances running like the bandit. I mean, you could try to process Big Data with these tools, but it’s not what they are made for. Lambdas are sweet, but with a 10GB memory cap, good luck. Fargate is annoying, nice, but annoying, and you will be disappointed with the resource sizes.
These tools are a must to learn; they are the backbone of a lot of systems.
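For example, a huge amount of small-to-medium processing is just a Lambda hanging off an S3 event. A minimal, hypothetical handler might look like this (bucket names invented for the sketch):

```python
# Hypothetical handler for an S3 put event; bucket names are invented.
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    records = event.get("Records", [])
    for record in records:
        # Each record carries the bucket and key that triggered the invocation.
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj["Body"].read()

        # ... do the actual transform here, then land the output downstream ...
        s3.put_object(Bucket="my-processed-bucket", Key=f"processed/{key}", Body=body)

    return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}
```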
Pipeline Orchestration and Dependency management.
Of course, as Data Engineers, pipeline orchestration and dependency management are core to what we do. AWS took the easy route, thank the Lord, and gave us managed Airflow with MWAA. Thanks for that. Then they have the monstrosities of AWS Data Pipeline and AWS Step Functions, which should be taken out behind the proverbial barn. You can imagine the rest.
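If you do end up on MWAA, the nice part is that a DAG is just a Python file dropped into the dags/ folder of the environment’s S3 bucket. A bare-bones sketch, with invented script paths:

```python
# Bare-bones MWAA-style DAG; the script paths are invented for the example.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="python /usr/local/airflow/dags/scripts/extract.py",
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="python /usr/local/airflow/dags/scripts/transform.py",
    )

    # Dependency management is just the >> operator between tasks.
    extract >> transform
```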
Musings
It’s hard to keep up on all the new products coming out. With Snowflake, Databricks, Astronomer, and the rest joining the fray, how do you know what tools to pick? I don’t know. I look for ease of use, integrations, low code, easy administration, and low cost. I just try to keep the big picture in my head: what tools are good for what jobs.
Most tools on AWS were made to serve a particular need, so don’t try to shove a square peg into a round hole. If you’re using DynamoDB like Postgres, then some architect needs to get their head examined. Just try to keep up on what’s available and the use cases for those tools, keep that marketing material at arm’s length, and do your research. Read the docs, check on constraints, usage costs, resource limits, etc. This type of stuff will keep you well informed and ready to architect some good data pipelines.