boto3 or aws cli for s3 … Python vs Bash and other thoughts.
For any Data Engineer working on aws for any length of time, there is one task that always seems to come up and never go away: manipulating files on s3. Messing with buckets on aws is something I've had to do for years; it just never goes away. It's always something … listing files, moving files, copying files, checking for files, getting the last modified file, checking file sizes, downloading files … it pretty much never ends.
Luckily, aws provides a few tools to make these tasks easy: the handy cli for command-line work, and the trusty boto3 Python package. I want to give an introduction to the common commands Data Engineers have to run with both the aws cli and boto3 to perform various common tasks. We will then compare and contrast which tool to use in our pipelines and the pros and cons of each.
The two aws options for s3 … boto3 and aws cli
I've spent more time than I care to think about messing with files on s3. It's one of those tasks that you get numb to; the list of reasons to mess with files in s3 is endless, and new reasons pop up with every new project. For those who are newer to working with files in s3, I want to go over some of the common ways to do those tasks.
First, let's start with the aws cli command-line tool.
aws cli
AWS provides a wonderful command-line tool, and it works perfectly for various tasks on s3. The installation is very easy and requires very little setup. Once installed and configured, you will get a .aws folder in your home directory. There will be two files of note …
- config
- credentials
They are fairly self-explanatory: config allows you to set up different profiles, say for dev and prod if you have them, and credentials is where your keys are stored. But enough of that, what are the common aws cli s3 commands you will find yourself running?
- listing files – aws s3 ls s3://my_bucket/some_folder
- copying files – aws s3 cp some_local_folder s3://my_bucket/some_folder
- syncing folders – aws s3 sync s3://my_bucket/my_folder s3://other_bucket/another_folder
- of note is the --recursive option, for when we are working with multiple sub-directories
- we can use options like --delete --exclude "*some_files.csv*" --include "*.txt" with these and the other commands to control which files get moved or synced (see the example after this list)

The options you will reach for most are --delete, --exclude, --include, and --recursive.
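For example (the bucket names and prefixes here are made up), a sync that only moves csv files and removes anything at the destination that no longer exists at the source might look like this …

aws s3 sync s3://my_bucket/landing s3://other_bucket/archive --exclude "*" --include "*.csv" --delete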
These commands cover about 90% of what a data engineer will probably do day-to-day when working with files on s3. You will probably come to know these commands and options by heart if you haven't already. (BTW, you can install the aws cli via pip, which is a pretty nice feature.)
boto3
If I'm not using bash files to automate aws cli commands to shove around s3 files for some CI/CD thingy-ma-bob, I often find myself using the Python package boto3. It's another great way, although sometimes the more annoying option, to code s3 actions. It isn't used that often in "big data" work because I wouldn't call it that performant, but you can do just about anything you can think of to s3 with boto3.
The list of actions is endless, but mainly I find myself using the following features of boto3 to mess with s3 files and buckets.
- list and paginate s3 bucket contents
- find last modified file
- filter bucket contents
- download file(s) or folder(s)
- copy file(s) or folder(s)
- upload file(s)
I typically use a client session of boto3 to do most things on s3. Of course, your aws keys need to be available as environment variables; never keep them in the code.
import boto3

s3_client = boto3.client('s3')
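If you set up multiple profiles in the config file mentioned earlier, you can also pick one explicitly through a session (the profile name dev here is just an example) …

session = boto3.Session(profile_name='dev')
s3_client = session.client('s3')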
Many times I will use boto3 to paginate through all the objects in an s3 bucket (paging through contents is common practice, similar to the concept when working with REST APIs).
def get_pages(client: object, bucket: str) -> list:
    # collect the 'Contents' of every page returned by the list_objects paginator
    paginator = client.get_paginator('list_objects')
    page_iterator = paginator.paginate(Bucket=bucket)
    # 'Contents' is missing on empty pages, so fall back to an empty list
    pages = [page.get('Contents', []) for page in page_iterator]
    return pages
Or maybe I want to look through all the pages to find files matching a prefix and get the last modified one.
def get_latest_file(pages: list, prefix: str) -> str:
    # flatten the pages into one list of objects
    all_files = []
    for page in pages:
        all_files.extend(page)
    # keep only the keys matching the prefix, paired with their LastModified timestamps
    files = [(file['Key'], file['LastModified']) for file in all_files if prefix in file['Key']]
    if not files:
        return ''
    # newest first
    files.sort(reverse=True, key=lambda x: x[1])
    recent_file = files[0][0]
    print(recent_file)
    return recent_file
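Tying those two together (the bucket name and prefix here are made up) looks something like this …

pages = get_pages(s3_client, 'my-wonderful-bucket')
latest_key = get_latest_file(pages, 'daily_export')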
I personally find boto3 a little verbose to use; some things you have to do manually that seem like they should come out of the box. I mean, it is Python after all. But don't get me wrong, some things, like copying or deleting files, are pretty straightforward.
def copy_s3_file(client: object, key: str, new_key: str) -> None:
    # copy an object from one bucket to another under a new key
    copy_source = {
        'Bucket': 'my-wonderful-bucket',
        'Key': key
    }
    client.copy(copy_source, 'some-other-wonderful-bucket', new_key)

def delete_object(client: object, key: str) -> None:
    client.delete_object(Bucket='my-wonderful-bucket', Key=key)
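Downloading and uploading single files is just as simple with download_file and upload_file. Here is a minimal sketch (the helper names, bucket, and paths are my own, not anything official) …

def download_s3_file(client: object, bucket: str, key: str, local_path: str) -> None:
    # pull one object down to a local file
    client.download_file(bucket, key, local_path)

def upload_s3_file(client: object, local_path: str, bucket: str, key: str) -> None:
    # push a local file up to the bucket under the given key
    client.upload_file(local_path, bucket, key)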
Other Thoughts.
I use a mix of both bash / aws cli and boto3 in most of my production code writing. They both have their uses and places in this world.
- aws cli for quick development work.
- aws cli + bash for most CI/CD and other infrastructure and deployment jobs.
- boto3 for production code.
- aws cli for large amounts of work needing to be done in s3.
- boto3 for intricate s3 work.
I've seen a lot of people use Python subprocess calls to kick off the aws cli in production code. I don't like this approach and it rarely works out well in my opinion … it's never unit tested and never catches or handles errors or exceptions very well. On the other hand, boto3 is great for intricate aws s3 work; you can unit test the crud out of it and handle all sorts of exceptions and problems.
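As one small example of what that error handling can look like (object_exists is my own helper, not part of boto3), checking whether a key exists before doing anything else might look like this …

from botocore.exceptions import ClientError

def object_exists(client: object, bucket: str, key: str) -> bool:
    # head_object raises a ClientError with a 404 code when the key is missing
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as error:
        if error.response['Error']['Code'] == '404':
            return False
        raise  # anything else (permissions, throttling) should bubble up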
You should learn to use both.
Different situations call for different solutions … if you need to resort to boto3 when you're doing quick development or exploratory work, you're probably going to waste a lot of time. You need to learn the aws cli for quick and dirty work. On the other hand, learning all the nuances of boto3 for s3 file and folder manipulation can get a little old, but if you write your functions correctly you can pretty much re-use them over and over again.
I'm curious to know what you use most of the time for your s3 work: the cli, boto3, or something else? Drop a comment and let me know.