Engineering Lessons Learned from LLM Fine-Tuning
Well, I finally got around to it. What, you say? Fine-tuning an LLM, that’s what. I mean, all the cool kids are talking about it and carrying on like it’s the next big thing. What can I say … I’m jaded. I’ve been working on ML systems for a good few years now, and I’ve seen the best and the worst.
Most of Machine Learning is Data Engineering. That’s the truth. Is the LLM gold rush any different?
Lessons Learned from Fine-Tuning OpenLLaMA
I just wanted to give a few short insights on what it is like to fine-tune an LLM, in my case OpenLLaMA, from a Data Engineering perspective.
First, you can go to GitHub and check out the full repo and code. It contains a nice overview of what it’s like to fine-tune an LLM.
So here are some quick and dirty thoughts … from a Data Engineering perspective.
- You will probably be working on a `Linux` instance to do the actual work.
- You will probably use `Docker` heavily because of the above.
- There are lots of Python tools to `pip` install and manage.
- Playing with LLMs requires A LOT of `memory` AND `disk` in real life.
- Eventually, you will need GPUs (check out `vast.ai` for cheap by-the-hour rentals).
- Because of remote GPU machines, `Docker`, etc., you need to understand `bash` and `ssh` commands.
- Data cleaning and prep is going to be the hardest part and the most code.
- Choose your LLM model up front because it will affect everything downstream.
- Choose your preferred libraries for training and inference up front (e.g., `huggingface`). A minimal training sketch follows this list.
- Lots of scripts to deploy your `code` and `data` to cloud storage (e.g., `s3`) will make your life easier when deploying to remote `GPU` machines. There’s an upload sketch below too.
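
To give you a feel for the library-choice point, here is a minimal sketch of what "huggingface for training" tends to look like: `transformers` plus `peft` for LoRA. The checkpoint name, `train.jsonl` file, and every hyperparameter here are my own illustrative assumptions, not values from the repo.

```python
# A minimal sketch, assuming the openlm-research/open_llama_3b checkpoint
# and a train.jsonl of {"text": ...} records. Hyperparameters are illustrative.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "openlm-research/open_llama_3b"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without one
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA trains small adapter matrices instead of every weight, which is what
# keeps the GPU memory bill sane on rented hardware.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# Assumes a train.jsonl produced by your data-prep step (see below).
dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapter-out",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```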
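
And for the "scripts to deploy your code and data" point, here is one way to do it with `boto3`. The bucket name and directory layout are hypothetical; the idea is just that the remote GPU box pulls everything from `s3` instead of you copying files around by hand.

```python
# A minimal sketch of pushing local code/data to S3 so a remote GPU machine
# can pull it down. Bucket name and paths are hypothetical.
import boto3
from pathlib import Path

s3 = boto3.client("s3")
bucket = "my-llm-artifacts"  # hypothetical bucket

def upload_dir(local_dir: str, prefix: str) -> None:
    """Recursively upload a directory, preserving relative paths as S3 keys."""
    root = Path(local_dir)
    for path in root.rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(root)}"
            s3.upload_file(str(path), bucket, key)
            print(f"uploaded s3://{bucket}/{key}")

upload_dir("src", "code")         # training scripts
upload_dir("data/clean", "data")  # prepped training data
```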
I highly recommend using vast.ai to rent cheap GPUs by the hour. Most of your code, and most of your headaches, will come from gathering data to train on, because it’s unstructured, and then getting it into a semi-structured format. It’s a pain and it takes time. No shortcuts. Something like the rough sketch below is where most of that work starts.
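
To make that concrete, here is a rough sketch of the unstructured-to-semi-structured step: raw text files in, one JSON record per line out. The source directory, field name, and length cutoff are all assumptions for illustration; real cleaning is always messier than this.

```python
# A rough sketch of the unstructured -> semi-structured step: raw text files
# in, one JSON record per line out. File layout and fields are assumptions.
import json
from pathlib import Path

def clean(text: str) -> str:
    # Collapse whitespace and drop blank lines. Real cleaning will be far
    # messier (encoding fixes, dedup, boilerplate stripping, etc.).
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)

with open("train.jsonl", "w") as out:
    for path in Path("raw_docs").glob("*.txt"):  # hypothetical source dir
        text = clean(path.read_text(errors="ignore"))
        if len(text) > 200:  # skip fragments too short to be useful
            out.write(json.dumps({"text": text}) + "\n")
```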
If you want more in-depth info …
You can follow along here: https://dataengineeringcentral.substack.com/p/llms-part-2-fine-tuning-openllama
If you are not down with LLMs … see Part 1, which gives a high-level overview of LLMs (local inference on a laptop). https://dataengineeringcentral.substack.com/p/demystifying-the-large-language-models