LLM Fine-Tuning vs Prompting vs RAG: When to Use What? #
Day 1 of 30-Day LLM Fine-Tuning Journey
When I started working with LLMs, I faced the same question everyone asks: Should I fine-tune, use prompting, or build a RAG system?
After building systems with all three approaches over the past year, I’ve learned that the “best” choice isn’t about the technology—it’s about your constraints. Let me walk you through how to make this decision.
The Three Approaches #
Prompting: The Quick Start #
Prompting is like having a conversation with a smart assistant. You give it instructions and examples, and it figures out what you want.
```python
# Customer service classification example
prompt = """
Classify this customer message:
- BILLING: payment or invoice issues
- TECHNICAL: product not working
- CANCELLATION: wants to cancel service

Message: "My internet has been down for 3 hours"
Category:
"""
```
When I use it: Quick prototypes, low-volume tasks, when I need something working today.
The catch: Performance varies with prompt quality. Long prompts get expensive fast.
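To see why long prompts get expensive, here is a rough back-of-the-envelope estimate; the token counts and per-token prices below are illustrative assumptions, not current pricing:

```python
# Rough per-query cost estimate (illustrative numbers, not current pricing)
PROMPT_TOKENS = 800          # assumed: instructions + examples + the message
COMPLETION_TOKENS = 20       # assumed: a short category label
PRICE_PER_1K_INPUT = 0.0015  # assumed $ per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.002  # assumed $ per 1K output tokens

cost_per_query = (
    (PROMPT_TOKENS / 1000) * PRICE_PER_1K_INPUT
    + (COMPLETION_TOKENS / 1000) * PRICE_PER_1K_OUTPUT
)
print(f"~${cost_per_query:.4f} per query, ~${cost_per_query * 10_000:.2f} per 10,000 queries")
```

Doubling the prompt roughly doubles the input cost, which is why few-shot prompts with many examples add up at volume.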
RAG: The Knowledge Expert #
RAG combines your LLM with a searchable knowledge base. Think of it as giving the model access to Google, but for your specific domain.
```python
# Basic RAG setup (sketch)
from sentence_transformers import SentenceTransformer

class SimpleRAG:
    def __init__(self):
        self.knowledge_base = []  # Your documents
        self.retriever = SentenceTransformer('all-MiniLM-L6-v2')

    def ask(self, question):
        # Find relevant docs (find_similar, not shown: embed the question
        # with self.retriever and return the closest documents)
        relevant_docs = self.find_similar(question)

        # Ask the LLM with the retrieved context
        prompt = f"""
        Context: {relevant_docs}
        Question: {question}
        Answer based on the context:
        """
        return self.llm.generate(prompt)  # self.llm (not shown): any LLM client
```
When I use it: When accuracy depends on specific information that changes frequently, like product catalogs or documentation.
The catch: Quality depends entirely on what you put in the knowledge base.
Fine-tuning: The Specialist #
Fine-tuning modifies the model’s parameters to make it better at your specific task. It’s like training a general practitioner to become a specialist.
```python
# Fine-tuning setup with Hugging Face (sketch)
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")

# Your training data
training_data = [
    {"input": "customer complaint about billing", "output": "I understand your billing concern..."},
    # ... more examples
]

training_args = TrainingArguments(output_dir="./fine-tuned-model", num_train_epochs=3)

trainer = Trainer(
    model=model,
    train_dataset=prepare_dataset(training_data),  # prepare_dataset (not shown): tokenize and format the examples
    args=training_args,
)
trainer.train()
```
When I use it: High-volume applications where consistency matters, or when I need the model to behave in a very specific way.
The catch: Requires good training data and compute resources.
Understanding the Trade-offs: When Each Approach Works Best #
Domain-Specific Knowledge Tasks #
Example Scenarios: Legal document review, medical diagnosis assistance, technical support
Prompting works when:
- Task involves general knowledge the model already has
- Simple pattern recognition is sufficient
- You can provide good examples in the prompt
RAG excels when:
- You need current, specific information (recent case law, updated medical guidelines)
- Information changes frequently and needs to stay accurate
- You want to show users exactly where answers come from
Fine-tuning wins when:
- The domain has very specialized language (legal jargon, medical terminology)
- Consistency in response format is critical
- You have thousands of examples of domain-specific conversations
Customer-Facing Applications #
Example Scenarios: Chatbots, help desk automation, customer service
Typical patterns observed:
- Start with prompting for quick prototyping and testing user interactions
- Add RAG for product information, policies, and FAQ responses
- Consider fine-tuning when you need consistent brand voice across thousands of interactions
Key consideration: Customer service often needs explainability - RAG’s ability to cite sources makes it valuable for building trust.
Content Generation Tasks #
Example Scenarios: Code documentation, technical writing, marketing copy
Approach selection depends on:
- Prompting: Good for one-off content or when you can provide detailed specifications
- RAG: Ideal when content should reference existing materials or follow established examples
- Fine-tuning: Best when you need to match a very specific style or format consistently
How to Choose: Decision Framework #
Comprehensive Decision Matrix #
Factor | Prompting | RAG | Fine-tuning |
---|---|---|---|
Daily Volume | < 1,000 queries | 1,000-10,000 queries | > 10,000 queries |
Monthly Budget | < $100 | $100-500 | > $500 initial investment |
Time to Deploy | 1-4 hours | 1-3 weeks | 2-8 weeks |
Accuracy Requirements | 60-80% acceptable | 70-85% needed | > 85% required |
Knowledge Updates | Static/rare updates | Frequent (weekly/monthly) | Stable domain patterns |
Consistency Needs | Variable output OK | Some variation acceptable | High consistency critical |
Data Available | 5-50 examples | 1,000+ documents | 1,000+ labeled examples |
Infrastructure Tolerance | API calls only | Moderate complexity | High complexity acceptable |
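To make the matrix easier to apply, here is a toy heuristic that encodes a few of the rows above. It is purely illustrative: the function name and thresholds are mine, and real decisions involve more factors than a handful of numbers.

```python
def suggest_approach(daily_queries, monthly_budget_usd, accuracy_target,
                     labeled_examples, knowledge_changes_often):
    """Toy heuristic mirroring the decision matrix above (illustrative only)."""
    if knowledge_changes_often and labeled_examples < 1_000:
        return "RAG"  # fresh knowledge matters more than specialization
    if daily_queries > 10_000 and accuracy_target > 0.85 and labeled_examples >= 1_000:
        return "fine-tuning"  # volume and consistency justify the upfront cost
    if daily_queries < 1_000 or monthly_budget_usd < 100:
        return "prompting"  # cheapest path to something working today
    return "RAG"  # reasonable default for mid-sized, knowledge-heavy workloads

print(suggest_approach(daily_queries=500, monthly_budget_usd=50,
                       accuracy_target=0.75, labeled_examples=20,
                       knowledge_changes_often=False))  # -> "prompting"
```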
The Numbers That Matter #
Based on projects I’ve worked on:
Performance Comparison* #
Metric | Prompting | RAG | Fine-tuning |
---|---|---|---|
Accuracy Range | 60-85%** | 70-90%** | 70-95%** |
Consistency | Variable | Moderate | High*** |
Response Time | 1-5 seconds**** | 2-8 seconds**** | 0.5-3 seconds**** |
Domain Adaptation | Limited | Good | Excellent*** |
Setup Complexity | Minimal | Moderate | High |
*Performance varies dramatically by task type, model size, implementation quality, and evaluation methodology
**Ranges based on limited published benchmarks and author experience - significant variation exists
***Recent research shows RAG often outperforms fine-tuning, especially for knowledge-intensive tasks
****Highly dependent on model size, hardware, network latency, and implementation
Cost Analysis (10,000 monthly queries)* #
Approach | Monthly Cost | Initial Setup Cost | Break-even Point |
---|---|---|---|
Prompting | $5-50** | $0 | Immediate |
RAG | $50-200*** | $1,000-5,000 | 6-18 months |
Fine-tuning | $20-100**** | $2,000-20,000 | 12-36 months |
*Costs vary dramatically based on model choice, usage patterns, data requirements, and infrastructure decisions
**Based on current API pricing but excludes prompt engineering and iteration costs
***Includes vector database hosting, embedding costs, and infrastructure - enterprise solutions cost significantly more
****Training costs include data annotation ($10-50 per example), compute, and infrastructure - ongoing inference varies widely
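As a sanity check on the break-even column, the arithmetic looks like this. Every figure below is an assumption for illustration (a higher-volume scenario where a fine-tuned model replaces per-query API spend); at low volume the savings can be negative and fine-tuning never pays for itself on cost alone.

```python
# Break-even sketch (all figures are illustrative assumptions, not quotes)
prompting_monthly = 500    # assumed API spend at ~100k queries/month
finetuned_monthly = 100    # assumed self-hosted inference at the same volume
setup_cost = 8_000         # assumed one-time annotation + training cost

monthly_savings = prompting_monthly - finetuned_monthly
print(f"Break-even after ~{setup_cost / monthly_savings:.0f} months")  # ~20 months
```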
Timeline Comparison* #
Phase | Prompting | RAG | Fine-tuning |
---|---|---|---|
Setup Time | 1-4 hours | 1-3 weeks | 2-8 weeks |
Data Preparation | 30 minutes | 1-5 days | 1-4 weeks |
Training Time | None | None | 4-48 hours |
Iteration Speed | Immediate | 2-8 hours | 1-3 days |
*Timelines vary significantly based on team experience, data quality, and project complexity
Common Implementation Pitfalls #
1. Jumping to Complex Solutions Too Early #
Many teams assume fine-tuning is the “professional” approach without validating the concept with simpler methods first.
Better approach: Always prototype with prompting to understand the task requirements and establish baseline performance.
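One way to make that baseline concrete is a tiny hand-labeled evaluation set and a scoring loop. The sketch below assumes a hypothetical classify_with_prompt() helper that wraps whatever prompt you are prototyping:

```python
# Minimal baseline check for a prompt-based classifier (sketch)
# classify_with_prompt is a hypothetical stand-in for your prompted LLM call.
eval_set = [
    ("My invoice is wrong", "BILLING"),
    ("The app crashes on startup", "TECHNICAL"),
    ("Please close my account", "CANCELLATION"),
]

correct = sum(
    1 for message, label in eval_set
    if classify_with_prompt(message).strip().upper() == label
)
print(f"Baseline accuracy: {correct / len(eval_set):.0%}")
```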
2. Underestimating Data Quality Requirements #
Both RAG and fine-tuning are only as good as their underlying data.
For RAG: Poorly structured or outdated documents lead to irrelevant retrievals.
For Fine-tuning: Inconsistent or biased training data produces unreliable models.
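A few cheap checks catch a surprising share of these problems before they reach either pipeline. This sketch assumes training examples shaped like the {"input": ..., "output": ...} dicts from the earlier fine-tuning example:

```python
# Quick sanity checks on fine-tuning data (sketch; training_data as defined earlier)
def audit_examples(examples):
    issues, seen_inputs = [], set()
    for i, ex in enumerate(examples):
        if not ex.get("input", "").strip() or not ex.get("output", "").strip():
            issues.append(f"example {i}: empty input or output")
        if ex.get("input") in seen_inputs:
            issues.append(f"example {i}: duplicate input")
        seen_inputs.add(ex.get("input"))
    return issues

for problem in audit_examples(training_data):
    print(problem)
```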
3. Ignoring Total Cost of Ownership #
Initial cost comparisons often miss ongoing maintenance, data updates, and infrastructure scaling.
Hidden costs to consider:
- Data annotation and cleaning
- Infrastructure monitoring and scaling
- Model retraining cycles
- Compliance and security audits
Technical Implementation Notes #
Data Requirements Comparison #
Approach | Data Needed | Quality Requirements | Preparation Time |
---|---|---|---|
Prompting | 5-50 examples | High-quality examples | Minutes |
RAG | 1,000-100,000 documents | Factual accuracy, good coverage | Days |
Fine-tuning | 1,000-50,000 training pairs | Consistent labeling, representative | Weeks |
Infrastructure Requirements* #
Component | Prompting | RAG | Fine-tuning |
---|---|---|---|
Compute | API calls only | CPU for retrieval + GPU optional | GPU required for training |
Storage | Minimal (< 1GB) | Vector database (1-100GB) | Model weights (5-50GB) |
Memory | < 1GB | 4-32GB (varies by scale) | 8-80GB (varies by model size) |
Monitoring | API usage tracking | Retrieval quality + response accuracy | Training metrics + model drift |
Maintenance | Prompt optimization | Knowledge base updates | Periodic retraining |
*Requirements scale significantly with model size, data volume, and performance needs
Infrastructure #
Prompting: Just API calls
```python
import openai

def simple_prompt_query(question):
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

# Usage
result = simple_prompt_query("Classify this email as urgent or normal: ...")
```
RAG: Vector database + embedding service + LLM
```python
# Need: ChromaDB/Pinecone + OpenAI/Cohere embeddings + LLM API
import chromadb
from sentence_transformers import SentenceTransformer
import openai

class BasicRAG:
    def __init__(self):
        self.client = chromadb.Client()
        self.collection = self.client.create_collection("docs")
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')

    def add_documents(self, documents):
        embeddings = self.encoder.encode(documents)
        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=documents,
            ids=[f"doc_{i}" for i in range(len(documents))]
        )

    def query(self, question, top_k=3):
        query_embedding = self.encoder.encode([question])
        results = self.collection.query(
            query_embeddings=query_embedding.tolist(),
            n_results=top_k
        )
        context = "\n".join(results['documents'][0])
        prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
        response = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

# Usage
rag = BasicRAG()
rag.add_documents(["Document 1 content...", "Document 2 content..."])
result = rag.query("What does the policy say about returns?")
```
Fine-tuning: GPU access + model storage + serving infrastructure
```python
# Need: CUDA-enabled GPU, model hosting, monitoring
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import Dataset
import torch

# Check GPU availability
if torch.cuda.is_available():
    device = "cuda"
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("Warning: No GPU available, training will be slow")

# Load model and tokenizer
model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add padding token if missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Prepare dataset
def prepare_training_data(examples):
    inputs = tokenizer(
        examples["text"],
        truncation=True,
        padding=True,
        max_length=512,
        return_tensors="pt"
    )
    inputs["labels"] = inputs["input_ids"].clone()
    return inputs

# Training configuration
training_args = TrainingArguments(
    output_dir="./fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    fp16=True,  # Memory optimization
)

# Note: This is a simplified example.
# Production fine-tuning requires careful data preparation,
# validation sets, and monitoring for overfitting.
```
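To make the validation and overfitting point concrete, here is one way the pieces above might be wired together. The texts list, its formatting convention, and the split size are made up for illustration; a real project needs a proper data pipeline and task-specific evaluation metrics.

```python
# Illustrative wiring of the pieces above (made-up data, sketch only)
texts = [f"customer message {i} ||| agent reply {i}" for i in range(200)]
dataset = Dataset.from_dict({"text": texts}).map(prepare_training_data, batched=True)
splits = dataset.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],  # watch eval_loss to catch overfitting
)
trainer.train()
```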
What’s Coming Next #
Next, I’ll dive into the economics of fine-tuning. We’ll try to figure out exactly when the training investment pays off, using real cost data from different cloud providers.
The goal is simple: by day 30, you’ll know exactly when to use each approach, and you’ll have hands-on experience implementing all three.
References #
- Brown et al. (2020): “Language Models are Few-Shot Learners” - established prompting capabilities
- Lewis et al. (2020): “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” - the original RAG paper
- Howard & Ruder (2018): “Universal Language Model Fine-tuning for Text Classification” - transfer learning principles
- Recent empirical studies (2024) showing context-dependent performance trade-offs
- Published benchmarks and industry reports on LLM deployment patterns
Important Note: Recent research indicates that RAG often outperforms fine-tuning for knowledge-intensive tasks, contrary to conventional wisdom. The optimal approach depends heavily on specific use case requirements, data availability, and implementation quality.
Tomorrow: Day 2 - The Economics of LLM Fine-Tuning: ROI Calculator and Use Cases