
10 Questions That Determine Your LLM Stack: Decision Framework

A practical framework for LLM deployment decisions. Latency, volume, privacy, ops, budget, flexibility—and why most production stacks go hybrid.

AI · Infrastructure · Decision Framework

Most teams frame the LLM deployment question wrong.

It's not "self-host vs API." It's not about ideology or cost-cutting theater. The real question is: what does your workload actually need?

This framework helps you answer that systematically.

The Deployment Spectrum

The choice isn't binary. There's a spectrum of options, each with different tradeoffs:

  • Managed APIs (OpenAI, Anthropic, Google) — lowest ops burden, highest per-token cost
  • Serverless inference (Together AI, Replicate, Fireworks) — pay-per-use, less vendor lock-in
  • Dedicated cloud GPU (RunPod, Lambda Labs, cloud instances) — predictable cost, you manage the stack
  • On-prem self-hosted — full control, full responsibility
  • Hybrid combinations — route by task type, sensitivity, or complexity

Most production systems end up somewhere in the middle. Before picking, understand your workload.

Ten Questions That Determine Your Stack

1. What's Your Latency Tolerance?

  • Real-time (chat, voice): Often targets < 500ms time-to-first-token
  • Interactive (search, suggestions): 1-3 seconds is usually acceptable
  • Batch (analysis, processing): Minutes or hours is fine

Real-time use cases push you toward dedicated instances or local inference. Batch workloads open up cheaper options—more on that in questions 2 and 10.

2. What's Your Volume Pattern?

  • Steady load: Dedicated instances often make sense economically
  • Spiky/unpredictable: Serverless or API handles bursts without idle cost
  • Batch jobs: Spot/preemptible instances with job queues can dramatically cut costs

For batch workloads that can tolerate interruption, spot instances on major clouds or preemptible GPU providers offer significant savings. Build in retry logic and checkpoint your progress.
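To make the checkpoint-and-retry idea concrete, here's a minimal sketch of an interruptible batch loop. The `infer` callable, the JSON checkpoint file, and the output layout are illustrative assumptions, not a prescribed design:

```python
# Minimal sketch of a checkpointed batch loop for interruptible (spot) hardware.
# The `infer` callable and the JSON checkpoint file are illustrative assumptions.
import json
from pathlib import Path
from typing import Callable

CHECKPOINT = Path("progress.json")
OUTPUT_DIR = Path("out")

def load_done() -> set:
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def process_batch(items: dict, infer: Callable[[str], dict], max_retries: int = 3) -> None:
    OUTPUT_DIR.mkdir(exist_ok=True)
    done = load_done()
    for item_id, payload in items.items():
        if item_id in done:
            continue  # already finished before the last interruption
        for attempt in range(max_retries):
            try:
                result = infer(payload)  # your model call goes here
                (OUTPUT_DIR / f"{item_id}.json").write_text(json.dumps(result))
                done.add(item_id)
                CHECKPOINT.write_text(json.dumps(sorted(done)))  # checkpoint after every item
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # persistent failure: surface it rather than loop forever
```

When the spot instance is reclaimed, restarting the job skips everything already checkpointed and resumes where it left off.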

3. Do You Actually Need a Frontier Model?

Not every task needs GPT-4 or Claude. Here are starting points—but run evals on your actual data before committing:

  • Classification: 7-8B (often sufficient with good prompting)
  • Summarization: 8-70B (quality varies by domain complexity)
  • Complex reasoning: 70B+ or a frontier API (edge cases matter here)
  • Structured extraction: 7-13B (schema validation catches most issues)

The right model depends on your acceptance criteria. A 7B model might hit 92% accuracy on your task—is that enough, or do you need 98%? Only evals on your data can answer that.
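"Run evals on your data" can start very small: score each candidate against the same labeled examples and compare. A minimal sketch follows; the `call_model` callable and the exact-match metric are assumptions, so substitute your own client and whatever metric matches your acceptance criteria.

```python
# Minimal accuracy eval: run a candidate model over labeled examples and report
# how often it matches the expected output. `call_model` is a placeholder for
# your actual client (API or self-hosted endpoint).
from typing import Callable

def evaluate(call_model: Callable[[str], str], examples: list) -> float:
    """examples: [{"input": "...", "expected": "..."}, ...]"""
    correct = 0
    for ex in examples:
        prediction = call_model(ex["input"]).strip().lower()
        if prediction == ex["expected"].strip().lower():
            correct += 1
    return correct / len(examples)

# Compare candidates on the same labeled set before committing:
# acc_small = evaluate(call_small_model, labeled_examples)
# acc_frontier = evaluate(call_frontier_model, labeled_examples)
```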

4. What Are Your Privacy and Compliance Constraints?

  • No constraints: APIs are fine
  • Data can't leave your cloud: Self-host in your VPC
  • Data can't leave your premises: On-prem only
  • Regulated industry (healthcare, finance): May need audit trails, specific certifications, or data residency guarantees

Privacy requirements often override cost considerations entirely.

5. What's Your Team's Capacity and Appetite for ML Ops?

This is two questions in one:

Capability: Do you have engineers who can set up and maintain GPU infrastructure, model serving, and inference optimization? If not, can you hire them? Self-hosting requires skills that many teams don't have in-house.

Appetite: Even if you can do it, do you want to? Self-hosting means:

  • GPU monitoring and scaling
  • Model updates and version management
  • Security patches and dependency maintenance
  • On-call for inference failures

Time to market matters too. APIs get you to production in days. Self-hosting takes weeks to set up properly—longer if you're learning as you go.

If your team is already stretched thin or lacks ML infrastructure experience, the "savings" from self-hosting can evaporate in ops overhead and delayed launches.

6. What's Your Reliability and Dependency Risk?

This one gets overlooked:

  • Rate limits: Can you handle API throttling during traffic spikes?
  • Outages: What happens when your provider goes down? (It happens.)
  • Vendor changes: Pricing changes, model deprecations, API breaking changes
  • Fallback strategy: Do you have one?

Self-hosting gives you control but shifts reliability risk to your infrastructure. APIs give you someone else's SLA but add dependency risk. Hybrid approaches can mitigate both.
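One cheap way to blunt provider risk is a fallback chain: try the primary backend, fall through to alternates on errors or timeouts. The sketch below is illustrative; the provider callables and their names are placeholders, not real client code.

```python
# Minimal sketch of a provider fallback chain. Each provider is a (name, callable)
# pair supplied by you; any exception (rate limit, outage, timeout) triggers the
# next backend in the list.
import logging
from typing import Callable

def complete_with_fallback(prompt: str, providers: list) -> str:
    last_error = None
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as err:  # rate limit, outage, timeout, ...
            logging.warning("provider %s failed: %s", name, err)
            last_error = err
    raise RuntimeError("all providers failed") from last_error

# Example wiring (names are illustrative):
# complete_with_fallback(prompt, [("primary_api", call_primary),
#                                 ("self_hosted", call_local)])
```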

7. Can You Specialize Instead of Scaling Up?

Before reaching for a bigger model, consider making a smaller one work harder:

  • Better prompting + validation: Schema enforcement, retries, post-processing (see the sketch below). Often enough.
  • RAG: When the issue is knowledge, not reasoning. Ground the model in your docs.
  • Fine-tuning (LoRA/QLoRA): For narrow, high-volume tasks. A fine-tuned 7B can match or outperform frontier models on well-defined tasks (measured by your evals).
  • Distillation: Use frontier models to label training data, then run a smaller model in production.

Decision rule: Narrow task + high volume → specialize. Open-ended reasoning → frontier fallback.
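To make the first option above concrete, here's a minimal sketch of schema enforcement with retries and a frontier fallback. The schema, the `small`/`frontier` callables, and the retry count are all illustrative assumptions.

```python
# Minimal sketch of schema-enforced extraction: validate the small model's JSON
# output, retry on failure, and escalate to a frontier model only when retries
# are exhausted. Uses only the standard library; the client callables are yours.
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"invoice_number", "total", "currency"}  # hypothetical schema

def validate(raw: str) -> Optional[dict]:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if REQUIRED_KEYS.issubset(data) else None

def extract(text: str,
            small: Callable[[str], str],
            frontier: Callable[[str], str],
            max_retries: int = 2) -> dict:
    prompt = f"Extract {sorted(REQUIRED_KEYS)} as JSON from:\n{text}"
    for _ in range(max_retries):
        result = validate(small(prompt))
        if result is not None:
            return result                      # small model passed validation
    return validate(frontier(prompt)) or {}    # escalate only on repeated failure
```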

8. Is Your Workload Standard Text, or Something Else?

Text-to-text LLMs are the default assumption, but other modalities change the calculus:

  • Vision/image understanding: API availability varies; some tasks require self-hosting
  • Image generation: GPU-intensive; dedicated instances are often more economical at scale
  • Video generation: High compute requirements; API options are limited and expensive
  • 3D generation: Emerging area; mostly requires self-hosting specialized models
  • Multimodal pipelines: May need to mix providers or self-host components

For non-text workloads, availability, GPU footprint, cost structure, and licensing can dominate the decision. Some modalities simply don't have good API options yet, forcing self-hosting regardless of your preferences.

Start here: Check API availability for your modality first—that often decides for you.

9. How Locked In Can You Afford to Be?

This affects teams at different stages differently:

Early stage / still experimenting:

  • You don't know what model works best yet
  • APIs let you swap models in minutes (change one API call)
  • Self-hosting locks you into your choice—migration is painful
  • Stay flexible until you've validated

Validated and scaling:

  • You know what works; now you're optimizing
  • Self-hosting makes more sense when the model is stable
  • Lock-in is acceptable because you're not changing often

Need to A/B test frequently:

  • APIs or serverless make experimentation cheap
  • Self-hosting multiple models for A/B tests is expensive

Decision rule:

  • High uncertainty about model choice → stay on APIs
  • Stable, validated workload → self-hosting is viable
  • Frequent experimentation needs → prioritize flexibility

10. What Happens to Your Costs at 10x Scale?

Before committing to a stack, project your costs forward. What looks affordable at current volume can become unsustainable at scale.

The audit process:

  1. Project API spend at 10x and 100x current volume. If you're at $500/month now, what happens at $5k or $50k?

  2. Identify which requests MUST be real-time. User-facing chat? Probably yes. Batch document processing? Probably no.

  3. Segment your workload. Create two paths: fast/expensive (API, real-time) and slow/cheap (queued, batched, self-hosted).

Options for the slow/cheap path:

  • Queue + batch processing on modest hardware
  • Spot/preemptible GPUs for async workloads
  • Client-side inference where feasible (e.g., Whisper STT or Kokoro TTS in-browser—zero server cost)

Example: 1,000 document extractions don't need real-time API calls. Queue them, process overnight on a $0.40/hr GPU, get results by morning. Same output, fraction of the cost.

Key question: Which parts of your workload can you slow down to stay within budget?

For startups with limited runway, hitting API cost ceilings can be existential. Model your costs forward before committing.
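A back-of-the-envelope projection can be a few lines of code. In the sketch below, every number (prices, tokens per request, GPU throughput, current volume) is an illustrative assumption; plug in your own figures. It also ignores the ops overhead from question 5, which belongs in the real comparison.

```python
# Small sketch of projecting API spend forward and comparing it to a dedicated
# GPU for the batchable share of the workload. All numbers are assumptions.
API_COST_PER_1K_TOKENS = 0.002      # assumed blended API price
TOKENS_PER_REQUEST = 1_500          # assumed average tokens per request
GPU_HOURLY = 0.40                   # e.g. a spot/preemptible instance
GPU_THROUGHPUT_PER_HOUR = 2_000     # requests/hour measured on that GPU

def monthly_api_cost(requests_per_month: int) -> float:
    return requests_per_month * TOKENS_PER_REQUEST / 1_000 * API_COST_PER_1K_TOKENS

def monthly_gpu_cost(requests_per_month: int) -> float:
    return requests_per_month / GPU_THROUGHPUT_PER_HOUR * GPU_HOURLY

for scale in (1, 10, 100):
    volume = 100_000 * scale        # current volume assumed at 100k requests/month
    print(f"{volume:>12,} req/mo   API ${monthly_api_cost(volume):>8,.0f}"
          f"   GPU ${monthly_gpu_cost(volume):>8,.0f}")
```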

Why Most Production Stacks Go Hybrid

After working through these questions, most teams land on hybrid: route different requests to different backends based on the tradeoffs.

Why routing exists:

  • Cost: Use expensive models only when cheaper ones fail
  • Latency: Fast local models for simple requests, APIs for complex ones
  • Privacy: Sensitive data stays on your infrastructure
  • Resilience: Fallbacks when any single provider fails

The proportion handled locally vs. via API varies widely—there's no universal ratio. It depends on your task distribution, acceptance criteria, and how well your models are tuned.
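At its simplest, routing is just a function from request attributes to a backend. The sketch below is illustrative only; the backend names and the complexity heuristic are assumptions, not a recommended policy.

```python
# Minimal sketch of the routing idea: pick a backend per request based on
# sensitivity, task type, and a rough complexity signal.
def route(request: dict) -> str:
    if request.get("contains_pii"):
        return "self_hosted"        # privacy: sensitive data stays local
    if request.get("task") in {"classification", "extraction"}:
        return "small_local_model"  # cheap, low-latency path for narrow tasks
    if len(request.get("prompt", "")) > 8_000:
        return "frontier_api"       # long/complex inputs go to the big model
    return "serverless"             # default path for everything else
```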

Implementation details (routing rules, confidence thresholds, fallback strategies) deserve their own deep-dive—that's a separate article.

Decision Matrix

Your situation → starting recommendation:

  • < $1k/month API spend, no privacy needs → Stay on APIs; ops cost likely exceeds savings
  • High volume, simple/repetitive tasks → Self-host a small model, validate with evals
  • High volume, narrow task → Fine-tune or distill a small model
  • Mixed complexity across requests → Hybrid routing with confidence-based escalation
  • Strict privacy or compliance requirements → Self-host or private cloud deployment
  • Unpredictable volume, low ops capacity → Serverless or API with usage caps
  • Mission-critical, need fallbacks → Hybrid with multi-provider redundancy
  • Non-text workloads (vision/video/3D) → Evaluate API availability first; self-host if needed
  • Batch processing, cost-sensitive → Spot/preemptible GPUs with job queues
  • Early stage, still validating → APIs for flexibility; don't over-invest yet
  • Approaching API budget ceiling → Model the breakeven; self-host if the math works

What This Framework Doesn't Give You

This helps you think about the decision. It doesn't provide:

  • Exact cost calculations for your specific volume and task mix
  • Architecture design for your infrastructure
  • Implementation plan or timeline

Those require looking at your actual workload, running evals, and modeling costs against your projected usage.

When in doubt: Start with APIs. They're faster to ship, easier to change, and you'll learn what you actually need. Optimize once you have real usage data.


Have questions about your LLM deployment strategy? For quick questions, connect on LinkedIn. For consulting inquiries, get in touch.