
10 Questions That Determine Your LLM Stack: Decision Framework

A practical framework for LLM deployment decisions. Latency, volume, privacy, ops, budget, flexibility—and why most production stacks go hybrid.

AI · Infrastructure · Decision Framework

Most teams frame the LLM deployment question wrong.

It's not "self-host vs API." It's not about ideology or cost-cutting theater. The real question is: what does your workload actually need?

This framework helps you answer that systematically.

The Deployment Spectrum

The choice isn't binary. There's a spectrum of options, each with different tradeoffs:

  • Managed APIs (OpenAI, Anthropic, Google) — lowest ops burden, highest per-token cost
  • Serverless inference (Together AI, Replicate, Fireworks) — pay-per-use, less vendor lock-in
  • Dedicated cloud GPU (RunPod, Lambda Labs, cloud instances) — predictable cost, you manage the stack
  • On-prem self-hosted — full control, full responsibility
  • Hybrid combinations — route by task type, sensitivity, or complexity

Most production systems end up somewhere in the middle. Before picking, understand your workload.

Ten Questions That Determine Your Stack

1. What's Your Latency Tolerance?

  • Real-time (chat, voice): Often targets < 500ms time-to-first-token
  • Interactive (search, suggestions): 1-3 seconds is usually acceptable
  • Batch (analysis, processing): Minutes or hours is fine

Real-time use cases push you toward dedicated instances or local inference. Batch workloads open up cheaper options—more on that in questions 2 and 10.

2. What's Your Volume Pattern?

  • Steady load: Dedicated instances often make sense economically
  • Spiky/unpredictable: Serverless or API handles bursts without idle cost
  • Batch jobs: Spot/preemptible instances with job queues can dramatically cut costs

For batch workloads that can tolerate interruption, spot instances on major clouds or preemptible GPU providers offer significant savings. Build in retry logic and checkpoint your progress.
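To make the checkpoint-and-retry idea concrete, here's a minimal sketch of an interruptible batch loop. The `infer` callable, the JSON checkpoint file, and the output layout are illustrative assumptions, not a prescribed design:

```python
# Minimal sketch of a checkpointed batch loop for interruptible (spot) hardware.
# The `infer` callable and the JSON checkpoint file are illustrative assumptions.
import json
from pathlib import Path
from typing import Callable

CHECKPOINT = Path("progress.json")
OUTPUT_DIR = Path("out")

def load_done() -> set:
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def process_batch(items: dict, infer: Callable[[str], dict], max_retries: int = 3) -> None:
    OUTPUT_DIR.mkdir(exist_ok=True)
    done = load_done()
    for item_id, payload in items.items():
        if item_id in done:
            continue  # already finished before the last interruption
        for attempt in range(max_retries):
            try:
                result = infer(payload)  # your model call goes here
                (OUTPUT_DIR / f"{item_id}.json").write_text(json.dumps(result))
                done.add(item_id)
                CHECKPOINT.write_text(json.dumps(sorted(done)))  # checkpoint after every item
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # persistent failure: surface it rather than loop forever
```

When the spot instance is reclaimed, restarting the job skips everything already checkpointed and resumes where it left off.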

3. Do You Actually Need a Frontier Model?

Not every task needs GPT-4 or Claude. Here are starting points—but run evals on your actual data before committing:

  • Classification: 7-8B (often sufficient with good prompting)
  • Summarization: 8-70B (quality varies by domain complexity)
  • Complex reasoning: 70B+ or a frontier API (edge cases matter here)
  • Structured extraction: 7-13B (schema validation catches most issues)

The right model depends on your acceptance criteria. A 7B model might hit 92% accuracy on your task—is that enough, or do you need 98%? Only evals on your data can answer that.
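"Run evals on your data" can start very small: score each candidate against the same labeled examples and compare. A minimal sketch follows; the `call_model` callable and the exact-match metric are assumptions, so substitute your own client and whatever metric matches your acceptance criteria.

```python
# Minimal accuracy eval: run a candidate model over labeled examples and report
# how often it matches the expected output. `call_model` is a placeholder for
# your actual client (API or self-hosted endpoint).
from typing import Callable

def evaluate(call_model: Callable[[str], str], examples: list) -> float:
    """examples: [{"input": "...", "expected": "..."}, ...]"""
    correct = 0
    for ex in examples:
        prediction = call_model(ex["input"]).strip().lower()
        if prediction == ex["expected"].strip().lower():
            correct += 1
    return correct / len(examples)

# Compare candidates on the same labeled set before committing:
# acc_small = evaluate(call_small_model, labeled_examples)
# acc_frontier = evaluate(call_frontier_model, labeled_examples)
```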

4. What Are Your Privacy and Compliance Constraints?

  • No constraints: APIs are fine
  • Data can't leave your cloud: Self-host in your VPC
  • Data can't leave your premises: On-prem only
  • Regulated industry (healthcare, finance): May need audit trails, specific certifications, or data residency guarantees

Privacy requirements often override cost considerations entirely.

5. What's Your Team's Capacity and Appetite for ML Ops?

This is two questions in one:

Capability: Do you have engineers who can set up and maintain GPU infrastructure, model serving, and inference optimization? If not, can you hire them? Self-hosting requires skills that many teams don't have in-house.

Appetite: Even if you can do it, do you want to? Self-hosting means:

  • GPU monitoring and scaling
  • Model updates and version management
  • Security patches and dependency maintenance
  • On-call for inference failures

Time to market matters too. APIs get you to production in days. Self-hosting takes weeks to set up properly—longer if you're learning as you go.

If your team is already stretched thin or lacks ML infrastructure experience, the "savings" from self-hosting can evaporate in ops overhead and delayed launches.

6. What's Your Reliability and Dependency Risk?

This one gets overlooked:

  • Rate limits: Can you handle API throttling during traffic spikes?
  • Outages: What happens when your provider goes down? (It happens.)
  • Vendor changes: Pricing changes, model deprecations, API breaking changes
  • Fallback strategy: Do you have one?

Self-hosting gives you control but shifts reliability risk to your infrastructure. APIs give you someone else's SLA but add dependency risk. Hybrid approaches can mitigate both.
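One cheap way to blunt provider risk is a fallback chain: try the primary backend, fall through to alternates on errors or timeouts. The sketch below is illustrative; the provider callables and their names are placeholders, not real client code.

```python
# Minimal sketch of a provider fallback chain. Each provider is a (name, callable)
# pair supplied by you; any exception (rate limit, outage, timeout) triggers the
# next backend in the list.
import logging
from typing import Callable

def complete_with_fallback(prompt: str, providers: list) -> str:
    last_error = None
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as err:  # rate limit, outage, timeout, ...
            logging.warning("provider %s failed: %s", name, err)
            last_error = err
    raise RuntimeError("all providers failed") from last_error

# Example wiring (names are illustrative):
# complete_with_fallback(prompt, [("primary_api", call_primary),
#                                 ("self_hosted", call_local)])
```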

7. Can You Specialize Instead of Scaling Up?

Before reaching for a bigger model, consider making a smaller one work harder:

  • Better prompting + validation: Schema enforcement, retries, post-processing (see the sketch below). Often enough.
  • RAG: When the issue is knowledge, not reasoning. Ground the model in your docs.
  • Fine-tuning (LoRA/QLoRA): For narrow, high-volume tasks. A fine-tuned 7B can match or outperform frontier models on well-defined tasks (measured by your evals).
  • Distillation: Use frontier models to label training data, then run a smaller model in production.

Decision rule: Narrow task + high volume → specialize. Open-ended reasoning → frontier fallback.
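To make the first option above concrete, here's a minimal sketch of schema enforcement with retries and a frontier fallback. The schema, the `small`/`frontier` callables, and the retry count are all illustrative assumptions.

```python
# Minimal sketch of schema-enforced extraction: validate the small model's JSON
# output, retry on failure, and escalate to a frontier model only when retries
# are exhausted. Uses only the standard library; the client callables are yours.
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"invoice_number", "total", "currency"}  # hypothetical schema

def validate(raw: str) -> Optional[dict]:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if REQUIRED_KEYS.issubset(data) else None

def extract(text: str,
            small: Callable[[str], str],
            frontier: Callable[[str], str],
            max_retries: int = 2) -> dict:
    prompt = f"Extract {sorted(REQUIRED_KEYS)} as JSON from:\n{text}"
    for _ in range(max_retries):
        result = validate(small(prompt))
        if result is not None:
            return result                      # small model passed validation
    return validate(frontier(prompt)) or {}    # escalate only on repeated failure
```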

8. Is Your Workload Standard Text, or Something Else?

Text-to-text LLMs are the default assumption, but other modalities change the calculus:

  • Vision/image understanding: API availability varies; some tasks require self-hosting
  • Image generation: GPU-intensive; dedicated instances are often more economical at scale
  • Video generation: High compute requirements; API options are limited and expensive
  • 3D generation: Emerging area; mostly requires self-hosting specialized models
  • Multimodal pipelines: May need to mix providers or self-host components

For non-text workloads, availability, GPU footprint, cost structure, and licensing can dominate the decision. Some modalities simply don't have good API options yet, forcing self-hosting regardless of your preferences.

Start here: Check API availability for your modality first—that often decides for you.

9. How Locked In Can You Afford to Be?

This affects teams at different stages differently:

Early stage / still experimenting:

  • You don't know what model works best yet
  • APIs let you swap models in minutes (change one API call)
  • Self-hosting locks you into your choice—migration is painful
  • Stay flexible until you've validated

Validated and scaling:

  • You know what works; now you're optimizing
  • Self-hosting makes more sense when the model is stable
  • Lock-in is acceptable because you're not changing often

Need to A/B test frequently:

  • APIs or serverless make experimentation cheap
  • Self-hosting multiple models for A/B tests is expensive

Decision rule:

  • High uncertainty about model choice → stay on APIs
  • Stable, validated workload → self-hosting is viable
  • Frequent experimentation needs → prioritize flexibility

10. What Happens to Your Costs at 10x Scale?

Before committing to a stack, project your costs forward. What looks affordable at current volume can become unsustainable at scale.

The audit process:

  1. Project API spend at 10x and 100x current volume. If you're at $500/month now, what happens at $5k or $50k?

  2. Identify which requests MUST be real-time. User-facing chat? Probably yes. Batch document processing? Probably no.

  3. Segment your workload. Create two paths: fast/expensive (API, real-time) and slow/cheap (queued, batched, self-hosted).

Options for the slow/cheap path:

  • Queue + batch processing on modest hardware
  • Spot/preemptible GPUs for async workloads
  • Client-side inference where feasible (e.g., Whisper STT or Kokoro TTS in-browser—zero server cost)

Example: 1,000 document extractions don't need real-time API calls. Queue them, process overnight on a $0.40/hr GPU, get results by morning. Same output, fraction of the cost.

Key question: Which parts of your workload can you slow down to stay within budget?

For startups with limited runway, hitting API cost ceilings can be existential. Model your costs forward before committing.
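A back-of-the-envelope projection can be a few lines of code. In the sketch below, every number (prices, tokens per request, GPU throughput, current volume) is an illustrative assumption; plug in your own figures. It also ignores the ops overhead from question 5, which belongs in the real comparison.

```python
# Small sketch of projecting API spend forward and comparing it to a dedicated
# GPU for the batchable share of the workload. All numbers are assumptions.
API_COST_PER_1K_TOKENS = 0.002      # assumed blended API price
TOKENS_PER_REQUEST = 1_500          # assumed average tokens per request
GPU_HOURLY = 0.40                   # e.g. a spot/preemptible instance
GPU_THROUGHPUT_PER_HOUR = 2_000     # requests/hour measured on that GPU

def monthly_api_cost(requests_per_month: int) -> float:
    return requests_per_month * TOKENS_PER_REQUEST / 1_000 * API_COST_PER_1K_TOKENS

def monthly_gpu_cost(requests_per_month: int) -> float:
    return requests_per_month / GPU_THROUGHPUT_PER_HOUR * GPU_HOURLY

for scale in (1, 10, 100):
    volume = 100_000 * scale        # current volume assumed at 100k requests/month
    print(f"{volume:>12,} req/mo   API ${monthly_api_cost(volume):>8,.0f}"
          f"   GPU ${monthly_gpu_cost(volume):>8,.0f}")
```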

Why Most Production Stacks Go Hybrid

After working through these questions, most teams land on hybrid: route different requests to different backends based on the tradeoffs.

Why routing exists:

  • Cost: Use expensive models only when cheaper ones fail
  • Latency: Fast local models for simple requests, APIs for complex ones
  • Privacy: Sensitive data stays on your infrastructure
  • Resilience: Fallbacks when any single provider fails

The proportion handled locally vs. via API varies widely—there's no universal ratio. It depends on your task distribution, acceptance criteria, and how well your models are tuned.
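At its simplest, routing is just a function from request attributes to a backend. The sketch below is illustrative only; the backend names and the complexity heuristic are assumptions, not a recommended policy.

```python
# Minimal sketch of the routing idea: pick a backend per request based on
# sensitivity, task type, and a rough complexity signal.
def route(request: dict) -> str:
    if request.get("contains_pii"):
        return "self_hosted"        # privacy: sensitive data stays local
    if request.get("task") in {"classification", "extraction"}:
        return "small_local_model"  # cheap, low-latency path for narrow tasks
    if len(request.get("prompt", "")) > 8_000:
        return "frontier_api"       # long/complex inputs go to the big model
    return "serverless"             # default path for everything else
```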

Implementation details (routing rules, confidence thresholds, fallback strategies) deserve their own deep-dive—that's a separate article.

Decision Matrix

Your situation → starting recommendation:

  • < $1k/month API spend, no privacy needs → Stay on APIs; ops cost likely exceeds savings
  • High volume, simple/repetitive tasks → Self-host a small model, validate with evals
  • High volume, narrow task → Fine-tune or distill a small model
  • Mixed complexity across requests → Hybrid routing with confidence-based escalation
  • Strict privacy or compliance requirements → Self-host or private cloud deployment
  • Unpredictable volume, low ops capacity → Serverless or API with usage caps
  • Mission-critical, need fallbacks → Hybrid with multi-provider redundancy
  • Non-text workloads (vision/video/3D) → Evaluate API availability first; self-host if needed
  • Batch processing, cost-sensitive → Spot/preemptible GPUs with job queues
  • Early stage, still validating → APIs for flexibility; don't over-invest yet
  • Approaching API budget ceiling → Model the breakeven; self-host if the math works

What This Framework Doesn't Give You

This helps you think about the decision. It doesn't provide:

  • Exact cost calculations for your specific volume and task mix
  • Architecture design for your infrastructure
  • Implementation plan or timeline

Those require looking at your actual workload, running evals, and modeling costs against your projected usage.

When in doubt: Start with APIs. They're faster to ship, easier to change, and you'll learn what you actually need. Optimize once you have real usage data.


Have questions about your LLM deployment strategy? For quick questions, connect on LinkedIn. For consulting inquiries, get in touch.