In many companies, generative AI is no longer a pilot: it’s a daily tool. The next leap isn’t “using more AI,” it’s running it with discipline and bringing it closer to the data. That means moving part of the brain in-house: a private AI Gateway (in the spirit of PrivateGPT), with RAG over internal sources, auditability, DLP, SSO/RBAC, and no prompt or embedding leakage to third parties.
This article is a technical/operational “runbook”: reference architecture, security, sizing, observability, CI/CD for prompts and RAG, and a 12-week plan to move from ad-hoc SaaS to governed AI infrastructure.
Why a private gateway?
- Compliance & confidentiality: prompts, attachments, and outputs that must not leave your domain.
- Latency & continuity: close to the data, no egress chokepoints or vendor windows.
- Predictable cost: intensive workloads on bare-metal GPUs or private cloud stabilize OPEX.
- Governance: logging of who/what/when/which source; ingress/egress policies; guardrails.
- Specific advantage: your corpus and processes, not the generic model, make the difference.
Reference architecture (optionally with PrivateGPT as gateway)
Access plane (Control Plane)
- HTTP/WS Gateway with SSO (OIDC/SAML), RBAC, rate limiting, quotas, and DLP/PII policies.
- Policy engine (e.g., OPA/Rego) to decide routing (local model vs. external), what goes in/out, and who sees what.
- Audit: signed traces and events (see schema below).
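The routing decision the policy engine makes can be sketched in a few lines — a minimal Python stand-in for what would be OPA/Rego rules in production; the department names and DLP flag labels are hypothetical:

```python
# Minimal routing-policy sketch. In production this logic lives in
# OPA/Rego behind the gateway, not in application code; the flag
# labels and department allow-list below are illustrative only.

def route_request(user_claims: dict, dlp_flags: set, target_model: str) -> dict:
    """Decide local vs. external routing and record why."""
    # Any PII/secret signal forces the request onto local serving.
    if dlp_flags & {"pii", "secret", "confidential"}:
        return {"route": "local", "reason": "dlp-flagged", "model": "local-instruct"}
    # External providers only for allow-listed departments (default-deny egress).
    if user_claims.get("dept") in {"marketing", "support"}:
        return {"route": "external", "reason": "allow-listed dept", "model": target_model}
    return {"route": "local", "reason": "default-deny egress", "model": "local-instruct"}
```

The decision dict feeds both the router and the audit trail, so every response can later show why it stayed in-house or went out.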
Data plane (RAG)
- Connectors to DMS, wiki, CRM/ERP, tickets, shares; ETL → normalization (PDF→text, OCR, cleaning) → chunking → embeddings → vector store.
- Vector DB (pgvector/Qdrant/Weaviate/Milvus): namespaces per domain, TTL per collection, at-rest encryption.
- DLP/PII scanner at ingestion (signals feeding guardrails).
Model serving plane
- LLM host (vLLM/Ollama or commercial orchestrator) for local models (instruct/functions) and routing to external providers where appropriate.
- Dedicated embeddings service for throughput and caching.
- Output filters (PII/secrets, toxicity, jailbreak).
Network & security
- TLS externally; mTLS internally among gateway ↔ RAG ↔ models ↔ vector DB.
- KMS/HSM for key management, envelope encryption for sensitive logs.
- Zones: DMZ (gateway), app-net (services), data-net (DBs).
- Egress control: external destinations allow-listed only; no default route to the internet from RAG pods.
RAG that doesn’t slow to a crawl
1) Chunking & metadata
- General: chunk_size 800–1200 tokens, overlap 100–200.
- Long docs (contracts/manuals): semantic sub-chunking (titles/sections).
- Mandatory metadata: source, version, doc_id, section, acl tags, language, timestamp, content hash.
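A fixed-size chunker with overlap and the mandatory metadata can be sketched as follows (the metadata keys mirror the list above; the step arithmetic is the standard size-minus-overlap sliding window):

```python
import hashlib

def chunk(tokens: list, doc_meta: dict, size: int = 1000, overlap: int = 150) -> list:
    """Fixed-size chunking with overlap; attaches mandatory metadata per chunk."""
    chunks, start, idx = [], 0, 0
    while start < len(tokens):
        text = " ".join(tokens[start:start + size])
        chunks.append({
            **doc_meta,  # source, version, doc_id, acl tags, language, timestamp, ...
            "section": idx,
            "content_hash": hashlib.sha256(text.encode()).hexdigest(),
            "text": text,
        })
        idx += 1
        start += size - overlap  # slide forward, keeping `overlap` tokens of context
    return chunks
```

Semantic sub-chunking for long documents replaces the fixed window with section boundaries, but keeps the same metadata contract.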
2) Embeddings
- 384–1024 dims depending on model; normalize and cache (Redis/LFU).
- Rule of thumb: 1M chunks ≈ 1–2 GB embeddings (dim=384) / 4–8 GB (dim=1024).
- Avoid re-embedding on small edits; version the corpus.
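Keying the cache on the content hash gives both normalization and the "avoid re-embedding" rule for free — an unchanged chunk never hits the model twice. A minimal sketch (an in-memory dict stands in for Redis):

```python
import hashlib
import math

_cache = {}  # stand-in for Redis/LFU

def embed_cached(text: str, embed_fn) -> list:
    """Unit-normalize and cache embeddings keyed by content hash."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in _cache:
        return _cache[key]          # unchanged content: no model call
    vec = embed_fn(text)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    vec = [x / norm for x in vec]   # unit vectors make cosine = dot product
    _cache[key] = vec
    return vec
```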
3) Retrieval
- Hybrid (BM25 + vector), k = 20; light re-ranker → 6–10 passages to the prompt.
- ACL-aware retrieval: filter by acl tags before searching (don’t rely on the LLM for access control).
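The two rules above — blend BM25 with vector similarity, and apply ACL filtering before any scoring — can be sketched as plain Python (a real deployment pushes the ACL filter down into the vector DB query; the `alpha` blend weight is an assumption to tune):

```python
def hybrid_retrieve(query_vec, bm25_scores, passages, user_acls, k=20, alpha=0.5):
    """ACL-aware hybrid retrieval: filter by acl tags first, then blend
    BM25 and cosine scores; a re-ranker would trim the top-k to 6-10."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb or 1.0)

    # Access control happens HERE, never in the LLM.
    allowed = [p for p in passages if p["acl"] & user_acls]
    scored = [
        (alpha * bm25_scores[p["id"]] + (1 - alpha) * cosine(query_vec, p["vec"]), p)
        for p in allowed
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored[:k]]
```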
4) Prompting
- Versioned templates; inject sources with internal/verifiable URLs; push answers with citations.
- Stop sequences and max tokens per domain.
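A versioned, citation-forcing template might look like this (the template id scheme and wording are hypothetical; the point is that the template lives in a repo under a version, not in someone's head):

```python
PROMPT_VERSION = "support/answer-with-citations@1.3.0"  # hypothetical template id

TEMPLATE = """You are an internal assistant. Answer ONLY from the sources.
Cite each claim as [n] and list the internal URLs at the end.

Sources:
{sources}

Question: {question}
Answer (say "not in the sources" if unsupported):"""

def build_prompt(question: str, passages: list) -> str:
    """Inject numbered, internally-verifiable sources into the template."""
    sources = "\n".join(
        f"[{i + 1}] {p['url']} — {p['text']}" for i, p in enumerate(passages)
    )
    return TEMPLATE.format(sources=sources, question=question)
```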
Sizing & performance: quick rules
Latency targets (p95)
- “Fast & useful” policy: embedding search ≤ 80 ms, re-rank ≤ 70 ms, first token ≤ 400 ms.
- TTFT rises with model size; use a model mix: small instruct (fast reply) + fallback to larger.
VRAM (coarse estimate)
- FP16: ≈ 2 × params (GB); INT8: ≈ 1 × params; INT4: ≈ 0.6 × params.
- A 7B INT4 fits in ~4–5 GB; a 13B INT4 in ~8–10 GB; 70B needs multi-GPU or CPU+disk (slow).
- Concurrency ≈ min(threads_serving, VRAM / per-request working set).
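The rules of thumb above as a back-of-the-envelope calculator (weights only — KV cache and activations come on top, so treat the numbers as a floor):

```python
def vram_estimate_gb(params_b: float, quant: str) -> float:
    """Coarse VRAM estimate for model weights (KV cache/activations extra)."""
    factor = {"fp16": 2.0, "int8": 1.0, "int4": 0.6}[quant]
    return params_b * factor

def max_concurrency(serving_threads: int, vram_gb: float, per_request_gb: float) -> int:
    """Concurrency bounded by serving threads or per-request working set."""
    return min(serving_threads, int(vram_gb // per_request_gb))

# 7B INT4 -> ~4.2 GB of weights, consistent with the ~4-5 GB rule above.
```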
Vector DB
- 1M vectors (dim=768) ≈ 3–5 GB; add 30–50% for indexes/metadata.
- IOPS matters: local NVMe or flash pool for hot collections.
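The footprint rule follows directly from float32 storage (4 bytes per dimension) plus index/metadata overhead — a quick sanity check:

```python
def vector_store_gb(n_vectors: int, dim: int, index_overhead: float = 0.4) -> float:
    """Raw float32 vectors plus the 30-50% index/metadata overhead."""
    raw = n_vectors * dim * 4 / 1024**3  # 4 bytes per float32 dimension
    return raw * (1 + index_overhead)

# 1M vectors at dim=768 -> ~2.9 GB raw, ~4 GB with overhead,
# which lands inside the 3-5 GB rule of thumb above.
```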
Cost
- Compare €/k tokens for internal vs. external serving, and add egress costs. With heavy workloads, the break-even for local serving comes quickly.
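A simple break-even sketch (all figures hypothetical — plug in your own GPU capex, power/ops costs, and provider pricing):

```python
def breakeven_months(gpu_capex_eur: float, local_opex_month_eur: float,
                     ext_price_per_ktok_eur: float, monthly_ktokens: float) -> float:
    """Months until local serving beats the external API; inf if it never does."""
    ext_month = ext_price_per_ktok_eur * monthly_ktokens
    saving = ext_month - local_opex_month_eur
    return float("inf") if saving <= 0 else gpu_capex_eur / saving

# Hypothetical: €15k GPU server, €800/month opex, €0.002/k tokens external,
# 2M k-tokens/month -> break-even in under 5 months.
```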
Security & compliance (to pass audits without sweating)
- PII/Secrets shift-left: scan on ingestion and response, with masking and policy-deny.
- Retention: TTL for prompts, events, and cache; verified deletion (hash after delete).
- Consent & purpose: tag datasets with purpose (accounting, support, legal); avoid cross-reuse.
- Identity: SSO with claims: dept, role, region. RAG ACLs consume those claims.
- Traceability: each response stores req_id, user, sources[], model, hash_input/output, policy_decisions.
- Backups/DR: vector DB backed up (snapshots + WAL/raft logs); runbook to rebuild indexes from sources if needed (and verify hashes).
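A signed audit event with exactly those fields can be sketched as follows — prompts and outputs are stored as hashes (no raw PII in the log), and the HMAC key would come from the KMS/HSM, not a constant:

```python
import hashlib
import hmac
import json

AUDIT_KEY = b"rotate-me-via-kms"  # placeholder: fetch from KMS/HSM in production

def audit_event(req_id, user, sources, model, prompt, output, decisions):
    """Build a tamper-evident audit record; content is hashed, not stored raw."""
    event = {
        "req_id": req_id,
        "user": user,
        "sources": sources,
        "model": model,
        "hash_input": hashlib.sha256(prompt.encode()).hexdigest(),
        "hash_output": hashlib.sha256(output.encode()).hexdigest(),
        "policy_decisions": decisions,
    }
    payload = json.dumps(event, sort_keys=True).encode()  # canonical form
    event["signature"] = hmac.new(AUDIT_KEY, payload, hashlib.sha256).hexdigest()
    return event
```

Verification recomputes the HMAC over the canonical JSON without the signature field; any edit to the record breaks it.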
Observability & audit
- Metrics: QPS, TTFT, tokens/s, cache hit-ratio, latencies (p50/p95/p99), retrieval errors, timeouts.
- Traces (OpenTelemetry): spans for gateway → RAG → search → re-rank → LLM.
- Logs: stable JSON schema and signature for critical events; PII minimization (hash/salt).
- Dashboards: per-route latency, embedding drift (quality collapses), cost by use case.
AI DevOps: CI/CD for prompts and RAG
- Prompts as code: repos, branching, review, canaries, rollbacks.
- Golden sets: fixtures of Q&A per domain (legal, procurement, support).
- Evaluation: automate faithfulness, groundedness, toxicity, jailbreak, and factual accuracy in QA.
- Red-teaming: prompt injection, data exfiltration, role overwrite, tool abuse; block & alert.
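A tiny golden-set gate for CI might look like this — a minimal groundedness check (required fact present, at least one citation), assuming `answer_fn` returns a dict with `text` and `sources`; real pipelines layer faithfulness, toxicity, and jailbreak suites on top:

```python
def evaluate_golden_set(answer_fn, fixtures):
    """Run Q&A fixtures through the pipeline; return a list of failures.
    An empty list means the gate passes and the prompt change can ship."""
    failures = []
    for fx in fixtures:
        ans = answer_fn(fx["question"])
        if fx["must_contain"].lower() not in ans["text"].lower():
            failures.append((fx["question"], "missing fact"))
        if not ans["sources"]:
            failures.append((fx["question"], "no citation"))
    return failures
```

Wire it into CI so a prompt or retrieval change that regresses any fixture blocks the merge — that is what turns "magic in prod" into reviewable diffs.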
(Illustrative) docker-compose for a minimal stack
services:
  gateway:
    image: yourorg/ai-gateway:latest
    env_file: gateway.env
    ports: ["443:8443"]
    depends_on: [retriever, vllm]
    # Enable TLS and internal mTLS; OIDC for SSO
  retriever:
    image: yourorg/rag-service:latest
    environment:
      VECTOR_URL: http://qdrant:6333
      EMBED_URL: http://embed:8080
    depends_on: [qdrant, embed]
  embed:
    image: yourorg/embeddings:latest
    environment:
      MODEL: "bge-base-en-v1.5"
      CACHE_URL: "redis://cache:6379/0"
  vllm:
    image: vllm/vllm-openai:latest
    environment:
      MODEL: "llama-3-8b-instruct-q4"
      VLLM_WORKERS: "2"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  qdrant:
    image: qdrant/qdrant:latest
    volumes: ["./qdrant:/qdrant/storage"]
  cache:
    image: redis:7-alpine
Replace images with your provider/stack. In production: TLS, mTLS, health checks, resource limits, node affinity, backups, and keep secrets out of compose.
12-week plan
Weeks 1–2 — Discovery & risk control
- Inventory use cases and sensitive data.
- Quick wins that don’t need regulated data (summaries, internal playbooks).
Weeks 3–5 — Gateway MVP
- Deploy gateway with SSO/RBAC, mTLS, and audit.
- Minimal RAG: connectors to shares/wiki, vector DB, embeddings, basic templates.
- Ingress/egress DLP/PII.
Weeks 6–8 — Calibration & SLAs
- Define SLOs (p95 TTFT, p95 end-to-end, availability).
- Load tests: target QPS, concurrency, VRAM and vector DB saturation.
- Tune chunking, re-rank, and guardrails.
Weeks 9–10 — CI/CD & security
- Prompts as code, golden sets, automated evaluation.
- Red-team for prompt injection and exfiltration.
- Backups, DR, and runbooks.
Weeks 11–12 — Industrialization
- Open to more areas with role-based playbooks.
- Costing: €/k internal vs external tokens; routing by use case.
- Review with Legal/Compliance and final hardening.
Anti-patterns you’ll meet (and how to avoid them)
- “All to SaaS” with sensitive data → bimodal pattern; critical flows go through your gateway.
- RAG without ACLs → filter by claims/ACL tags before search.
- Re-embedding on every minor change → version and cache.
- Verbose logs with PII → minimize & hash + short retention.
- No golden sets or CI/CD for prompts → silent regressions and “magic” in prod.
Conclusion
We’ve democratized AI; now we must operate it as a critical platform. A private AI Gateway—in the spirit of PrivateGPT—delivers what admins value: security, control, performance, and traceability. The bimodal model (SaaS when it fits, private when it hurts) balances productivity with data sovereignty. Metrics, SLOs, guardrails, and CI/CD do the rest.
Final checklist: SSO/RBAC ✅ mTLS/TLS ✅ DLP/PII ✅ ACL-aware RAG ✅ Signed audit ✅ SLOs/observability ✅ Prompt CI/CD ✅ Red-teaming ✅ Backups/DR ✅
With that in place, AI stops being an experiment and becomes operable infrastructure—and, like any other, it pays off when you keep it under your control.
Sources: Wharton and artificial intelligence news.
