In many companies, generative AI is no longer a pilot: it’s a daily tool. The next leap isn’t “using more AI,” it’s running it with discipline and bringing it closer to the data. That means moving part of the brain in-house: a private AI Gateway (in the spirit of PrivateGPT), with RAG over internal sources, auditability, DLP, SSO/RBAC, and no prompt or embedding leakage to third parties.

This article is a technical/operational “runbook”: reference architecture, security, sizing, observability, CI/CD for prompts and RAG, and a 12-week plan to move from ad-hoc SaaS to governed AI infrastructure.


Why a private gateway?

  • Compliance & confidentiality: prompts, attachments, and outputs that must not leave your domain.
  • Latency & continuity: close to the data, no egress chokepoints or vendor windows.
  • Predictable cost: intensive workloads on bare-metal GPUs or private cloud stabilize OPEX.
  • Governance: logging of who/what/when/which source; ingress/egress policies; guardrails.
  • Specific advantage: your corpus and processes, not the generic model, make the difference.

Reference architecture (optionally with PrivateGPT as gateway)

Access plane (Control Plane)

  • HTTP/WS Gateway with SSO (OIDC/SAML), RBAC, rate limiting, quotas, and DLP/PII policies.
  • Policy engine (e.g., OPA/Rego) to decide routing (local model vs. external), what goes in/out, and who sees what.
  • Audit: signed traces and events (see schema below).
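The routing decision the policy engine makes can be sketched in plain Python (a real deployment would encode this in OPA/Rego; `SENSITIVE_TAGS` and the `Request` shape are hypothetical):

```python
from dataclasses import dataclass

# Tags that must never route to an external provider (assumed taxonomy).
SENSITIVE_TAGS = {"pii", "legal", "financial"}

@dataclass
class Request:
    user_role: str
    data_tags: set
    needs_large_model: bool

def route(req: Request) -> str:
    """Return 'local' or 'external'; sensitive data never leaves the gateway."""
    if req.data_tags & SENSITIVE_TAGS:
        return "local"        # policy-deny on external egress
    if req.needs_large_model:
        return "external"     # allow-listed provider only
    return "local"
```

The same logic expressed in Rego would let you change policy without redeploying the gateway.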

Data plane (RAG)

  • Connectors to DMS, wiki, CRM/ERP, tickets, shares; ETL → normalization (PDF→text, OCR, cleaning) → chunking → embeddings → vector store.
  • Vector DB (pgvector/Qdrant/Weaviate/Milvus): namespaces per domain, TTL per collection, at-rest encryption.
  • DLP/PII scanner at ingestion (signals feeding guardrails).
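A minimal sketch of the ingestion-time DLP pass: mask obvious PII before anything is chunked or embedded. The two patterns are illustrative, not an exhaustive scanner:

```python
import re

# Toy DLP patterns: email addresses and IBAN-like strings.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IBAN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b")

def mask_pii(text: str) -> tuple[str, int]:
    """Return masked text and the number of hits (a signal for guardrails)."""
    hits = 0
    for pat in (EMAIL, IBAN):
        text, n = pat.subn("[REDACTED]", text)
        hits += n
    return text, hits
```

In production you would feed the hit count into the policy engine, not just mask and move on.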

Model serving plane

  • LLM host (vLLM/Ollama or commercial orchestrator) for local models (instruct/functions) and routing to external providers where appropriate.
  • Embeddings service dedicated for throughput and caching.
  • Output filters (PII/secrets, toxicity, jailbreak).

Network & security

  • TLS externally; mTLS internally among gateway ↔ RAG ↔ models ↔ vector DB.
  • KMS/HSM for key management, envelope encryption for sensitive logs.
  • Zones: DMZ (gateway), app-net (services), data-net (DBs).
  • Egress control: external destinations allow-listed only; no default route to the internet from RAG pods.

RAG that doesn’t crawl

1) Chunking & metadata

  • General: chunk_size 800–1200 tokens, overlap 100–200.
  • Long docs (contracts/manuals): semantic sub-chunking (titles/sections).
  • Mandatory metadata: source, version, doc_id, section, acl tags, language, timestamp, content hash.
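A sliding-window chunker matching the numbers above (whitespace tokens stand in for real tokenizer tokens; swap in your model's tokenizer in practice):

```python
def chunk(tokens: list, size: int = 1000, overlap: int = 150) -> list:
    """Fixed-size windows with overlap; the last window may be shorter."""
    chunks, step = [], size - overlap
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):
            break
    return chunks
```

For long documents you would run this per section (semantic sub-chunking) rather than over the whole file, and attach the mandatory metadata to each chunk.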

2) Embeddings

  • 384–1024 dims depending on model; normalize and cache (Redis/LFU).
  • Rule of thumb: 1M chunks ≈ 1–2 GB embeddings (dim=384) / 4–8 GB (dim=1024).
  • Avoid re-embedding on small edits; version the corpus.
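Caching by content hash is what makes "avoid re-embedding on small edits" cheap: only chunks whose bytes changed trigger a model call. A minimal in-memory sketch (a real deployment would back this with Redis, as above):

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by SHA-256 of the chunk text."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # your embeddings-service call
        self.store = {}
        self.misses = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```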

3) Retrieval

  • Hybrid (BM25 + vector), k = 20; light re-ranker → 6–10 passages to the prompt.
  • ACL-aware retrieval: filter by acl tags before searching (don’t rely on the LLM for access control).
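The ordering matters: ACL filtering runs before any scoring, so unauthorized text never reaches the re-ranker or the LLM. A toy sketch (term overlap stands in for the BM25 + vector hybrid score):

```python
def retrieve(query_terms: set, user_claims: set, chunks: list, k: int = 20) -> list:
    """Filter by ACL tags first, then score only the visible chunks."""
    visible = [c for c in chunks if c["acl"] <= user_claims]
    scored = sorted(
        visible,
        key=lambda c: len(query_terms & set(c["text"].split())),
        reverse=True,
    )
    return scored[:k]
```

In Qdrant/Weaviate/pgvector this corresponds to a metadata filter on the query itself, not post-filtering of results.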

4) Prompting

  • Versioned templates; inject sources with internal/verifiable URLs; push answers with citations.
  • Stop sequences and max tokens per domain.
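A versioned template with injected sources might look like this (`TEMPLATE_V2` and the passage shape are hypothetical; the point is that the template lives in a repo and the citations carry internal URLs):

```python
TEMPLATE_V2 = (
    "Answer using ONLY the sources below; cite as [n].\n"
    "{sources}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(question: str, passages: list) -> str:
    """Inject retrieved passages with verifiable internal URLs."""
    src = "\n".join(
        f"[{i + 1}] {p['url']}: {p['text']}" for i, p in enumerate(passages)
    )
    return TEMPLATE_V2.format(sources=src, question=question)
```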

Sizing & performance: quick rules

Latency targets (p95)

  • “Fast & useful” policy: embedding search ≤ 80 ms, re-rank ≤ 70 ms, first token ≤ 400 ms.
  • TTFT rises with model size; use a model mix: a small instruct model for fast replies, with fallback to a larger one.

VRAM (coarse estimate)

  • FP16: ≈ 2 × params (GB); INT8: ≈ 1 × params; INT4: ≈ 0.6 × params.
  • A 7B INT4 fits ~4–5 GB; 13B INT4 ~8–10 GB; 70B needs multi-GPU or CPU+disk (slow).
  • Concurrency ≈ min(threads_serving, VRAM / per-request working set).
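The rules above reduce to a one-line estimator (weights only; KV cache and activations add to this per concurrent request):

```python
# Approximate GB of VRAM per billion parameters, per the rules of thumb above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.6}

def vram_gb(params_b: float, quant: str) -> float:
    """Coarse VRAM estimate in GB for model weights at a given quantization."""
    return params_b * BYTES_PER_PARAM[quant]
```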

Vector DB

  • 1M vectors (dim=768) ≈ 3–5 GB; add 30–50% for indexes/metadata.
  • IOPS matters: local NVMe or flash pool for hot collections.

Cost

  • Compare €/1k tokens internal vs. external, and include egress fees. Under heavy workloads, local serving reaches break-even quickly.

Security & compliance (to pass audits without sweating)

  • PII/Secrets shift-left: scan on ingestion and response, with masking and policy-deny.
  • Retention: TTL for prompts, events, and cache; verified deletion (hash after delete).
  • Consent & purpose: tag datasets with purpose (accounting, support, legal); avoid cross-reuse.
  • Identity: SSO with claims: dept, role, region. RAG ACLs consume those claims.
  • Traceability: each response stores req_id, user, sources[], model, hash_input/output, policy_decisions.
  • Backups/DR: vector DB backed up (snapshots + WAL/raft logs); runbook to rebuild indexes from sources if needed (and verify hashes).
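One audit record per response, with the fields listed above, might be built like this (a sketch; signing via HMAC/KMS is elided, and in practice you would sign the digest of the serialized record):

```python
import hashlib
import time
import uuid

def audit_event(user: str, sources: list, model: str,
                prompt: str, output: str, decisions: list) -> dict:
    """Build one traceability record; prompt/output are stored as hashes only."""
    h = lambda s: hashlib.sha256(s.encode()).hexdigest()
    return {
        "req_id": str(uuid.uuid4()),
        "ts": int(time.time()),
        "user": user,
        "sources": sources,
        "model": model,
        "hash_input": h(prompt),
        "hash_output": h(output),
        "policy_decisions": decisions,
    }
```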

Observability & audit

  • Metrics: QPS, TTFT, tokens/s, cache hit-ratio, latencies (p50/p95/p99), retrieval errors, timeouts.
  • Traces (OpenTelemetry): spans for gateway → RAG → search → re-rank → LLM.
  • Logs: stable JSON schema and signature for critical events; PII minimization (hash/salt).
  • Dashboards: per-route latency, embedding drift (a leading sign of retrieval-quality collapse), cost by use case.

AI DevOps: CI/CD for prompts and RAG

  • Prompts as code: repos, branching, review, canaries, rollbacks.
  • Golden sets: fixtures of Q&A per domain (legal, procurement, support).
  • Evaluation: automate faithfulness, groundedness, toxicity, jailbreak, and factual accuracy in QA.
  • Red-teaming: prompt injection, data exfiltration, role overwrite, tool abuse; block & alert.
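The golden-set gate can be as simple as a CI assertion over fixtures (substring matching is a stand-in for real faithfulness/groundedness scoring; `GOLDEN` is a hypothetical fixture):

```python
# Per-domain Q&A fixtures; in practice, dozens per domain, stored in the repo.
GOLDEN = [("What is the PTO policy?", "25 days")]

def eval_golden(answer_fn, threshold: float = 0.9) -> bool:
    """Return True if enough fixtures are answered correctly; fail CI otherwise."""
    hits = sum(expected.lower() in answer_fn(q).lower() for q, expected in GOLDEN)
    return hits / len(GOLDEN) >= threshold
```

Running this on every prompt/template change is what turns "silent regressions" into failed builds.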

(Illustrative) docker-compose for a minimal stack

services:
  gateway:
    image: yourorg/ai-gateway:latest
    env_file: gateway.env
    ports: ["443:8443"]
    depends_on: [retriever, vllm]
    # Enable TLS and internal mTLS; OIDC for SSO
  retriever:
    image: yourorg/rag-service:latest
    environment:
      VECTOR_URL: http://qdrant:6333
      EMBED_URL: http://embed:8080
    depends_on: [qdrant, embed]
  embed:
    image: yourorg/embeddings:latest
    environment:
      MODEL: "bge-base-en-v1.5"
      CACHE_URL: "redis://cache:6379/0"
  vllm:
    image: vllm/vllm-openai:latest
    environment:
      MODEL: "llama-3-8b-instruct-q4"
      VLLM_WORKERS: "2"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  qdrant:
    image: qdrant/qdrant:latest
    volumes: ["./qdrant:/qdrant/storage"]
  cache:
    image: redis:7-alpine

Replace images with your provider/stack. In production: TLS, mTLS, health checks, resource limits, node affinity, backups, and keep secrets out of compose.


12-week plan

Weeks 1–2 — Discovery & risk control

  • Inventory use cases and sensitive data.
  • Quick wins that don’t need regulated data (summaries, internal playbooks).

Weeks 3–5 — Gateway MVP

  • Deploy gateway with SSO/RBAC, mTLS, and audit.
  • Minimal RAG: connectors to shares/wiki, vector DB, embeddings, basic templates.
  • Ingress/egress DLP/PII.

Weeks 6–8 — Calibration & SLAs

  • Define SLOs (p95 TTFT, p95 end-to-end, availability).
  • Load tests: target QPS, concurrency, VRAM and vector DB saturation.
  • Tune chunking, re-rank, and guardrails.

Weeks 9–10 — CI/CD & security

  • Prompts as code, golden sets, automated evaluation.
  • Red-team for prompt injection and exfiltration.
  • Backups, DR, and runbooks.

Weeks 11–12 — Industrialization

  • Open to more areas with role-based playbooks.
  • Costing: €/1k tokens internal vs. external; routing by use case.
  • Review with Legal/Compliance and final hardening.

Anti-patterns you’ll meet (and how to avoid them)

  • “All to SaaS” with sensitive data → bimodal pattern; critical flows go through your gateway.
  • RAG without ACLs → filter by claims/ACL tags before search.
  • Re-embedding on every minor change → version and cache.
  • Verbose logs with PII → minimize & hash + short retention.
  • No golden sets or CI/CD for prompts → silent regressions and “magic” in prod.

Conclusion

We’ve democratized AI; now we must operate it as a critical platform. A private AI Gateway—in the spirit of PrivateGPT—delivers what admins value: security, control, performance, and traceability. The bimodal model (SaaS when it fits, private when it hurts) balances productivity with data sovereignty. Metrics, SLOs, guardrails, and CI/CD do the rest.

Final checklist: SSO/RBAC ✅ mTLS/TLS ✅ DLP/PII ✅ ACL-aware RAG ✅ Signed audit ✅ SLOs/observability ✅ Prompt CI/CD ✅ Red-teaming ✅ Backups/DR ✅

With that in place, AI stops being an experiment and becomes operable infrastructure—and, like any other, it pays off when you keep it under your control.

