In many companies, generative AI is no longer a pilot: it’s a daily tool. The next leap isn’t “using more AI,” it’s running it with discipline and bringing it closer to the data. That means moving part of the brain in-house: a private AI Gateway (in the spirit of PrivateGPT), with RAG over internal sources, auditability, DLP, SSO/RBAC, and no prompt or embedding leakage to third parties.
This article is a technical/operational “runbook”: reference architecture, security, sizing, observability, CI/CD for prompts and RAG, and a 12-week plan to move from ad-hoc SaaS to governed AI infrastructure.
Why a private gateway?
- Compliance & confidentiality: prompts, attachments, and outputs that must not leave your domain.
- Latency & continuity: close to the data, no egress chokepoints or vendor windows.
- Predictable cost: intensive workloads on bare-metal GPUs or private cloud stabilize OPEX.
- Governance: logging of who/what/when/which source; ingress/egress policies; guardrails.
- Specific advantage: your corpus and processes, not the generic model, make the difference.
Reference architecture (optionally with PrivateGPT as gateway)
Access plane (Control Plane)
- HTTP/WS Gateway with SSO (OIDC/SAML), RBAC, rate limiting, quotas, and DLP/PII policies.
- Policy engine (e.g., OPA/Rego) to decide routing (local model vs. external), what goes in/out, and who sees what.
- Audit: signed traces and events (see schema below).
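The routing decision the policy engine makes can be sketched in a few lines — a minimal Python stand-in for what would be OPA/Rego rules in production; the department names and DLP flag labels are hypothetical:

```python
# Minimal routing-policy sketch. In production this logic lives in
# OPA/Rego behind the gateway, not in application code; the flag
# labels and department allow-list below are illustrative only.

def route_request(user_claims: dict, dlp_flags: set, target_model: str) -> dict:
    """Decide local vs. external routing and record why."""
    # Any PII/secret signal forces the request onto local serving.
    if dlp_flags & {"pii", "secret", "confidential"}:
        return {"route": "local", "reason": "dlp-flagged", "model": "local-instruct"}
    # External providers only for allow-listed departments (default-deny egress).
    if user_claims.get("dept") in {"marketing", "support"}:
        return {"route": "external", "reason": "allow-listed dept", "model": target_model}
    return {"route": "local", "reason": "default-deny egress", "model": "local-instruct"}
```

The decision dict feeds both the router and the audit trail, so every response can later show why it stayed in-house or went out.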
Data plane (RAG)
- Connectors to DMS, wiki, CRM/ERP, tickets, shares; ETL → normalization (PDF→text, OCR, cleaning) → chunking → embeddings → vector store.
- Vector DB (pgvector/Qdrant/Weaviate/Milvus): namespaces per domain, TTL per collection, at-rest encryption.
- DLP/PII scanner at ingestion (signals feeding guardrails).
Model serving plane
- LLM host (vLLM/Ollama or commercial orchestrator) for local models (instruct/functions) and routing to external providers where appropriate.
- Dedicated embeddings service for throughput and caching.
- Output filters (PII/secrets, toxicity, jailbreak).
Network & security
- TLS externally; mTLS internally among gateway ↔ RAG ↔ models ↔ vector DB.
- KMS/HSM for key management, envelope encryption for sensitive logs.
- Zones: DMZ (gateway), app-net (services), data-net (DBs).
- Egress control: external destinations allow-listed only; no default route to the internet from RAG pods.
RAG that doesn’t slow to a crawl
1) Chunking & metadata
- General: chunk_size 800–1200 tokens, overlap 100–200.
- Long docs (contracts/manuals): semantic sub-chunking (titles/sections).
- Mandatory metadata: source, version, doc_id, section, acl tags, language, timestamp, content hash.
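A fixed-size chunker with overlap and the mandatory metadata can be sketched as follows (the metadata keys mirror the list above; the step arithmetic is the standard size-minus-overlap sliding window):

```python
import hashlib

def chunk(tokens: list, doc_meta: dict, size: int = 1000, overlap: int = 150) -> list:
    """Fixed-size chunking with overlap; attaches mandatory metadata per chunk."""
    chunks, start, idx = [], 0, 0
    while start < len(tokens):
        text = " ".join(tokens[start:start + size])
        chunks.append({
            **doc_meta,  # source, version, doc_id, acl tags, language, timestamp, ...
            "section": idx,
            "content_hash": hashlib.sha256(text.encode()).hexdigest(),
            "text": text,
        })
        idx += 1
        start += size - overlap  # slide forward, keeping `overlap` tokens of context
    return chunks
```

Semantic sub-chunking for long documents replaces the fixed window with section boundaries, but keeps the same metadata contract.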
2) Embeddings
- 384–1024 dims depending on model; normalize and cache (Redis/LFU).
- Rule of thumb: 1M chunks ≈ 1–2 GB embeddings (dim=384) / 4–8 GB (dim=1024).
- Avoid re-embedding on small edits; version the corpus.
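Keying the cache on the content hash gives both normalization and the "avoid re-embedding" rule for free — an unchanged chunk never hits the model twice. A minimal sketch (an in-memory dict stands in for Redis):

```python
import hashlib
import math

_cache = {}  # stand-in for Redis/LFU

def embed_cached(text: str, embed_fn) -> list:
    """Unit-normalize and cache embeddings keyed by content hash."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in _cache:
        return _cache[key]          # unchanged content: no model call
    vec = embed_fn(text)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    vec = [x / norm for x in vec]   # unit vectors make cosine = dot product
    _cache[key] = vec
    return vec
```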
3) Retrieval
- Hybrid (BM25 + vector), k = 20; light re-ranker → 6–10 passages to the prompt.
- ACL-aware retrieval: filter by acl tags before searching (don’t rely on the LLM for access control).
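The two rules above — blend BM25 with vector similarity, and apply ACL filtering before any scoring — can be sketched as plain Python (a real deployment pushes the ACL filter down into the vector DB query; the `alpha` blend weight is an assumption to tune):

```python
def hybrid_retrieve(query_vec, bm25_scores, passages, user_acls, k=20, alpha=0.5):
    """ACL-aware hybrid retrieval: filter by acl tags first, then blend
    BM25 and cosine scores; a re-ranker would trim the top-k to 6-10."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb or 1.0)

    # Access control happens HERE, never in the LLM.
    allowed = [p for p in passages if p["acl"] & user_acls]
    scored = [
        (alpha * bm25_scores[p["id"]] + (1 - alpha) * cosine(query_vec, p["vec"]), p)
        for p in allowed
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored[:k]]
```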
4) Prompting
- Versioned templates; inject sources with internal/verifiable URLs; push answers with citations.
- Stop sequences and max tokens per domain.
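A versioned, citation-forcing template might look like this (the template id scheme and wording are hypothetical; the point is that the template lives in a repo under a version, not in someone's head):

```python
PROMPT_VERSION = "support/answer-with-citations@1.3.0"  # hypothetical template id

TEMPLATE = """You are an internal assistant. Answer ONLY from the sources.
Cite each claim as [n] and list the internal URLs at the end.

Sources:
{sources}

Question: {question}
Answer (say "not in the sources" if unsupported):"""

def build_prompt(question: str, passages: list) -> str:
    """Inject numbered, internally-verifiable sources into the template."""
    sources = "\n".join(
        f"[{i + 1}] {p['url']} — {p['text']}" for i, p in enumerate(passages)
    )
    return TEMPLATE.format(sources=sources, question=question)
```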
Sizing & performance: quick rules
Latency targets (p95)
- “Fast & useful” policy: embedding search ≤ 80 ms, re-rank ≤ 70 ms, first token ≤ 400 ms.
- TTFT rises with model size; use a model mix: small instruct (fast reply) + fallback to larger.
VRAM (coarse estimate)
- FP16: ≈ 2 × params (GB); INT8: ≈ 1 × params; INT4: ≈ 0.6 × params.
- A 7B INT4 fits in ~4–5 GB; a 13B INT4 in ~8–10 GB; 70B needs multi-GPU or CPU+disk (slow).
- Concurrency ≈ min(threads_serving, VRAM / per-request working set).
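The rules of thumb above as a back-of-the-envelope calculator (weights only — KV cache and activations come on top, so treat the numbers as a floor):

```python
def vram_estimate_gb(params_b: float, quant: str) -> float:
    """Coarse VRAM estimate for model weights (KV cache/activations extra)."""
    factor = {"fp16": 2.0, "int8": 1.0, "int4": 0.6}[quant]
    return params_b * factor

def max_concurrency(serving_threads: int, vram_gb: float, per_request_gb: float) -> int:
    """Concurrency bounded by serving threads or per-request working set."""
    return min(serving_threads, int(vram_gb // per_request_gb))

# 7B INT4 -> ~4.2 GB of weights, consistent with the ~4-5 GB rule above.
```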
Vector DB
- 1M vectors (dim=768) ≈ 3–5 GB; add 30–50% for indexes/metadata.
- IOPS matters: local NVMe or flash pool for hot collections.
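The footprint rule follows directly from float32 storage (4 bytes per dimension) plus index/metadata overhead — a quick sanity check:

```python
def vector_store_gb(n_vectors: int, dim: int, index_overhead: float = 0.4) -> float:
    """Raw float32 vectors plus the 30-50% index/metadata overhead."""
    raw = n_vectors * dim * 4 / 1024**3  # 4 bytes per float32 dimension
    return raw * (1 + index_overhead)

# 1M vectors at dim=768 -> ~2.9 GB raw, ~4 GB with overhead,
# which lands inside the 3-5 GB rule of thumb above.
```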
Cost
- Compare €/k tokens for internal vs. external serving, and add egress costs. With heavy workloads, the break-even for local serving comes quickly.
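A simple break-even sketch (all figures hypothetical — plug in your own GPU capex, power/ops costs, and provider pricing):

```python
def breakeven_months(gpu_capex_eur: float, local_opex_month_eur: float,
                     ext_price_per_ktok_eur: float, monthly_ktokens: float) -> float:
    """Months until local serving beats the external API; inf if it never does."""
    ext_month = ext_price_per_ktok_eur * monthly_ktokens
    saving = ext_month - local_opex_month_eur
    return float("inf") if saving <= 0 else gpu_capex_eur / saving

# Hypothetical: €15k GPU server, €800/month opex, €0.002/k tokens external,
# 2M k-tokens/month -> break-even in under 5 months.
```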
Security & compliance (to pass audits without sweating)
- PII/Secrets shift-left: scan on ingestion and response, with masking and policy-deny.
- Retention: TTL for prompts, events, and cache; verified deletion (hash after delete).
- Consent & purpose: tag datasets with purpose (accounting, support, legal); avoid cross-reuse.
- Identity: SSO with claims: dept, role, region. RAG ACLs consume those claims.
- Traceability: each response stores req_id, user, sources[], model, hash_input/output, policy_decisions.
- Backups/DR: vector DB backed up (snapshots + WAL/raft logs); runbook to rebuild indexes from sources if needed (and verify hashes).
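A signed audit event with exactly those fields can be sketched as follows — prompts and outputs are stored as hashes (no raw PII in the log), and the HMAC key would come from the KMS/HSM, not a constant:

```python
import hashlib
import hmac
import json

AUDIT_KEY = b"rotate-me-via-kms"  # placeholder: fetch from KMS/HSM in production

def audit_event(req_id, user, sources, model, prompt, output, decisions):
    """Build a tamper-evident audit record; content is hashed, not stored raw."""
    event = {
        "req_id": req_id,
        "user": user,
        "sources": sources,
        "model": model,
        "hash_input": hashlib.sha256(prompt.encode()).hexdigest(),
        "hash_output": hashlib.sha256(output.encode()).hexdigest(),
        "policy_decisions": decisions,
    }
    payload = json.dumps(event, sort_keys=True).encode()  # canonical form
    event["signature"] = hmac.new(AUDIT_KEY, payload, hashlib.sha256).hexdigest()
    return event
```

Verification recomputes the HMAC over the canonical JSON without the signature field; any edit to the record breaks it.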
Observability & audit
- Metrics: QPS, TTFT, tokens/s, cache hit-ratio, latencies (p50/p95/p99), retrieval errors, timeouts.
- Traces (OpenTelemetry): spans for gateway → RAG → search → re-rank → LLM.
- Logs: stable JSON schema and signature for critical events; PII minimization (hash/salt).
- Dashboards: per-route latency, embedding drift (quality collapses), cost by use case.
AI DevOps: CI/CD for prompts and RAG
- Prompts as code: repos, branching, review, canaries, rollbacks.
- Golden sets: fixtures of Q&A per domain (legal, procurement, support).
- Evaluation: automate faithfulness, groundedness, toxicity, jailbreak, and factual accuracy in QA.
- Red-teaming: prompt injection, data exfiltration, role overwrite, tool abuse; block & alert.
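A tiny golden-set gate for CI might look like this — a minimal groundedness check (required fact present, at least one citation), assuming `answer_fn` returns a dict with `text` and `sources`; real pipelines layer faithfulness, toxicity, and jailbreak suites on top:

```python
def evaluate_golden_set(answer_fn, fixtures):
    """Run Q&A fixtures through the pipeline; return a list of failures.
    An empty list means the gate passes and the prompt change can ship."""
    failures = []
    for fx in fixtures:
        ans = answer_fn(fx["question"])
        if fx["must_contain"].lower() not in ans["text"].lower():
            failures.append((fx["question"], "missing fact"))
        if not ans["sources"]:
            failures.append((fx["question"], "no citation"))
    return failures
```

Wire it into CI so a prompt or retrieval change that regresses any fixture blocks the merge — that is what turns "magic in prod" into reviewable diffs.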
(Illustrative) docker-compose for a minimal stack
services:
  gateway:
    image: yourorg/ai-gateway:latest
    env_file: gateway.env
    ports: ["443:8443"]
    depends_on: [retriever, vllm]
    # Enable TLS and internal mTLS; OIDC for SSO
  retriever:
    image: yourorg/rag-service:latest
    environment:
      VECTOR_URL: http://qdrant:6333
      EMBED_URL: http://embed:8080
    depends_on: [qdrant, embed]
  embed:
    image: yourorg/embeddings:latest
    environment:
      MODEL: "bge-base-en-v1.5"
      CACHE_URL: "redis://cache:6379/0"
  vllm:
    image: vllm/vllm-openai:latest
    environment:
      MODEL: "llama-3-8b-instruct-q4"
      VLLM_WORKERS: "2"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  qdrant:
    image: qdrant/qdrant:latest
    volumes: ["./qdrant:/qdrant/storage"]
  cache:
    image: redis:7-alpine
Replace images with your provider/stack. In production: TLS, mTLS, health checks, resource limits, node affinity, backups, and keep secrets out of compose.
12-week plan
Weeks 1–2 — Discovery & risk control
- Inventory use cases and sensitive data.
- Quick wins that don’t need regulated data (summaries, internal playbooks).
Weeks 3–5 — Gateway MVP
- Deploy gateway with SSO/RBAC, mTLS, and audit.
- Minimal RAG: connectors to shares/wiki, vector DB, embeddings, basic templates.
- Ingress/egress DLP/PII.
Weeks 6–8 — Calibration & SLAs
- Define SLOs (p95 TTFT, p95 end-to-end, availability).
- Load tests: target QPS, concurrency, VRAM and vector DB saturation.
- Tune chunking, re-rank, and guardrails.
Weeks 9–10 — CI/CD & security
- Prompts as code, golden sets, automated evaluation.
- Red-team for prompt injection and exfiltration.
- Backups, DR, and runbooks.
Weeks 11–12 — Industrialization
- Open to more areas with role-based playbooks.
- Costing: €/k internal vs external tokens; routing by use case.
- Review with Legal/Compliance and final hardening.
Anti-patterns you’ll meet (and how to avoid them)
- “All to SaaS” with sensitive data → bimodal pattern; critical flows go through your gateway.
- RAG without ACLs → filter by claims/ACL tags before search.
- Re-embedding on every minor change → version and cache.
- Verbose logs with PII → minimize & hash + short retention.
- No golden sets or CI/CD for prompts → silent regressions and “magic” in prod.
Conclusion
We’ve democratized AI; now we must operate it as a critical platform. A private AI Gateway—in the spirit of PrivateGPT—delivers what admins value: security, control, performance, and traceability. The bimodal model (SaaS when it fits, private when it hurts) balances productivity with data sovereignty. Metrics, SLOs, guardrails, and CI/CD do the rest.
Final checklist: SSO/RBAC ✅ mTLS/TLS ✅ DLP/PII ✅ ACL-aware RAG ✅ Signed audit ✅ SLOs/observability ✅ Prompt CI/CD ✅ Red-teaming ✅ Backups/DR ✅
With that in place, AI stops being an experiment and becomes operable infrastructure—and, like any other, it pays off when you keep it under your control.
Sources: Wharton and artificial intelligence news.
