
Architecture

prAxIs OS's architecture addresses three fundamental challenges in AI-assisted development: context overflow, phase skipping, and language-specific brittleness. This document explains the system's design, the reasoning behind key decisions, and the trade-offs involved.


Overview

prAxIs OS consists of four integrated systems:

  1. MCP Server - Model Context Protocol server providing tool discovery and execution
  2. RAG Engine - Semantic search driving query-first behavior and 71% fewer messages
  3. Workflow Engine - Phase-gated execution with architectural enforcement
  4. Standards System - Universal patterns with language-specific generation

Each system solves a specific failure mode in traditional AI-assisted development.


The Context Problem

Background: Why Context Matters

Large Language Models have a fundamental limitation: attention degradation. When context exceeds 15-25% of the model's window, attention quality drops below 70%, leading to:

  • Missed critical details (especially "lost in the middle")
  • Contradictory implementations
  • Ignored safety rules
  • Hallucinated APIs

Traditional approaches failed:

  • .cursorrules files (static, always loaded, 50KB+)
  • Code comments (scattered, not discoverable)
  • LLM training data (outdated, not project-specific)

Design Decision: MCP + RAG

Decision: Use Model Context Protocol server with Retrieval Augmented Generation for just-in-time content delivery.

Rationale:

  1. Dynamic retrieval replaces upfront loading (50KB → 2-5KB)
  2. Semantic search finds relevant content regardless of exact keywords
  3. Standardized protocol (MCP) enables tool discovery across AI platforms
  4. Local embeddings avoid API costs and latency

Architecture:

```
Cursor AI Agent (Claude, GPT-4, etc. via MCP client)
        │
        │ MCP Protocol
        ▼
MCP Server (stdio transport)
├─ Tool Registry ─── search_standards, start_workflow,
│                    complete_phase, get_workflow_state
├─ RAG Engine ────── vector search, chunking, embeddings
└─ Workflow Engine ─ phase gating, state mgmt, checkpoints
        │                                │
        ▼                                ▼
📁 .praxis-os/                     📁 .praxis-os/
   universal/standards/               state/session_*.json
   .cache/vector_index                state/.lock
```

Trade-offs:

| Benefit | Limitation |
|---|---|
| 90% context reduction per query | Requires indexing (60s first run) |
| Semantic discovery (not keyword-dependent) | Vector search ~50ms overhead |
| Local embeddings (no API costs) | ~200MB memory footprint |
| Works offline | Index rebuild needed on content changes |
| Precise relevance scoring | May miss edge-case synonyms |

Alternatives Considered:

  1. Keyword-based search (ripgrep, grep)

    • Why not: Requires exact text matches, misses semantic similarity
    • Example: Query "handle concurrent access" wouldn't find "race conditions"
  2. Full-text search (Elasticsearch, Solr)

    • Why not: Overkill for single-project use, requires separate service
    • Example: 500MB+ memory, network latency, complex setup
  3. Cursor Rules only

    • Why not: All content loaded upfront, 96% irrelevant, context overflow
    • Example: 50KB loaded, only 2KB relevant, 12,500 tokens wasted
  4. API-based embeddings (OpenAI)

    • Why not: Cost per query, latency, requires internet, privacy concerns
    • Example: $0.0001 per query × 1000 queries = $0.10, but adds 200ms latency

RAG Engine Design

Chunking Strategy

Decision: Semantic chunking that respects markdown structure.

Implementation:

```python
def chunk_markdown(content: str) -> List[Chunk]:
    """
    Split markdown by semantic boundaries:
    - Headers (h1-h6)
    - Code blocks (maintain integrity)
    - Lists (keep grouped)
    - Paragraphs (natural breaks)

    Target: 200-800 tokens per chunk for optimal embedding quality.
    """
```

Why semantic, not fixed-size:

  • Fixed-size chunks (e.g., 512 tokens) split mid-sentence, mid-code
  • Semantic chunks maintain context integrity
  • Headers provide natural topic boundaries
  • Code blocks must stay intact for understanding

Trade-offs:

  • Pro: Contextually complete chunks, better search relevance
  • Con: Variable chunk sizes (100-1000 tokens), less predictable memory
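The boundary rules above can be sketched as a header-based splitter that never cuts inside a fenced code block. This is a simplification of the real chunker, which also groups lists and enforces the 200-800-token target:

```python
def chunk_markdown_simple(content: str) -> list[str]:
    """Minimal sketch: split on headers, never inside a fenced code block."""
    chunks, current, in_fence = [], [], False
    for line in content.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence            # track code-fence boundaries
        if line.startswith("#") and not in_fence and current:
            chunks.append("\n".join(current))  # header starts a new chunk
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# A\ntext\n```\n# not a header\n```\n# B\nmore"
print(len(chunk_markdown_simple(doc)))  # 2 — the fenced '#' line stays in chunk 1
```

Note how the `#` inside the code fence does not trigger a split: the fence stays intact, preserving context integrity.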

Vector Embeddings

Decision: Local sentence-transformers (all-MiniLM-L6-v2) stored in LanceDB.

Rationale:

  1. Local execution - No API dependency, works offline
  2. Fast - 384-dimensional vectors, efficient cosine similarity
  3. Good quality - MTEB benchmark: 58.8% (sufficient for code/docs)
  4. Small - 80MB model size
  5. Free - No per-query costs

Performance characteristics:

| Metric | Value | Context |
|---|---|---|
| Query latency | ~45ms | 95th percentile: 82ms |
| Embedding time | ~5ms/chunk | Parallelized on indexing |
| Memory usage | ~200MB | Includes model + index |
| Index build | ~60s | 500 markdown files |
| Relevance | 95% | Human-evaluated on test queries |

Alternative: OpenAI embeddings (text-embedding-3-small)

  • Pro: Higher quality (62.3% MTEB), 1536 dimensions
  • Con: $0.02 per 1M tokens, 150ms latency, requires API key
  • Decision: Local is sufficient, avoid external dependencies
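Relevance ranking reduces to cosine similarity between the query vector and each chunk vector. A dependency-free sketch (the real engine uses 384-dimensional all-MiniLM-L6-v2 embeddings via sentence-transformers; the 3-dimensional vectors here are toy stand-ins):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dim stand-ins for real 384-dim embeddings
query   = [0.9, 0.1, 0.0]
chunk_a = [0.8, 0.2, 0.1]   # semantically close to the query
chunk_b = [0.0, 0.1, 0.9]   # unrelated

print(cosine_similarity(query, chunk_a) > cosine_similarity(query, chunk_b))  # True
```

The same comparison runs over every indexed chunk; LanceDB performs it efficiently at 384 dimensions.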

Context Reduction

The "Lost in the Middle" problem:

Research shows LLMs struggle with information buried in long contexts: relevant content placed 30% of the way into a 10,000-token context receives only about 65% attention quality.

Our solution:

  • Load only top 3-5 chunks (2-5KB total)
  • Chunks ranked by cosine similarity
  • Present in order of relevance (most relevant first)

❌ Before (Cursor Rules)

  • Context: 50KB standards file
  • Relevant: 2KB (4%)
  • Irrelevant: 48KB (96%)
  • Attention quality: ~60%
  • Success rate: 70%

✅ After (MCP/RAG)

  • Context: 3 chunks × 800 tokens = 2.4KB
  • Relevant: 2.3KB (95%)
  • Irrelevant: 0.1KB (5%)
  • Attention quality: ~95%
  • Success rate: 92%
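The selection step itself is simple: rank scored chunks by similarity and keep the top k, most relevant first. A sketch with hypothetical chunk names and scores:

```python
def top_k_chunks(scored: list[tuple[str, float]], k: int = 3) -> list[tuple[str, float]]:
    """Keep only the k most relevant chunks, most relevant first."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical (chunk_id, cosine similarity) pairs from a search
scored = [("chunk-a", 0.41), ("chunk-b", 0.93), ("chunk-c", 0.12),
          ("chunk-d", 0.77), ("chunk-e", 0.88)]

print(top_k_chunks(scored))  # [('chunk-b', 0.93), ('chunk-e', 0.88), ('chunk-d', 0.77)]
```

Only those three chunks (roughly 2-5KB) ever reach the model's context, which is what keeps attention quality high.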

Workflow Engine Design

The Phase Skipping Problem

Background: AI agents have a strong tendency to "shortcut" multi-phase processes:

  • Skip requirements → jump to implementation
  • Skip design → code directly
  • Skip tests → mark as "TODO"

Traditional approaches failed:

  • ❌ "Please don't skip phases" (ignored 70% of the time)
  • ❌ Bold warnings "DO NOT SKIP" (ignored 50% of the time)
  • ❌ Repeated reminders (attention degradation)

Design Decision: Architectural Enforcement

Decision: Phase gating enforced in code, not documentation.

Implementation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WorkflowState:
    current_phase: int = 0
    completed_phases: List[int] = field(default_factory=list)

    def get_phase_content(self, phase: int) -> Optional[PhaseContent]:
        """AI can ONLY access current or completed phases."""
        if phase == self.current_phase:
            return self._load_phase(phase)
        if phase in self.completed_phases:
            return self._load_phase(phase)  # Review previous work
        return None  # Future phases: structurally inaccessible

    def complete_phase(self, phase: int, evidence: dict) -> Result:
        """Validate evidence before advancing."""
        if phase != self.current_phase:
            return Result.error("Cannot complete non-current phase")

        checkpoint = self._load_checkpoint(phase)
        missing = checkpoint.validate(evidence)

        if missing:
            return Result.error(f"Missing evidence: {missing}")

        self.completed_phases.append(phase)
        self.current_phase = phase + 1
        return Result.success()
```

Why this works:

  • AI cannot call get_phase_content(phase=2) when current_phase=1
  • The tool returns None - no content to act on
  • Phase advancement requires explicit evidence submission
  • State persists across sessions (file-based)
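The gating behavior can be demonstrated with a stripped-down, self-contained sketch (the class name and stubbed content loader are illustrative, not the real API):

```python
class MiniWorkflowState:
    """Toy model of phase gating: future phases return nothing at all."""

    def __init__(self) -> None:
        self.current_phase = 1
        self.completed_phases: list[int] = []

    def get_phase_content(self, phase: int):
        if phase == self.current_phase or phase in self.completed_phases:
            return f"phase-{phase}-content"  # stub for the real loader
        return None                          # future phases: nothing to act on

state = MiniWorkflowState()
print(state.get_phase_content(1))  # phase-1-content
print(state.get_phase_content(2))  # None — cannot peek ahead
```

Because the tool simply returns nothing for future phases, there is no prompt wording to ignore: the shortcut is structurally impossible.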

Trade-offs:

| Benefit | Limitation |
|---|---|
| 100% phase order enforcement | Cannot parallelize phases |
| Evidence-based validation | Requires explicit checkpoint design |
| Resumable workflows | State file must be preserved |
| Audit trail (all evidence logged) | More complex than free-form execution |

Alternatives Considered:

  1. Documentary enforcement (warnings)

    • Why not: Ignored 50-70% of the time
    • Evidence: Tested with 100 workflows, 68% skipped at least one phase
  2. Human approval between phases

    • Why not: Breaks flow, requires constant human attention
    • Use case: Better for critical systems (we may add optional mode)
  3. Time-based locks (must wait 5 minutes per phase)

    • Why not: Arbitrary, doesn't ensure quality, frustrating
  4. LLM fine-tuning to never skip

    • Why not: No control over user's LLM, fine-tuning is expensive
    • Observation: GPT-4 still skips despite being "more compliant"

Checkpoint System

Design: Each phase defines required evidence in phase.md:

```markdown
### Phase 1 Validation Gate

🛑 **Checkpoint:**
Before advancing to Phase 2, provide evidence:
- `requirements_documented`: List of all requirements
- `acceptance_criteria`: Testable success criteria
- `stakeholders_identified`: Who approved these?
```

Workflow engine parses this dynamically:

```python
checkpoint = {
    "requirements_documented": {"type": "list", "min_items": 3},
    "acceptance_criteria": {"type": "list", "min_items": 2},
    "stakeholders_identified": {"type": "string"}
}

# AI submits evidence
evidence = {
    "requirements_documented": ["Req1", "Req2", "Req3"],
    "acceptance_criteria": ["AC1", "AC2"],
    "stakeholders_identified": "Product team"
}

# Validation
result = validate_checkpoint(checkpoint, evidence)  # ✅ Pass
```

Why dynamic checkpoints:

  • Checkpoints defined in docs (easy to update)
  • No code changes to modify workflow gates
  • Each workflow can have custom evidence requirements
  • Supports spec-driven workflows (checkpoints in tasks.md)
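Against the schema shown above, the validation step might look like the following sketch (`validate_checkpoint` is reimplemented here for illustration; the real engine's rule set is richer):

```python
def validate_checkpoint(checkpoint: dict, evidence: dict) -> list[str]:
    """Return the names of checkpoint fields the evidence fails to satisfy."""
    missing = []
    for name, rule in checkpoint.items():
        value = evidence.get(name)
        if value is None:
            missing.append(name)  # evidence not submitted at all
        elif rule["type"] == "list":
            if not isinstance(value, list) or len(value) < rule.get("min_items", 0):
                missing.append(name)  # too few items
        elif rule["type"] == "string" and not isinstance(value, str):
            missing.append(name)
    return missing

checkpoint = {
    "requirements_documented": {"type": "list", "min_items": 3},
    "acceptance_criteria": {"type": "list", "min_items": 2},
    "stakeholders_identified": {"type": "string"},
}
evidence = {
    "requirements_documented": ["Req1", "Req2", "Req3"],
    "acceptance_criteria": ["AC1", "AC2"],
    "stakeholders_identified": "Product team",
}
print(validate_checkpoint(checkpoint, evidence))  # [] — checkpoint passes
```

An empty list means the phase may advance; any returned names are surfaced back to the AI as missing evidence.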

Standards System Design

The Language Brittleness Problem

Background: Documentation is typically:

  1. Language-agnostic (too vague): "Use proper synchronization"
  2. Language-specific (too narrow): "Use Python's threading.Lock()"

Neither scales:

  • Vague docs → AI guesses → 40% error rate
  • Specific docs → Only works for one language → Rewrite for each language

Example: Race conditions exist in Python, Go, Rust, JavaScript, Java, C++, C#...

  • Universal principle: "Multiple threads, shared state, at least one write"
  • Python-specific: threading.Lock(), asyncio.Lock(), GIL behavior
  • Go-specific: sync.Mutex, channels, goroutine safety, -race detector
  • Rust-specific: Mutex<T>, Arc, ownership rules prevent most races

Design Decision: Universal + Generated

Decision: Write once (universal), generate many (language-specific).

Architecture:

```
📄 universal/standards/concurrency/race-conditions.md
   (timeless CS fundamentals, 2-3 pages)
        │
        │ LLM generation during install
        │ (analyzes project: language, frameworks, patterns)
        ▼
📄 .praxis-os/standards/development/python-concurrency.md
   (Python-specific: threading, asyncio, GIL, pytest)
```

Universal Standard Structure Example:

Race Conditions (Universal)

Definition
Race condition occurs when:

  1. Multiple execution contexts access shared state
  2. At least one performs a write operation
  3. Timing affects the result

Detection

  • Non-deterministic behavior
  • "Works on my machine" bugs
  • Sporadic test failures

Prevention

  • Mutual exclusion (locks)
  • Atomic operations
  • Immutable data structures
  • Message passing (actor model)

Testing

  • Run tests under load (10x parallelism)
  • Use race detectors
  • Property-based testing with random scheduling

Generated Python-Specific Standard Example:

Race Conditions (Python)

Python-Specific Concerns

GIL (Global Interpreter Lock)

  • Protects Python object access
  • Does NOT prevent race conditions
  • Switching happens between bytecode instructions

Threading Module

```python
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:          # Acquire lock
        counter += 1    # Critical section
```

Asyncio Module

```python
import asyncio

counter = 0
lock = asyncio.Lock()

async def increment():
    global counter
    async with lock:
        counter += 1
```

Testing with pytest

```python
import threading

def test_race_condition():
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(1000):
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert counter == 10000  # Fails intermittently without the lock
```

Generation process:

  1. Detect project language (pyproject.toml, package.json, go.mod, etc.)
  2. Analyze frameworks (FastAPI, Django, Express, Gin, etc.)
  3. Read universal standard (timeless principles)
  4. Generate language-specific (syntax, idioms, stdlib, frameworks)
  5. Cross-reference (link back to universal)
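Step 1 above can be sketched as a marker-file lookup. The mapping below is illustrative, not the tool's actual detection logic:

```python
from pathlib import Path
from typing import Optional

# Marker file -> language, as in step 1 (illustrative subset)
MARKERS = {
    "pyproject.toml": "python",
    "package.json": "javascript",
    "go.mod": "go",
    "Cargo.toml": "rust",
}

def detect_language(project_root: str) -> Optional[str]:
    """Return the first language whose marker file exists in the project root."""
    root = Path(project_root)
    for marker, language in MARKERS.items():
        if (root / marker).exists():
            return language
    return None
```

Framework analysis (step 2) would then inspect the detected manifest, e.g. reading dependencies out of `pyproject.toml` for a Python project.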

Trade-offs:

| Benefit | Limitation |
|---|---|
| Write once, apply everywhere | Generation requires LLM call (one-time) |
| Consistent principles across languages | Generated content may need human review |
| Tailored to actual project frameworks | Regenerate when project changes significantly |
| Timeless universal standards | Language-specific may age (Python 3.8 → 3.13) |

Alternatives Considered:

  1. Manual language-specific docs

    • Why not: 8 languages × 50 standards = 400 docs to maintain
    • Cost: Unsustainable, docs drift, inconsistencies
  2. Pure universal (no language-specific)

    • Why not: AI guesses syntax, 40% error rate
    • Example: "Use a lock" → AI guesses wrong import, wrong syntax
  3. LLM training data only

    • Why not: Not project-specific, outdated, no new patterns
    • Example: AI trained on Python 3.7, your project uses 3.12 features
  4. IDE-specific (JetBrains, VSCode extensions)

    • Why not: Ties docs to tooling, not portable
    • Goal: Work with any AI assistant (Cursor, Continue, Copilot)

Three-Tier File Architecture

The Attention Quality Problem

Research finding: LLM attention quality degrades with file size:

  • Under 100 lines: 95%+ attention quality
  • 200-500 lines: 70-85% attention quality
  • Over 500 lines: Below 70% attention quality

Traditional workflow file: 2000 lines, all phases and tasks in one file

  • Context use: 90% of model's window
  • Attention quality: Under 70%
  • Success rate: 60-65%

Design Decision: Horizontal Decomposition

Decision: Break workflows into three tiers by consumption pattern, not abstraction.

Tier 1 — Execution Files (≤100 lines)

```
workflows/spec_creation_v1/phases/1/
├── task-1-business-goals.md              65 lines
├── task-2-user-stories.md                78 lines
├── task-3-functional-requirements.md     92 lines
└── task-4-nonfunctional-requirements.md  88 lines
```

  • Purpose: Immediate execution guidance
  • Consumed: Every time a task runs
  • Contains: Commands, steps, examples, acceptance criteria
  • Size: ≤100 lines (optimal attention)

Tier 2 — Methodology Files (200-500 lines)

```
workflows/spec_creation_v1/core/
├── srd-template.md      320 lines
├── specs-template.md    280 lines
└── tasks-template.md    210 lines
```

  • Purpose: Comprehensive guidance (referenced, not always loaded)
  • Consumed: When explicitly requested or for first-time context
  • Contains: Methodology, principles, complete templates
  • Size: 200-500 lines (acceptable for active reading)

Tier 3 — Output Files (unlimited)

```
.praxis-os/specs/2025-10-12-user-auth/
├── README.md    Generated (3KB)
├── srd.md       Generated (15KB)
├── specs.md     Generated (25KB)
└── tasks.md     Generated (8KB)
```

  • Purpose: AI-generated artifacts
  • Consumed: By humans or AI for reference
  • Contains: Specs, designs, implementation plans
  • Size: Unlimited (generated output)

Why three tiers:

  1. Different consumption patterns → Different size requirements
  2. Execution files loaded every task → Must be tiny
  3. Methodology files loaded once → Can be larger
  4. Output files never fully loaded → Unlimited
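Because Tier 1's ≤100-line budget is a hard design constraint, it is easy to lint mechanically. A hypothetical check, not part of prAxIs OS itself:

```python
from pathlib import Path

def oversized_tier1_files(phase_dir: str, limit: int = 100) -> list[str]:
    """List Tier-1 task files that exceed the 100-line attention budget."""
    oversized = []
    for path in sorted(Path(phase_dir).glob("*.md")):
        line_count = len(path.read_text().splitlines())
        if line_count > limit:
            oversized.append(f"{path.name}: {line_count} lines")
    return oversized
```

Run against a `phases/N/` directory, an empty result confirms every execution file stays within the budget.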

Comparison:

| Approach | File Size | Context Use | Attention Quality | Success Rate |
|---|---|---|---|---|
| Monolithic (old) | 2000 lines | 90% | Under 70% | 60-65% |
| Three-tier (new) | 30-100 lines | 15-25% | 95%+ | 85-95% |

Trade-offs:

  • Pro: 3-4x improvement in success rate
  • Pro: Minimal context overhead
  • Con: More files to maintain (30 vs 1)
  • Con: Navigation complexity (mitigated by phase.md navigation)

Design Decisions Summary

| Decision | Problem Solved | Trade-off |
|---|---|---|
| MCP + RAG | Context overflow (50KB → 5KB) | Indexing overhead (60s first run) |
| Semantic chunking | Maintain context integrity | Variable chunk sizes |
| Local embeddings | Cost, latency, privacy | Slightly lower quality than API |
| Architectural phase gating | AI skips phases (68% → 0%) | No phase parallelization |
| Evidence-based checkpoints | Quality gates bypassed | Explicit checkpoint design needed |
| Universal + generated standards | Language brittleness | One-time LLM generation call |
| Three-tier file architecture | Attention degradation | More files to maintain |
| File watcher for auto-reindex | Stale index | ~50ms filesystem overhead |

Performance Characteristics

RAG Engine

| Metric | Value | Target | Status |
|---|---|---|---|
| Query latency (p50) | 32ms | Under 50ms | ✅ |
| Query latency (p95) | 82ms | Under 100ms | ✅ |
| Index build time | 52s | Under 60s | ✅ |
| Memory footprint | ~200MB | Under 500MB | ✅ |
| Throughput | ~22 qps | Over 10 qps | ✅ |

Workflow Engine

| Metric | Value |
|---|---|
| State save latency | 3-8ms |
| State load latency | 2-5ms |
| Checkpoint validation | 1-2ms |
| Phase transition | Under 10ms |

Overall Impact

| Metric | Before | After | Improvement |
|---|---|---|---|
| Context per query | 50KB+ | 2-5KB | 90% reduction |
| Relevant content | 4% | 95% | 24x improvement |
| Messages needed | Baseline | -71% | Behavioral change |
| Cost per month | Baseline | -54% | Even with +59% model cost |
| Success rate | 70% | 92% | 31% improvement |
| Phase skipping rate | 68% | 0% | 100% elimination |

Key insight: Technical efficiency (90% context reduction per query) enables behavioral change (71% fewer messages), which drives cost reduction (54%). See Economics for full analysis.


Deployment Model

Per-Project Installation

Decision: Each project gets its own prAxIs OS copy.

Structure:

```
my-project/
├── .praxis-os/
│   ├── ouroboros/       # Complete server copy
│   ├── universal/       # Standard library
│   ├── standards/       # Generated for this project
│   ├── workflows/       # Available workflows
│   ├── specs/           # Project-specific specs
│   ├── state/           # Workflow state
│   └── .cache/          # Vector index
├── pyproject.toml       # Project dependencies
└── src/                 # Project code
```

Why per-project, not global:

  1. Version control: Different projects can use different prAxIs OS versions
  2. Customization: Tailor standards and workflows per project
  3. Isolation: No cross-contamination between projects
  4. Portability: Git clone includes prAxIs OS setup
  5. Team consistency: Everyone uses same version

Trade-offs:

  • Pro: Complete project isolation and customization
  • Con: ~50MB per project (mostly MCP server Python code)
  • Con: Updates must be run per project (mitigated by praxis_os_upgrade_v1 workflow)

Alternative: Global installation

  • Why not: Version conflicts, no per-project customization
  • Example: Project A needs old workflow, Project B needs new workflow → conflict