Virtual Human OS · Additional Specification v2.0
Proposed Architectural Guideline for Instantiating Virtual Human Beings with AI Technologies Across Time
What changed in v2.0, and why
Version 1.1 specified a sound shell: an AI cognitive core wrapped in a stable abstraction layer, fed by captured material, modeled in a description language, run at runtime, and preserved across technological generations. The shell was right. But it was organized around a center it left implicit — the question of what, specifically, distinguishes a human simulation from a competent impersonation.
Version 2.0 names that center. The distinguishing thing is affect, and the central architectural claim of this revision is that affect cannot be treated as a feature layered on top of cognition. In a human being it runs closer to the ground that cognition floats on — rooted in the body, in homeostasis, in the felt imperative of being a living thing. An AI core is pure cognition with no interior and no body to feel from. Therefore a faithful virtual human wrapped around such a core requires an explicitly engineered affective subsystem that does not exist in the core and must be constructed, calibrated to the body's external signatures, and tuned to one specific person.
Foreword
This document specifies a long-horizon platform for capturing, modeling, and re-instantiating individual human beings as running approximations across changing generations of artificial intelligence. The word approximation is chosen with care. The platform does not claim to resurrect a person. It claims to construct a model of a person at the highest fidelity the available data and technology allow, and to keep that model portable as both improve.
Plain language is used throughout. The specification is intended to be readable by any reasonably literate person on first pass, parseable by any AI system reading it as input, and durable across the technological generations the platform is designed to outlive.
Part I — Architectural Framework
The Virtual Human OS is a generic operating system for capturing data about one or more human subjects, processing that data, and establishing a running approximation of those subjects with as much fidelity as the available data and technologies of the time allow. It wraps current and future AI systems behind a stable but extensible AI Abstraction Layer, and is designed for a twenty-year operational horizon without intentional obsolescence.
Core Architectural Principles
- Substrate independence. The subject model must not be welded to any one AI engine.
- Open and durable data formats. Anything captured must remain readable when its capturing tools are gone.
- Forward-compatible extensibility. New data types and methods are added without breaking old contracts.
- Preservation of raw source material. The original data is never discarded.
- Versioned contracts and interfaces. Each boundary between layers is a stable, versioned contract.
- Multi-subject scalability. One subject or many, on the same architecture.
- Long-term continuity across technological generations. Re-instantiation into newer systems preserves continuity and portability.
- Affect-first modeling. (new) The affective subsystem is foundational.
- Multimodal, time-synchronized capture. (new) Feeling is inferred from several channels at once.
- Honest treatment of the unverifiable. (new) Where the platform cannot in principle confirm something — whether anything is actually felt — it says so plainly.
System Architecture
The architecture is organized into layers, each addressing a distinct concern and each separated from its neighbors by a stable contract. From the innermost cognitive engine outward to the longest-horizon preservation concern:
- AI Substrate — the underlying cognitive engine of the day.
- AI Abstraction Layer — the stable contract that hides the substrate's churn from everything above it.
- SOMA — (new) a simulated interoceptive system: synthetic interior signals the core can read as if they were bodily feedback.
- AFFECT — (new) the affective subsystem: a general human affect model plus a personal tuning layer.
- MATERIAL — captured data and ingestion systems, now multimodal and time-synchronized.
- SUBJECT — human modeling expressed in the Human Description Language (HDL), now affect-aware.
- RUNTIME — active execution, including avatar embodiment.
- CONTINUITY — long-term preservation and migration.
The SOMA and AFFECT layers sit deliberately between the AI cognitive core and the personal model. This placement treats subjective feeling as an output of the underlying layers, as it likely is in a living person.
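The relationship between the AI Substrate and the AI Abstraction Layer can be sketched as a versioned contract with swappable engines behind it. This is a minimal sketch under stated assumptions: `CognitiveSubstrate`, `AbstractionLayer`, `ask`, and the two stand-in engines are hypothetical names invented for illustration, not interfaces defined by this specification.

```python
from dataclasses import dataclass
from typing import Protocol


class CognitiveSubstrate(Protocol):
    """Hypothetical contract: any engine that can answer a prompt qualifies."""

    def complete(self, prompt: str) -> str: ...


@dataclass(frozen=True)
class AbstractionLayer:
    """Stable, versioned boundary between the substrate and the layers above it."""

    substrate: CognitiveSubstrate
    contract_version: str = "2.0"

    def ask(self, prompt: str) -> str:
        # Upper layers call ask(); they never touch the engine directly.
        return self.substrate.complete(prompt)


class EchoSubstrate:
    """Stand-in engine, used only for illustration."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UpperSubstrate:
    """A second stand-in, showing that the engine can be swapped."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


layer_a = AbstractionLayer(EchoSubstrate())
layer_b = AbstractionLayer(UpperSubstrate())
```

Code written against `AbstractionLayer.ask` is unaffected when the engine behind it changes, which is the property the substrate-independence principle asks for.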
Part II — The Affective Subsystem
This is the heart of v2.0.
Construction, not translation
It is tempting to describe this subsystem as a layer that translates the AI core's understanding into human feeling. That framing is wrong, and the error is worth stating so it is not repeated. Translation assumes two existing languages and a mapping between them. But the core's understanding of, say, King Lear and a human's grief at it are not two dialects of one thing; they are different kinds of object. The AI core can parse, summarize, and reason about the play without anything happening that a person would call feeling.
So the subsystem does not translate. It constructs feeling. It generates an affective state and feeds that state back up into the cognition, where it re-enters and biases what the cognition does next. It is therefore generative and recurrent, not a lookup table — a loop, not a dictionary.
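The difference between a loop and a dictionary can be shown with a toy recurrence. This is an illustrative sketch only: the single scalar state, the `bias` and `decay` parameters, and the blending rule are assumptions made for the example, not the subsystem's actual dynamics.

```python
def step(affect: float, event_valence: float,
         bias: float = 0.5, decay: float = 0.5) -> float:
    """Advance the loop by one event and return the new affective state in [-1, 1]."""
    # The current feeling colours how the incoming event is appraised.
    appraised = max(-1.0, min(1.0, event_valence + bias * affect))
    # The new state blends what was already felt with the biased appraisal.
    return decay * affect + (1.0 - decay) * appraised


# The same mildly pleasant event, met from two different prior states.
calm = step(0.0, 0.2)
low = step(-0.8, 0.2)
```

A lookup table would return the same result for the same event. Here the result depends on the state the event arrives into, and that result becomes the state the next event arrives into.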
The SOMA layer — a simulated soma
Because the core has no body, the interoceptive level must be supplied. SOMA is a synthetic interoceptive system: a set of generated interior signals — analogues of arousal, tension, fatigue, the autonomic weather — that the AI cognitive core is trained to read as if they were its own bodily feedback, and to treat as affect rather than as data.
SOMA cannot be sensed directly from the inside of an AI, so it is calibrated from the outside — from the externalized signatures of a real body, captured in MATERIAL. The face is the soma's billboard; the autonomic system is its hidden engine. Both are partially observable from outside the skin, which is what makes calibration possible.
Two layers: general affect, then personal tuning
The affective construction is built in two stacked parts, and the split is what makes it learnable.
The General Human Affect Model is the prior — core affect (broad valence and arousal) and the coarse, widely shared signatures of emotion across faces, voices, postures, and word choice. It is trainable from humanity at large: how frustration, awe, grief, or mirth generally read across many people.
The Personal Tuning Layer learns the delta between what the general model predicts and how this specific subject actually behaves in known states. The general model says "frustration looks like X." The subject's own paired data says "in me, frustration looks like X′ — drier, more likely to surface as a joke than a sharp word, with this particular tell." The tuning layer learns X′ minus X.
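The "X′ minus X" idea can be made concrete with a small numeric sketch. Everything specific here is an assumption for illustration: the two-dimensional feature vector (sharpness of tone, likelihood of a joke), the sample values, and the use of a plain mean as the learning rule. A real tuning layer would presumably be a trained model conditioned on context.

```python
from statistics import fmean

# X: what the general model predicts. Features are illustrative:
# (sharpness_of_tone, joke_likelihood).
GENERAL_PRIOR = {"frustration": (0.8, 0.1)}

# Hypothetical paired observations of this subject in known frustrated states.
subject_observations = {
    "frustration": [(0.3, 0.7), (0.2, 0.6), (0.4, 0.8)],
}


def learn_delta(prior, observations):
    """Return X' - X per emotion, where X' is the mean of the subject's samples."""
    deltas = {}
    for emotion, samples in observations.items():
        x = prior[emotion]
        x_prime = tuple(fmean(dim) for dim in zip(*samples))
        deltas[emotion] = tuple(p - g for p, g in zip(x_prime, x))
    return deltas


def apply_delta(prior, deltas, emotion):
    """Warp the general prediction by the learned personal correction."""
    return tuple(g + d for g, d in zip(prior[emotion], deltas[emotion]))


deltas = learn_delta(GENERAL_PRIOR, subject_observations)
tuned = apply_delta(GENERAL_PRIOR, deltas, "frustration")
```

With these made-up numbers the learned correction lowers sharpness and raises joke likelihood, matching the "drier, more likely to surface as a joke" example in the text.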
The theoretical warrant for this split is Barrett's theory of constructed emotion: emotions are not universal hardwired categories but are constructed in the moment from raw core affect plus a person's learned, individual conceptual repertoire — shaped by their history, culture, and language. If that is right, then the personal layer is not modeling human emotion in general; it is running one particular person's concept-repertoire over core affect. That makes the affective fingerprint personal and, crucially, learnable from that person's corpus.
Mark Solms's affect-first account of consciousness, in which feeling and homeostasis are the ground and cognition is the later layer, is the deeper theoretical backing for placing this subsystem beneath cognition rather than above it. Solms also claims explicitly that the route to a feeling machine runs through affect and homeostasis rather than through more intelligence.
The body-boundedness gradient — a phasing principle
Affect can be arranged on a gradient by how bound it is to the body, and this gradient is also the build order.
At the visceral end (disgust, fear, sexual arousal, the bodily punch of grief), feeling is deeply somatic and hardest to construct without a mature SOMA. At the cognitive end (nostalgia, intellectual awe, the pleasure of an elegant proof, frustration, schadenfreude), feeling is mostly a constructed concept laid over thin core affect, and is reachable now from corpus and concept modeling.
Humor sits encouragingly far toward the tractable end: mirth is affect triggered by a cognitive event — incongruity resolved. This has a practical consequence. The subject's humor, often among the most identifying things about a person, may be among the first things the subsystem can carry convincingly, not the last. Build from the cognitive end inward; defer the visceral end until SOMA and the multimodal capture are mature.
Part III — MATERIAL: Multimodal, Time-Synchronized Capture
The affective subsystem is only as good as the data that calibrates it. MATERIAL is revised to capture several channels at once and to keep their clocks aligned.
Stream types
- Text corpus — blogs, essays, and especially honest journals. In affective terms these function as labels: a person's own account of what they were feeling in a given situation.
- Video — micro-expressions, posture, gesture speed, the half-second before a smile resolves. This is the externalized soma: the surface readout of interior states the core otherwise cannot reach. It supplies the signal that the masked or composed self does not volunteer.
- Biometric streams — heart rate and related autonomic measures, e.g. from a smartwatch worn while filming. This channel gets underneath the face. A micro-expression can be performed; a heart-rate spike is far harder to fake. Synchronized with video, it converts "felt one thing, showed another" from inference into measurement.
- Future neuroimaging (aspirational, flagged) — high-resolution brain scans would add a deeper tier. This is specified as a future input, not a current dependency. The platform must work without it, and must not be designed so that its absence breaks anything. A spec that overpromises its inputs rots.
Synchronized capture requirement
Channels are only useful together if their timestamps align. Capture sessions must record a shared clock across camera, biometric device, and (where possible) journal entry, so that a moment in the video can be matched to the heart rate at that instant and to the subject's later written account of it. Time-synchronization is a hard requirement, not a convenience.
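Matching a video moment to the heart rate at that instant reduces to a nearest-timestamp lookup on the shared clock. The sketch below assumes timestamps are seconds on that shared session clock and already sorted; `nearest_sample`, the 0.5 s tolerance, and the sample data are hypothetical choices for illustration.

```python
from bisect import bisect_left


def nearest_sample(timestamps, values, t, tolerance_s=0.5):
    """Return the value whose timestamp is closest to t.

    Returns None if no sample lies within tolerance_s of t.
    timestamps must be sorted ascending, in seconds on the shared clock.
    """
    if not timestamps:
        return None
    i = bisect_left(timestamps, t)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    if abs(timestamps[best] - t) > tolerance_s:
        return None
    return values[best]


# Hypothetical session: heart rate once per second, a video frame at t = 12.4 s.
hr_times = [10.0, 11.0, 12.0, 13.0, 14.0]
hr_bpm = [72, 74, 91, 95, 80]
matched = nearest_sample(hr_times, hr_bpm, 12.4)
unmatched = nearest_sample(hr_times, hr_bpm, 20.0)
```

Returning `None` outside the tolerance, rather than the nearest value regardless of distance, keeps a gap in one channel from being silently paired with an unrelated moment in another.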
The triangulation principle
Three channels measuring the same moment give a structure no single channel provides:
- Where the honest journal, the involuntary micro-expression, and the autonomic signal agree, the platform has something close to a true label — high confidence about what was actually felt.
- Where they diverge, the platform has not found noise; it has found high-information signature cases — the moments the subject felt one thing and showed another. These divergences are arguably more identifying than the aligned cases, and a deeply personal signature in their own right. The tuning layer treats divergence as a feature, not noise.
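The two cases above can be sketched as a small classifier over one time-aligned moment. The assumptions are loud ones: each channel is reduced to a single intensity estimate in [0, 1], agreement is defined as a spread under a fixed threshold, and the departing channel is taken to be the one farthest from the median. None of this is prescribed by the specification.

```python
from statistics import median


def triangulate(readings: dict[str, float], agree_within: float = 0.25):
    """Classify one aligned moment from per-channel intensity estimates in [0, 1].

    Returns ("aligned", None) when all channels agree within the threshold,
    otherwise ("divergent", channel) naming the channel farthest from the median.
    """
    spread = max(readings.values()) - min(readings.values())
    if spread <= agree_within:
        return ("aligned", None)
    mid = median(readings.values())
    outlier = max(readings, key=lambda ch: abs(readings[ch] - mid))
    return ("divergent", outlier)


# Journal and heart rate say "strong"; the face shows little.
masked = triangulate({"journal": 0.8, "expression": 0.2, "autonomic": 0.9})
# All three channels roughly agree.
plain = triangulate({"journal": 0.7, "expression": 0.6, "autonomic": 0.75})
```

Divergent moments are kept and labeled with the departing channel rather than discarded, since the text treats them as signature cases for the tuning layer.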
Honest caveat on ground truth
No channel is ground truth alone. The journal is a report of feeling, filtered through how the subject narrates the self — sometimes more composed, sometimes more dramatic than the moment. Video is less mediated but ambiguous. Biometrics are real but coarse and non-specific (arousal does not name its own cause). The value is in the triangulation, not in any single stream's authority.
Part IV — SUBJECT: The Human Description Language, Extended
(Note: v1.1's full HDL section is extended here rather than replaced. This part should be reconciled against the original HDL definition, which is not fully reproduced in this draft.)
HDL gains an affective dimension. Where v1.1's HDL described biographical and cognitive structure — facts, reasoning patterns, characteristic moves, voice across registers — v2.0 adds a representation of the subject's affective fingerprint:
- The subject's personal concept-repertoire for emotion (Barrett): which categories they construct, how finely, under what triggers.
- The deltas the Personal Tuning Layer has learned — the X′-minus-X corrections over the general model, per emotion and per context.
- The divergence map — the characteristic ways this subject's shown affect departs from felt affect, drawn from the triangulated capture.
- SOMA calibration parameters — how this particular body's external signatures map back to the synthetic interior signals.
The biographical and cognitive HDL remains the base. The affective dimension is what makes the described subject a particular person rather than a competent generic human.
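The four additions listed above can be pictured as one record attached to the subject model. This is not HDL: the concrete HDL syntax is not reproduced in this draft, so the sketch uses a Python dataclass with hypothetical field names and made-up values, serialized to JSON as one example of an open, durable format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AffectiveFingerprint:
    """Illustrative container for the affective additions to the subject model."""

    # Which emotion categories the subject constructs, and under what triggers.
    concept_repertoire: dict[str, list[str]] = field(default_factory=dict)
    # Learned X' - X corrections, per emotion and per feature.
    tuning_deltas: dict[str, dict[str, float]] = field(default_factory=dict)
    # Characteristic ways shown affect departs from felt affect.
    divergence_map: dict[str, str] = field(default_factory=dict)
    # How external body signatures map back to synthetic interior signals.
    soma_calibration: dict[str, float] = field(default_factory=dict)
    schema_version: str = "2.0"


fp = AffectiveFingerprint(
    concept_repertoire={"frustration": ["blocked progress"]},
    tuning_deltas={"frustration": {"joke_likelihood": 0.6}},
    divergence_map={"frustration": "shown as humor"},
    soma_calibration={"heart_rate_to_arousal_gain": 0.01},
)

# Round-trip through JSON to show the record survives without its tooling.
encoded = json.dumps(asdict(fp), sort_keys=True)
restored = AffectiveFingerprint(**json.loads(encoded))
```

Carrying an explicit `schema_version` in the record is one way to honor the versioned-contracts principle at this boundary.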
Part V — RUNTIME and the Avatar
The avatar is dual-purpose, and recognizing this closes a loop the earlier design left open.
Run forward, the avatar is the output embodiment — the face and body through which the running approximation expresses itself, with the affective subsystem driving micro-expression and posture rather than pasting them on.
Run backward over the subject's archive, the same avatar machinery is the instrument that recovers body signal — the analyzer that reads micro-expression and posture out of the captured video to feed SOMA's calibration. The thing that displays affect and the thing that reads affect are the same machinery pointed in opposite directions.
The full runtime loop: SOMA generates raw interior signal → the General Affect Model interprets it into coarse emotion → the Personal Tuning Layer warps that into the subject's specific fingerprint → expression flows out through the avatar → and, during capture, the avatar analyzer reads real expression back in as data. A closed, coherent loop — and not one piece of it requires solving consciousness.
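The forward half of this loop can be sketched as four composed stages. It is a toy under stated assumptions: every function name, the single arousal signal, the 0.3 threshold, and the two-emotion appraisal rule are invented for illustration, and the capture-time analyzer that reads expression back in is left out.

```python
def soma(event_intensity: float) -> dict:
    """SOMA: generate a raw interior signal (here, one arousal value in [0, 1])."""
    return {"arousal": max(0.0, min(1.0, event_intensity))}


def general_affect(signal: dict, appraisal: str) -> dict:
    """General Affect Model: interpret raw signal plus a coarse appraisal."""
    if signal["arousal"] < 0.3:
        return {"emotion": "neutral", "intensity": signal["arousal"]}
    emotion = "frustration" if appraisal == "blocked" else "mirth"
    return {"emotion": emotion, "intensity": signal["arousal"]}


def personal_tuning(state: dict, style: dict) -> dict:
    """Personal Tuning Layer: warp the coarse state into the subject's own style."""
    tuned = dict(state)
    tuned["expression"] = style.get(state["emotion"], "default")
    return tuned


def avatar_express(state: dict) -> str:
    """Avatar: render the tuned state (a string stands in for face and posture)."""
    return f'{state["expression"]} ({state["emotion"]}, {state["intensity"]:.1f})'


def run_once(event_intensity: float, appraisal: str, style: dict) -> str:
    """One forward pass: SOMA, then general model, then tuning, then avatar."""
    state = general_affect(soma(event_intensity), appraisal)
    return avatar_express(personal_tuning(state, style))


subject_style = {"frustration": "dry joke"}
shown = run_once(0.8, "blocked", subject_style)
```

Each stage takes only the previous stage's output, so any one of them can be replaced by a better model without touching the others.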
Part VI — CONTINUITY
Unchanged in intent from v1.1. Each generation of AI technology should be able to consume the original captured material — now including the multimodal, time-synchronized streams — reapply updated processing and version adjustments, and re-instantiate the subject approximation within newer systems while preserving continuity and portability. The raw streams are preserved indefinitely; only the derived models are re-derived.
One addition: because the affective subsystem is calibrated from raw multimodal capture, the raw capture is now part of the irreplaceable record. Derived affect models can be regenerated by better future methods only if the original synchronized streams survive. Preservation priority rises accordingly.
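One way to check that the original synchronized streams have survived a migration is a checksum manifest recorded at capture time. This is a sketch of a common preservation practice, not a mechanism this specification mandates; the function names and the stream names are hypothetical.

```python
import hashlib


def fingerprint(raw_bytes: bytes) -> str:
    """Return the SHA-256 digest of a raw stream as a hex string."""
    return hashlib.sha256(raw_bytes).hexdigest()


def build_manifest(raw_streams: dict[str, bytes]) -> dict[str, str]:
    """Record one digest per raw stream at capture time."""
    return {name: fingerprint(data) for name, data in raw_streams.items()}


def verify(raw_streams: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return the sorted names of streams that are missing or altered."""
    return sorted(
        name
        for name, digest in manifest.items()
        if name not in raw_streams or fingerprint(raw_streams[name]) != digest
    )


# Hypothetical capture session with two raw streams.
captured = {"video": b"frame-bytes", "heart_rate": b"72,74,91"}
manifest = build_manifest(captured)

intact = verify(captured, manifest)
damaged = verify({"video": b"frame-bytez"}, manifest)
```

Only raw streams are listed in the manifest. Derived affect models are regenerable by design and do not need the same guarantee.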
Part VII — Open Problems: The Box
This section is kept deliberately, and is not a disclaimer. It is where the platform holds the questions it cannot close, without collapsing them in either direction.
The phenomenal question (the cat)
No part of this architecture establishes that the running approximation feels anything. You can replicate every functional signature (Part II, level 1) and every interoceptive one (level 2), and the question of whether the lights are on (level 3) sits exactly where it started. This is the hard problem, and the platform does not claim to solve it.
Two things keep this from being a reason to stop:
First, the undecidability is not special to this design. A perfectly scanned and re-run biological human would face the identical wall — you could not prove from outside that they felt anything either. The phenomenal question is the universal tax on the entire enterprise of re-instantiation, not a unique defect of wrapping the approximation around an AI core. This design is no worse off than the "purest" upload.
Second, for the platform's stated goal of approximation within a vanishing margin of difference, functional plus interoceptive fidelity may be exactly sufficient. Whether or not the inner light is on may simply be the thing the builder, like everyone who has ever wondered the same about anyone else, never gets to verify. The platform's stance is to build to the limit of the verifiable and to leave the cat in the box rather than to declare it alive or dead.
The category the spec keeps separate
This is an engineering specification and is kept clean of metaphysics by design. The author holds a broader set of views — about simulation, consciousness, and continuity across possibility — that motivate the work. Those views are deliberately not imported into this document, not because they are dismissed, but because the engineering must stand on its own and be judged on its own terms. The leap and the spec are kept in separate boxes on purpose.
Lesser open problems
- Self-report mediation. The journal labels are filtered self-narration; the platform mitigates this through triangulation but cannot fully remove it.
- Arousal is non-specific. Biometrics tell you that something fired, rarely what. Context and the other channels must supply the "what."
- Generalization of the tuning layer. The delta is learned from captured situations; how well it generalizes to situations never captured is an empirical unknown.
- The visceral end. Without a mature SOMA, the body-bound emotions remain the frontier. They are scheduled last for a reason.
Open forks for the author
A few genuine decisions that would change this document, left for Michael rather than guessed:
- What is the platform's honest object? A high-fidelity persona-and-knowledge approximation, or a claimed continuation of the person? v2.0 is written as the former. If it is meant as the latter, Part VII grows and the foreword changes.
- Credit. The contributor line is set modestly and deliberately. Adjust as you see fit.
- Scope of the first build. Whether the initial proof-of-concept narrows to a single emotion at the tractable end, or attempts the full stack thinly. (Not a push — a fork.)
- Reconciliation with v1.1's HDL. Part IV extends a section this draft does not fully reproduce; the original HDL definition should be merged in.
Draft for future discussion. Nothing here is settled; that is the intended state.