PerOXO
Author: Sidharth Singh (https://x.com/sid10singh)
Welcome to PerOXO, an actor-based, real-time messaging backend written in Rust, designed to handle high-concurrency WebSocket communication with strong guarantees around performance, safety, and scalability.
PerOXO is not a monolith. It is a microservices-driven system built around Tokio's async runtime, Axum WebSockets, gRPC, and message-passing concurrency, eliminating shared mutable state entirely.
At its core, PerOXO enables direct messaging and room-based chats, with durable persistence, asynchronous processing, and production-grade observability.
What Makes PerOXO Different?
Most real-time chat systems rely on locks, shared state, or heavy frameworks. PerOXO instead embraces:
- Pure actor-model architecture
- Lock-free concurrency
- Message-passing only (no shared mutable state)
- Separation of concerns via microservices
- Observability-first design
Each logical unit runs in its own async task, owns its own state, and communicates only through Tokio channels.
Why Actor Model Over Shared Mutable State
A real-time WebSocket server manages mutable state that multiple connections need to interact with: the online user registry, per-user outbound channel maps, and room membership maps. The naive approach wraps each of these in Arc<Mutex<T>> and shares them across all connection handler tasks. This creates several concrete problems at scale.
Lock contention scales with connection count: Every WebSocket handler task contends on the same lock, so with N concurrent connections, contention grows linearly with N. Under the Tokio multi-threaded runtime, worker threads stall waiting for a lock held by a task that may itself be descheduled -- the classic async-mutex priority-inversion problem.
Atomic operations constrain the optimizer: Even std::sync::Mutex in an uncontended case executes an atomic compare-and-swap (CAS) with a lock prefix on x86, which prevents the CPU from reordering surrounding loads and stores. The optimizer cannot fold, elide, or reorder any memory operations across the lock boundary. In a hot loop processing thousands of messages per second, these invisible constraints accumulate.
Deadlock risk under composition: When multiple locks must be acquired together (e.g. checking the user map while also modifying the room map), lock ordering discipline is required to prevent deadlocks. This discipline is enforced only by convention, not by the type system.
RefCell is not an option: Rc<RefCell<T>> is !Send, meaning it cannot exist inside a future that crosses an .await point on a multi-threaded runtime. Tokio's default #[tokio::main] is multi-threaded, so RefCell is a non-starter for any state that persists across awaits.
What This Looks Like In Practice
Here is how the user registry would look with shared mutable state:
```rust
struct SharedState {
    users: Mutex<HashMap<TenantUserId, mpsc::Sender<ChatMessage>>>,
    online_users: Mutex<Vec<TenantUserId>>,
    rooms: Mutex<HashMap<String, Mutex<HashMap<TenantUserId, mpsc::Sender<ChatMessage>>>>>,
}

async fn handle_connection(state: Arc<SharedState>, mut socket: WebSocket, user_id: TenantUserId) {
    let (tx, _rx) = mpsc::channel(100);
    {
        let mut users = state.users.lock().await;
        users.insert(user_id.clone(), tx);
    }
    {
        let mut online = state.online_users.lock().await;
        online.push(user_id.clone());
    }
    while let Some(Ok(Message::Text(text))) = socket.next().await {
        let msg: ChatMessage = serde_json::from_str(&text).unwrap();
        let users = state.users.lock().await; // lock acquired on every single message
        if let Some(recipient_tx) = users.get(&msg.to) {
            let _ = recipient_tx.try_send(msg);
        }
    }
}
```
Every message delivery locks the user map. With 10,000 concurrent connections, that is 10,000 tasks contending on a single Mutex<HashMap>. Room state requires nested locks -- a Mutex<HashMap<_, Mutex<HashMap>>> -- which is both an ergonomic nightmare and a deadlock risk.
In PerOXO, the MessageRouter owns all of this state privately:
```rust
pub struct MessageRouter {
    pub receiver: mpsc::UnboundedReceiver<RouterMessage>,
    pub users: HashMap<TenantUserId, mpsc::Sender<ChatMessage>>,
    pub online_users: Vec<TenantUserId>,
    pub rooms: HashMap<String, mpsc::UnboundedSender<RoomMessage>>,
}
```
No Arc, no Mutex, no RefCell. Plain HashMap and Vec with direct &mut self access. The sequential processing of the actor's mpsc receiver guarantees that self.users is never accessed concurrently -- mutual exclusion through sequential message processing rather than locks.
Core Architecture
PerOXO is designed using the Actor Model, structured around three mailbox actors and one shared service. Each actor isolates state and handles specific responsibilities to ensure high concurrency and fault tolerance.
1. MessageRouter Actor
Role: The Central Hub
The MessageRouter acts as the singleton orchestrator for the application. It processes RouterMessage variants sequentially from a single mpsc::UnboundedReceiver, managing the global state of the chat server and routing messages between sessions.
- State (all plain types, no locks):
  - `users: HashMap<TenantUserId, mpsc::Sender<ChatMessage>>` -- active connection registry.
  - `online_users: Vec<TenantUserId>` -- presence tracking.
  - `rooms: HashMap<String, mpsc::UnboundedSender<RoomMessage>>` -- room actor handles.
  - `persistence: Option<Arc<PersistenceService>>` -- shared persistence handle (feature-gated).
- Key Responsibilities:
  - Lifecycle Management: Handles `RegisterUser` (connect) and `UnregisterUser` (disconnect).
  - Routing:
    - Direct Messages: Looks up the recipient's `Sender<ChatMessage>` and calls `try_send` -- a direct `HashMap::get`, no lock acquisition.
    - Room Messages: Forwards to the appropriate `RoomActor` via its unbounded sender.
  - Room Lifecycle: Spawns new `RoomActor` tasks on first join and stores their sender handles.
  - Query Handling: Responds to presence, room member, pagination, and sync requests via embedded `oneshot` channels.
Registration is a direct insert on &mut self:
```rust
pub async fn handle_register_user(
    &mut self,
    tenant_user_id: TenantUserId,
    sender: mpsc::Sender<ChatMessage>,
    respond_to: oneshot::Sender<Result<(), String>>,
) {
    if self.users.contains_key(&tenant_user_id) {
        let _ = respond_to.send(Err("User already online".to_string()));
        return;
    }
    self.users.insert(tenant_user_id.clone(), sender);
    self.online_users.push(tenant_user_id.clone());
    let _ = respond_to.send(Ok(()));
}
```
No lock acquisition, no atomic CAS, no memory ordering constraints.
2. UserSession Actor
Role: Connection Handler
Scope: One actor per WebSocket connection
The UserSession represents a specific client connection. It wraps the raw WebSocket stream and bridges the gap between the client and the MessageRouter.
- Ownership:
  - Identity: Holds the `TenantUserId` (Project ID + User ID).
  - Transport: Owns the WebSocket stream (split into sink/stream).
  - Channels: A bounded `mpsc::Sender<ChatMessage>` (buffer 100) registered with the router, plus an `ack_sender`/`ack_receiver` pair for persistence acknowledgments.
- Async Event Loops (two `tokio::spawn` subtasks):
  - Send Loop: `tokio::select!` on `session_receiver` and `ack_receiver` -- serializes `ChatMessage` and `MessageAck` variants into WebSocket frames.
  - Receive Loop: Reads WebSocket text frames, deserializes JSON into `ChatMessage` variants, and dispatches them to the router via `RouterMessage` with embedded `oneshot` channels for ack flow.
- Lifecycle: On disconnect, sends `UnregisterUser` to the router and drops all owned resources. Registration uses a `oneshot` for the router's ack before accepting traffic.
3. RoomActor
Role: Room-scoped Message Hub
Scope: One actor per chat room, spawned on first join
The RoomActor is a per-room mailbox actor that isolates room membership and broadcasting from the central router. Each room gets its own async task with its own mpsc::UnboundedReceiver<RoomMessage>.
- State (plain types, no locks):
  - `members: HashMap<TenantUserId, mpsc::Sender<ChatMessage>>` -- direct handles to member sessions.
  - `persistence: Option<Arc<PersistenceService>>` -- shared persistence handle (feature-gated).
- Message Loop: A `tokio::select!` loop multiplexes message handling and periodic cleanup in the same task:
```rust
pub async fn run(mut self) {
    let mut cleanup_interval =
        tokio::time::interval(std::time::Duration::from_secs(60));
    loop {
        tokio::select! {
            message = self.receiver.recv() => {
                match message {
                    Some(msg) => self.handle_message(msg).await,
                    None => break,
                }
            }
            _ = cleanup_interval.tick() => {
                self.members.retain(|_, sender| !sender.is_closed());
            }
        }
    }
}
```
- Broadcasting: Iterates `self.members` and calls `try_send` on each sender -- no lock, no TOCTOU races. Closed or full channels are handled gracefully per member.
- Stale Cleanup: Every 60 seconds, `retain` drops members whose channels have closed, eliminating the need for explicit leave notifications on disconnect.
4. PersistenceService
Role: Data Reliability & Storage
Integration: gRPC → chat-service → ScyllaDB
The PersistenceService is not a mailbox actor. It is an Arc-wrapped shared service with async methods, invoked from spawned tasks by the MessageRouter and RoomActor. This design was an intentional simplification -- persistence does not need its own mailbox and sequential processing because each write is independent.
```rust
pub struct PersistenceService {
    pub chat_service_client: ChatServiceClient<Channel>,
}
```
- How It Decouples I/O: When the router or a room actor needs to persist a message, it `tokio::spawn`s an async task that calls `PersistenceService` methods. The actor's main loop never blocks on database latency.
- Reliability Logic:
  - Retry Mechanism: Write helpers implement retry logic for failed gRPC calls.
  - Failure Isolation: If the database is down, live message delivery (RAM-to-RAM via channels) continues uninterrupted. Persistence acknowledgments report `Failed` status through `oneshot` channels.
- Ack Flow: Spawned persistence tasks send `MessageAckResponse` back through a `oneshot` sender, which the `UserSession` forwards to the client as a `MessageAck` WebSocket frame.
- Philosophy: "Fire and forget" on the hot path, "eventually consistent" for storage.
Where Arc Is Still Used (And Why That's Fine)
Arc appears in PerOXO in two categories, neither of which involves mutable shared state.
Sharing immutable handles across Axum handlers. PerOxoState wraps ConnectionManager in Arc and passes it as Axum State. ConnectionManager is immutable after construction -- it holds only an mpsc::UnboundedSender, which is Clone + Send + Sync and internally handles its own synchronization. The Arc here is reference counting to share this handle across handler futures. Cost: one atomic increment per handler invocation, negligible compared to the syscalls involved in accepting a WebSocket connection.
Sharing read-only database resources. In chat-service, the Scylla Session and prepared Queries are initialized once at startup and never mutated. Arc distributes them across gRPC handler tasks. No Mutex wrapping, no contention.
The PersistenceService itself is wrapped in Arc and shared between the MessageRouter and spawned persistence tasks, but it holds only a gRPC client handle -- no mutable fields, no synchronization needed.
Microservices Ecosystem
| Service | Responsibility | Tech |
|---|---|---|
| per-oxo | WebSocket gateway, metrics endpoint | Axum, Tokio, Prometheus |
| chat-service | Chat persistence (DMs, rooms, pagination) | gRPC, ScyllaDB, tonic |
| auth-service | Authentication, tenant management, Google OAuth | Axum, gRPC, PostgreSQL, Redis |
| rabbit-consumer | Async message processing | RabbitMQ, ScyllaDB |
| prometheus | Metrics collection | Prometheus |
| grafana | Metrics visualization & dashboards | Grafana |
| redis | Token storage, rate limiting | Redis |
| scylladb | Message persistence | ScyllaDB |
Observability
PerOXO ships with production-grade observability built in.
Prometheus metrics are exposed at /metrics on the gateway. The metrics.rs module registers counters, gauges, and histograms covering:
- `http_requests_total` / `http_request_duration_seconds` -- HTTP request counts and latency by method, path, and status.
- `websocket_connections_active` -- real-time gauge of open WebSocket connections.
- `websocket_messages_total` -- counters by direction (sent, received, persisted).
- `chat_messages_total` / `chat_message_processing_duration_seconds` -- chat-level processing metrics by message type.
- `grpc_requests_total` / `grpc_request_duration_seconds` -- gRPC call tracking by service, method, and status.
- `db_query_duration_seconds` -- database query latency by operation.
An Axum middleware layer instruments every HTTP request automatically. A pre-configured Grafana dashboard (grafana_dashboard.json) is included in the repository, and both Prometheus and Grafana run as services in docker-compose.yaml.
Structured logging via tracing and tracing-subscriber is used across all services, with environment-controlled log levels (RUST_LOG).
End-to-End Message Flow
The following sequence diagram illustrates how a single WebSocket message flows through the system, from client input through routing, broadcasting, and persistence, and back to connected users.
[Sequence diagram: end-to-end message flow]
Multi-Tenancy Model & Identity Design
PerOXO implements multi-tenancy through a layered authentication and authorization system that ensures strict tenant isolation at the protocol level.
1. Tenant Isolation Strategy
PerOXO uses a dual-identifier approach to guarantee isolation between different organizations or projects hosting on the platform.
- Tenant Identity: Defined by a `project_id` (format: `peroxo_pj_{nanoid}`) and a `secret_api_key`.
- User Identity: Users are identified not just by their ID, but by the combination of their `project_id` and `user_id`.
This structure ensures that user ID 123 in Project A is a completely separate identity from user ID 123 in Project B.
2. How Tokens Encode Tenant Context
Identity is not inferred; it is explicitly encoded into the authentication token itself.
- Token Format: Tokens are opaque strings with a pxtok_ prefix followed by 24 alphanumeric characters.
- Payload (Redis): The token resolves to a JSON payload in Redis that carries the full tenant context:
```rust
pub struct UserToken {
    pub project_id: String, // Tenant identifier
    pub user_id: String,    // User within tenant
    pub expires_at: u64,    // Time-limited access
}
```
This ensures that every authenticated session carries an immutable tenant binding.
3. Tenant Provisioning via Google OAuth
Tenant creation is gated behind Google OAuth. The GET /generate-tenant endpoint requires a valid Google ID token in the Authorization: Bearer header. The auth-service validates the token against Google's tokeninfo endpoint, verifying the aud claim matches the configured GOOGLE_CLIENT_ID.
A Redis-backed rate limiter restricts tenant generation to 3 requests per Google sub (subject) per hour, using a fixed-window strategy keyed on rl:gen_tenant:{sub}. This prevents abuse while keeping the provisioning flow stateless on the application side.
4. Preventing Cross-Tenant Message Leakage
Cross-tenant access is prevented through multiple validation checkpoints before a user can ever send or receive a message:
- Token Generation Guard: The Auth Service validates the `project_id` against the `secret_api_key` stored in PostgreSQL before issuing a token. If the keys do not match, no token is created.
- WebSocket Handshake Gate: Before a WebSocket connection is upgraded, the handler extracts the token and calls the Auth Service via gRPC (`VerifyUserToken`) to verify it.
- Strict Verification: The system verifies that the token exists, is valid, and has not expired. Only then is the `UserToken` object (containing the `project_id`) returned to the gateway.
5. Router-Level Enforcement
Multi-tenancy is enforced before the traffic reaches the routing logic.
- The MessageRouter operates on sessions that have already been strictly authenticated.
- By the time a RegisterUser message reaches the router, the system has already guaranteed that the connection belongs to a valid tenant.
- This separation of concerns allows the Router to focus on high-performance message passing without performing redundant database lookups for every packet.
6. Why This Matters
This design positions PerOXO as enterprise-grade collaboration infrastructure rather than a simple chat server:
- Enterprise Isolation: Each tenant operates in a secure silo, with separation enforced by key-validated tokens at every checkpoint.
- Scalable Multi-Tenancy: A single service instance can serve multiple organizations safely.
- Audit Trail: All actions are traceable to specific project_id + user_id combinations.
- Security: Short-lived tokens (600 seconds) minimize the credential exposure window.