Overview
Effective monitoring is crucial for a real-time trading platform like Opinix Trade. This guide covers monitoring strategies, tools, and best practices.
Monitoring helps detect issues before they impact users and provides insights for optimization.
Key Metrics to Monitor
Application Metrics
Request rate & latency
Error rates
WebSocket connections
Queue depth & processing time
Business Metrics
Active orders
Trade volume
User activity
Order matching latency
Infrastructure Metrics
CPU & memory usage
Database connections
Redis memory
Network throughput
User Experience
Page load time
Time to interactive
WebSocket latency
Error rates by endpoint
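Before these metrics reach a backend like Prometheus or Datadog, they can be accumulated in-process. A minimal sketch of a latency/error recorder, with illustrative names that are not part of the Opinix Trade codebase:

```typescript
// Tracks request latency and error rate so the metrics above can be
// logged or exported on an interval.
class MetricsRecorder {
  private latencies: number[] = [];
  private errors = 0;
  private total = 0;

  // Record one request: latency in ms and whether it failed
  record(latencyMs: number, isError: boolean): void {
    this.latencies.push(latencyMs);
    this.total += 1;
    if (isError) this.errors += 1;
  }

  // Error rate as a fraction of all recorded requests
  errorRate(): number {
    return this.total === 0 ? 0 : this.errors / this.total;
  }

  // p-th percentile latency (nearest-rank method)
  percentile(p: number): number {
    if (this.latencies.length === 0) return 0;
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, rank)];
  }
}
```

One recorder per endpoint (or per label set) is enough to report request rate, p95 latency, and error rate per endpoint.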
Logging Strategy
Opinix Trade includes a shared @opinix/logger package used across services.
Logger Configuration
Based on its dependencies, the logger package appears to be built on Winston:
```typescript
import { logger } from '@opinix/logger';

// Log levels: error, warn, info, http, verbose, debug
logger.info('Order placed', {
  orderId: '123',
  userId: 'user456',
  eventId: 'event789'
});

logger.error('Failed to process order', {
  error: err.message,
  stack: err.stack,
  orderId: '123'
});
```
Best Practices

Structured Logging
Always use structured logging with context:

```typescript
// Good - Structured
logger.info('Order matched', {
  orderId: order.id,
  price: order.price,
  quantity: order.quantity,
  side: order.side,
  eventId: order.eventId,
  userId: order.userId,
  matchTime: Date.now()
});

// Bad - Unstructured
logger.info(`Order ${order.id} matched at ${order.price}`);
```
Benefits:
Easier to search and filter
Better for log aggregation
Machine-readable
Enables analytics
Log Levels

Use appropriate log levels:

```typescript
// ERROR - Critical issues requiring immediate attention
logger.error('Database connection failed', { error });

// WARN - Issues that should be investigated
logger.warn('Queue depth exceeds threshold', { depth: 1000 });

// INFO - Important business events
logger.info('Order placed', { orderId, eventId });

// HTTP - HTTP requests (logged automatically by the morgan middleware)

// DEBUG - Detailed diagnostic information
logger.debug('Processing order', { order });
```
Context Injection

Add request context to all logs:

```typescript
// Express middleware
app.use((req, res, next) => {
  req.logger = logger.child({
    requestId: req.id,
    userId: req.user?.id,
    ip: req.ip
  });
  next();
});

// Use in routes
app.post('/order', (req, res) => {
  req.logger.info('Order received', { order: req.body });
});
```
Sentry - Error Tracking

Sentry provides real-time error tracking and performance monitoring.

Installation:

```shell
npm install @sentry/node @sentry/nextjs
```
Server Setup (Express):

```typescript
import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 1.0,
  integrations: [
    new Sentry.Integrations.Http({ tracing: true }),
    new Sentry.Integrations.Express({ app })
  ]
});

// Request handler must be first
app.use(Sentry.Handlers.requestHandler());
app.use(Sentry.Handlers.tracingHandler());

// Routes...

// Error handler must be last
app.use(Sentry.Handlers.errorHandler());
```
Client Setup (Next.js):

```typescript
// sentry.client.config.js
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: 1.0,
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0
});
```
New Relic - APM

Comprehensive monitoring for Node.js applications.

Installation:

```shell
npm install newrelic
```

Configuration:

```typescript
// newrelic.js
exports.config = {
  app_name: ['Opinix Trade Server'],
  license_key: process.env.NEW_RELIC_LICENSE_KEY,
  distributed_tracing: { enabled: true },
  logging: { level: 'info' }
};
```
Usage:

```typescript
// Must be the first import
import 'newrelic';
import express from 'express';
// ... rest of imports
```
Datadog - Infrastructure & APM
Full-stack monitoring with infrastructure metrics.

Installation:

```shell
npm install dd-trace
```

Setup:

```typescript
// Must be imported first
import tracer from 'dd-trace';

tracer.init({
  service: 'opinix-trade-server',
  env: process.env.NODE_ENV,
  logInjection: true
});
```
Infrastructure Monitoring
Database Monitoring

PostgreSQL Metrics
Monitor key PostgreSQL metrics:

```sql
-- Active connections
SELECT count(*) FROM pg_stat_activity;

-- Long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '1 minute';

-- Database size
SELECT pg_size_pretty(pg_database_size('repo'));

-- Table sizes
SELECT tablename, pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename))
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;

-- Cache hit ratio (should be > 90%)
SELECT sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_ratio
FROM pg_statio_user_tables;
```
Key Metrics:
Connection pool usage
Query execution time
Cache hit ratio
Slow queries
Lock contention
Replication lag
Prisma Monitoring

Add Prisma query logging:

```typescript
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient({
  log: [
    { emit: 'event', level: 'query' },
    { emit: 'event', level: 'error' },
    { emit: 'event', level: 'warn' }
  ]
});

prisma.$on('query', (e) => {
  logger.debug('Query executed', {
    query: e.query,
    params: e.params,
    duration: e.duration,
    target: e.target
  });

  // Alert on slow queries
  if (e.duration > 1000) {
    logger.warn('Slow query detected', {
      query: e.query,
      duration: e.duration
    });
  }
});

prisma.$on('error', (e) => {
  logger.error('Prisma error', { error: e });
});
```
Redis Monitoring

Redis Metrics
Monitor Redis with the INFO command:

```shell
# Connect to Redis
redis-cli INFO

# Key metrics
redis-cli INFO stats
redis-cli INFO memory
redis-cli INFO clients

# Monitor commands in real time
redis-cli MONITOR

# Check queue depth (BullMQ)
redis-cli LLEN bull:order-queue:wait
redis-cli LLEN bull:order-queue:active
# Failed jobs are stored in a sorted set, so use ZCARD
redis-cli ZCARD bull:order-queue:failed
```
Key Metrics:
Memory usage
Connected clients
Operations per second
Cache hit ratio
Evicted keys
Queue depths
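Several of these metrics can also be read programmatically by parsing the INFO reply (for example, the string returned by ioredis's `info()` command). A minimal sketch, with illustrative helper names, deriving the cache hit ratio from the `keyspace_hits` and `keyspace_misses` fields Redis reports:

```typescript
// Parse `redis-cli INFO` output into a field map. Section headers
// (lines starting with '#') and blank lines are skipped.
function parseRedisInfo(info: string): Map<string, string> {
  const fields = new Map<string, string>();
  for (const line of info.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '' || trimmed.startsWith('#')) continue;
    const idx = trimmed.indexOf(':');
    if (idx > 0) {
      fields.set(trimmed.slice(0, idx), trimmed.slice(idx + 1));
    }
  }
  return fields;
}

// Cache hit ratio = hits / (hits + misses)
function cacheHitRatio(fields: Map<string, string>): number {
  const hits = Number(fields.get('keyspace_hits') ?? 0);
  const misses = Number(fields.get('keyspace_misses') ?? 0);
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}
```

The same parsed map gives `used_memory`, `connected_clients`, and `evicted_keys` for the other metrics listed above.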
BullMQ Monitoring

Monitor BullMQ queues:

```typescript
import { Queue } from 'bullmq';

const queue = new Queue('order-queue', {
  connection: { host: 'localhost', port: 6379 }
});

// Get queue metrics
async function getQueueMetrics() {
  const [waiting, active, completed, failed, delayed] = await Promise.all([
    queue.getWaitingCount(),
    queue.getActiveCount(),
    queue.getCompletedCount(),
    queue.getFailedCount(),
    queue.getDelayedCount()
  ]);

  logger.info('Queue metrics', { waiting, active, completed, failed, delayed });

  // Alert on high queue depth
  if (waiting > 100) {
    logger.warn('High queue depth', { waiting });
  }

  return { waiting, active, completed, failed, delayed };
}

// Run every minute
setInterval(getQueueMetrics, 60000);
```
Real-time Monitoring Dashboard
Grafana Setup
Install Grafana
```shell
docker run -d -p 3001:3000 --name=grafana grafana/grafana
```
Add Data Sources
Configure Prometheus, PostgreSQL, and Redis as data sources in Grafana.
Import Dashboards
Use pre-built dashboards:
Node.js Application Metrics
PostgreSQL Database
Redis
NGINX (if using as reverse proxy)
Create Custom Panels
Monitor Opinix Trade specific metrics:
Active orders by event
Order matching latency
WebSocket connections
Trade volume
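If Prometheus is one of the Grafana data sources, these custom metrics need to be exposed in the Prometheus text exposition format from a `/metrics` endpoint. In practice a library such as prom-client would generate this; the hand-rolled formatter below is only a sketch of the format, and the metric names are illustrative:

```typescript
// Render gauge metrics in the Prometheus text exposition format.
interface Gauge {
  name: string;
  help: string;
  value: number;
  labels?: Record<string, string>;
}

function renderPrometheus(gauges: Gauge[]): string {
  const lines: string[] = [];
  for (const g of gauges) {
    lines.push(`# HELP ${g.name} ${g.help}`);
    lines.push(`# TYPE ${g.name} gauge`);
    // Labels become {key="value",...} after the metric name
    const labels = g.labels
      ? '{' + Object.entries(g.labels).map(([k, v]) => `${k}="${v}"`).join(',') + '}'
      : '';
    lines.push(`${g.name}${labels} ${g.value}`);
  }
  return lines.join('\n') + '\n';
}
```

Serving this string with content type `text/plain` from an Express route is enough for Prometheus to scrape it.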
Health Check Endpoints
Implement health check endpoints for all services:
Server Health Check
```typescript
// apps/server/src/routes/health.ts
import { Router } from 'express';
import { PrismaClient } from '@prisma/client';
import Redis from 'ioredis';

const router = Router();
const prisma = new PrismaClient();
const redis = new Redis(process.env.REDIS_URI);

router.get('/health', async (req, res) => {
  const checks = {
    status: 'ok',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: 'unknown',
      redis: 'unknown'
    }
  };

  // Check database
  try {
    await prisma.$queryRaw`SELECT 1`;
    checks.checks.database = 'ok';
  } catch (error) {
    checks.checks.database = 'error';
    checks.status = 'degraded';
  }

  // Check Redis
  try {
    await redis.ping();
    checks.checks.redis = 'ok';
  } catch (error) {
    checks.checks.redis = 'error';
    checks.status = 'degraded';
  }

  const statusCode = checks.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(checks);
});

export default router;
```
Liveness & Readiness

```typescript
// Kubernetes-style probes

// Liveness - Is the service running?
router.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness - Can the service accept traffic?
router.get('/health/ready', async (req, res) => {
  try {
    // Check critical dependencies
    await Promise.all([
      prisma.$queryRaw`SELECT 1`,
      redis.ping()
    ]);
    res.status(200).json({ status: 'ready' });
  } catch (error) {
    res.status(503).json({ status: 'not ready', error: error.message });
  }
});
```
Alerting
Alert Rules
Immediate attention required:
Service down (health check failing)
Database connection pool exhausted
Error rate > 5%
Redis out of memory
Queue depth > 1000 orders
Order matching latency > 1s
Should be investigated:
CPU usage > 80%
Memory usage > 85%
Slow queries (> 1s)
Queue depth > 500
WebSocket connection errors
Failed background jobs
Informational:
Deployment completed
High trade volume
New event created
Scheduled maintenance
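Thresholds like the ones above can be encoded as data and evaluated against current metric readings, so the routing logic lives in one place. A minimal sketch; the rule set is an illustrative subset and the names are not from the codebase:

```typescript
type Severity = 'critical' | 'warning' | 'info';

interface AlertRule {
  metric: string;
  threshold: number;
  severity: Severity;
}

// A few of the thresholds above, encoded as rules
const rules: AlertRule[] = [
  { metric: 'error_rate_pct', threshold: 5, severity: 'critical' },
  { metric: 'queue_depth', threshold: 1000, severity: 'critical' },
  { metric: 'queue_depth', threshold: 500, severity: 'warning' },
  { metric: 'cpu_pct', threshold: 80, severity: 'warning' }
];

// Return fired rules, most severe first, at most one per metric
function evaluate(metrics: Record<string, number>, ruleset: AlertRule[]): AlertRule[] {
  const order: Severity[] = ['critical', 'warning', 'info'];
  const fired = ruleset
    .filter((r) => (metrics[r.metric] ?? 0) > r.threshold)
    .sort((a, b) => order.indexOf(a.severity) - order.indexOf(b.severity));

  // Keep only the most severe fired rule for each metric
  const seen = new Set<string>();
  const result: AlertRule[] = [];
  for (const r of fired) {
    if (!seen.has(r.metric)) {
      seen.add(r.metric);
      result.push(r);
    }
  }
  return result;
}
```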
Alert Channels
Slack: Send alerts to Slack channels:
#alerts-critical
#alerts-warning
#monitoring
Email: Notifications for critical issues to the on-call team.
PagerDuty: Incident management and on-call scheduling.
Discord: Community notifications for status updates.
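A common pattern is posting alerts to a Slack incoming webhook, routing by severity to the channels above. The payload builder below is a sketch (the `Alert` shape is hypothetical); sending it is a plain HTTP POST of the JSON body to a webhook URL taken from configuration:

```typescript
interface Alert {
  severity: 'critical' | 'warning' | 'info';
  title: string;
  details: Record<string, unknown>;
}

// Build a Slack incoming-webhook payload for an alert
function slackPayload(alert: Alert): { channel: string; text: string } {
  // Route by severity to the channels listed above
  const channel =
    alert.severity === 'critical' ? '#alerts-critical'
    : alert.severity === 'warning' ? '#alerts-warning'
    : '#monitoring';

  const detailLines = Object.entries(alert.details)
    .map(([k, v]) => `- ${k}: ${JSON.stringify(v)}`)
    .join('\n');

  return {
    channel,
    text: `[${alert.severity.toUpperCase()}] ${alert.title}\n${detailLines}`
  };
}
```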
Performance Optimization

Database
Query Optimization:

```typescript
// Bad - N+1 query
const events = await prisma.event.findMany();
for (const event of events) {
  const participants = await prisma.user.findMany({
    where: { events: { some: { id: event.id } } }
  });
}

// Good - Single query with include
const events = await prisma.event.findMany({
  include: { participants: true }
});
```
Indexing:

```sql
-- Add indexes for frequently queried fields
CREATE INDEX idx_event_status ON "Event" (status);
CREATE INDEX idx_event_slug ON "Event" (slug);
CREATE INDEX idx_user_phone ON "User" ("phoneNumber");
```
Connection Pooling:

```typescript
const prisma = new PrismaClient({
  datasources: {
    db: {
      url: process.env.DATABASE_URL + '?connection_limit=10'
    }
  }
});
```
Caching

Implement a caching strategy:

```typescript
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URI);
const CACHE_TTL = 60; // seconds

async function getEventWithCache(eventId: string) {
  // Check cache first
  const cached = await redis.get(`event:${eventId}`);
  if (cached) {
    return JSON.parse(cached);
  }

  // Query database
  const event = await prisma.event.findUnique({
    where: { id: eventId },
    include: { participants: true }
  });

  // Store in cache
  await redis.setex(`event:${eventId}`, CACHE_TTL, JSON.stringify(event));

  return event;
}

// Invalidate cache on update
async function updateEvent(eventId: string, data: any) {
  const event = await prisma.event.update({
    where: { id: eventId },
    data
  });

  // Invalidate cache
  await redis.del(`event:${eventId}`);

  return event;
}
```
Load Testing

Test performance with load-testing tools:

```shell
# Install k6
brew install k6  # macOS
# or download from k6.io

# Create load test
cat > load-test.js << 'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 20 },
    { duration: '1m', target: 50 },
    { duration: '30s', target: 0 },
  ],
};

export default function () {
  const res = http.get('http://localhost:3001/api/events');
  check(res, { 'status was 200': (r) => r.status === 200 });
  sleep(1);
}
EOF

# Run load test
k6 run load-test.js
```
Next Steps