Using Phi-3.5 in Our Infrastructure

January 1, 2025 · 3 min read

Architect

This guide covers running Phi-3.5 on our infrastructure: the available servers, per-project requirements, model configuration, and deployment best practices.

Available Servers

GPU Server (arcana-gpu)

Hardware specifications:

RTX 3060 with 12GB VRAM
Previous-generation CPU
64GB RAM

Primary use: TON Arcana project deployment

CPU Server (mercury)

Hardware specifications:

New generation CPU
64GB RAM
No GPU

Primary use: Mercury project deployment

Project-Specific Requirements

TON Arcana Requirements

Features: AI-powered Tarot readings
Performance: Real-time responses (under 3 seconds)
Load: Supports multiple concurrent users
Personality: Creative, consistent outputs
Context Window: Medium (2-3K tokens)

Mercury Requirements

Features: Market analysis and reports
Performance: High analytical accuracy with structured outputs
Batch Processing: Capability required
Context Window: Large (4K+ tokens)

Model Details

Core Architecture

Type: Transformer-based small language model (SLM)
Variants: Mini, MoE, and Vision (multimodal)
Parameters:
- Mini: 3.8B parameters
- MoE: 42B parameters (activates 6.6B during inference)
Languages: Over 20 languages supported
Context Window: Up to 128K tokens (our projects use 2-4K tokens, well within this limit)

Memory Usage

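Mini Variant: Approximately 8GB for inference at FP16
MoE Variant: All 42B parameters must be loaded even though only 6.6B are active per token, so the weights alone need roughly 84GB at FP16 or 42GB at INT8 (does not fit in the RTX 3060's 12GB VRAM)
CPU Usage: Memory-efficient; LoRA is an option when fine-tuning is needed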

Quantization Support

Available options: FP16, INT8
Recommended: INT8 for CPU deployments
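
The memory figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough estimate under that assumption; it covers the weights only and ignores the KV cache and runtime overhead, which come on top. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


# Parameter counts are taken from the Core Architecture section.
for name, params in [("mini", 3.8), ("MoE", 42.0)]:
    for label, bits in [("FP16", 16), ("INT8", 8)]:
        print(f"{name} {label}: ~{weight_memory_gb(params, bits):.1f} GB")
```

For the mini variant this gives about 7.6 GB at FP16 and 3.8 GB at INT8. For MoE the full 42B parameters count, not only the 6.6B active ones, because every expert has to be resident in memory.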

Key Characteristics

Advantages:

Efficient inference with reduced memory usage
High versatility across tasks
Extended context capabilities

Limitations:

Limited GPU scalability for larger models
Requires fine-tuning for niche applications

Hardware Performance

RTX 3060 Performance

Handles mini variant efficiently
Peak VRAM usage: Approximately 10GB
Concurrent sessions: 8-10 with batching

CPU Server Performance

Efficient with INT8 quantization
Lower throughput than GPU
Ideal for Mercury's batch processing

Memory Patterns

GPU Server:

Steady VRAM usage: Approximately 8GB for mini
Temperature: Around 70°C under load
Power draw: Approximately 180W

CPU Server:

RAM usage: Peaks at 15GB, averages 12GB
Temperature: Below 70°C
Concurrent requests: 4 with batching

Implementation Guide

TON Arcana Setup

Configuration:

Use Phi-3.5-mini variant
Use FP16 precision
Optimize for real-time responses

Example tarot reading prompt:

You are a wise Tarot reader. Interpret the following card spread in a mystical and insightful tone: [Card Details].

Implementation example:

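// Uses the official `ollama` npm package (npm install ollama).
const { Ollama } = require('ollama');

// Connect to the local Ollama server on its default port.
const client = new Ollama({ host: 'http://localhost:11434' });

async function getTarotReading(cards) {
  const prompt =
    'You are a wise Tarot reader. Interpret the following card spread ' +
    `in a mystical and insightful tone: ${cards}`;
  // Assumes the model was pulled under the Ollama tag `phi3.5` (Phi-3.5-mini).
  const response = await client.generate({ model: 'phi3.5', prompt });
  // The generated text is returned in the `response` field.
  return response.response;
}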

Mercury Setup

Configuration:

Use Phi-3.5-mini or MoE variant
Enable INT8 quantization
Optimize for batch processing

Example market analysis prompt:

Analyze the following market data and provide actionable insights: [Market Data].

Implementation example:

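# Uses the official `ollama` Python package (pip install ollama).
from ollama import Client

# Connect to the local Ollama server on its default port.
client = Client(host='http://localhost:11434')

def analyze_market(data):
    prompt = (
        "Analyze the following market data and provide actionable insights: "
        f"{data}"
    )
    # Assumes the model was pulled under the Ollama tag `phi3.5` (Phi-3.5-mini).
    response = client.generate(model='phi3.5', prompt=prompt)
    return response['response']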
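
Mercury's batch requirement can be met by running several analyses in parallel while capping the number in flight. The following is a minimal sketch, assuming a cap of 4 to match the CPU server's concurrent-request figure in this guide; `analyze_batch` is a hypothetical helper, and the stand-in `fake_analyze` only exists so the example runs without a model server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def analyze_batch(
    datasets: Iterable[str],
    analyze: Callable[[str], str],
    max_workers: int = 4,  # assumption: matches the CPU server's limit of 4
) -> List[str]:
    """Run `analyze` on every dataset with at most `max_workers` in flight.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, datasets))


if __name__ == "__main__":
    # Stand-in for a real model call, so the sketch runs offline.
    def fake_analyze(data: str) -> str:
        return f"report for {data}"

    print(analyze_batch(["Q1 prices", "Q2 prices"], fake_analyze))
```

In a real deployment the `analyze` argument would be the `analyze_market` function from the example above.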

Performance Metrics

Response Times

GPU Server: Under 2.5 seconds per request
CPU Server: 4-5 seconds per request

Token Processing

GPU: Approximately 100 tokens per second
CPU: Approximately 50 tokens per second
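
Throughput and latency targets together bound how long a response can be. As a rough sketch, assuming a constant generation rate and treating prompt processing and network time as a single optional `overhead_s` value (both simplifications), the output budget can be estimated like this. `max_output_tokens` is an illustrative name.

```python
def max_output_tokens(
    latency_budget_s: float,
    tokens_per_second: float,
    overhead_s: float = 0.0,
) -> int:
    """Estimate how many tokens can be generated within a latency budget."""
    usable_s = max(latency_budget_s - overhead_s, 0.0)
    return int(usable_s * tokens_per_second)


# TON Arcana: 3-second target on the GPU server at ~100 tokens/s.
print(max_output_tokens(3.0, 100))       # 300
# Same target with half a second reserved for overhead.
print(max_output_tokens(3.0, 100, 0.5))  # 250
# CPU server: 5-second response at ~50 tokens/s.
print(max_output_tokens(5.0, 50))        # 250
```

Capping generation length at or below these values is one way to keep TON Arcana readings inside the real-time target.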

Resource Usage

GPU Memory: Stable at 8GB
CPU Memory: 12-15GB
Concurrent Requests: Up to 10 (GPU) / Up to 4 (CPU)

Model Comparison

Context Window Size

Phi-3.5: 128K tokens (mini)
Llama-3.1: 128K tokens
Mistral-7B: 8K tokens (v0.1) or 32K tokens (v0.2 and later)

Performance Profile

Phi-3.5: Balanced performance
Llama-3.1: High compute needs
Mistral-7B: Moderate requirements

Ease of Integration

Phi-3.5: Simple API interface
Llama-3.1: Standard complexity
Mistral-7B: Standard complexity

Deployment Instructions

Initial Setup

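curl -fsSL https://ollama.com/install.sh | sh
ollama serve &      # skip if the installer already started Ollama as a service
ollama pull phi3.5  # Phi-3.5-mini in the Ollama model library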

Resource Planning

GPU Server: Reserve 8GB VRAM
CPU Server: Allocate 4 cores minimum

Monitoring

Use Prometheus for metrics
Track: VRAM usage, response times, concurrent requests
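
Prometheus scrapes metrics in a plain-text exposition format. The sketch below renders the three tracked values in that format by hand, purely to show what an endpoint would return; in practice a Prometheus client library would normally do this. The metric names (`phi_vram_used_bytes` and so on) and the `server` label are assumptions, and every value is exposed as a gauge for simplicity.

```python
from typing import Dict, Union

Number = Union[int, float]


def render_metrics(gauges: Dict[str, Number], server: str) -> str:
    """Render gauge values in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(gauges.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f'{name}{{server="{server}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric names; values would come from real measurements.
    sample = {
        "phi_vram_used_bytes": 8_000_000_000,
        "phi_response_seconds": 2.5,
        "phi_concurrent_requests": 6,
    }
    print(render_metrics(sample, server="arcana-gpu"), end="")
```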

Failover Strategy

Implement request retry logic
Use CPU server as backup
Cache frequent responses
Maximum retry attempts: 3 times
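
The strategy above can be sketched as a small wrapper. This is one possible reading, not a definitive implementation: it assumes "3 times" means up to three attempts against the primary (GPU) server before a single attempt on the backup (CPU) server, and it uses an unbounded in-memory dictionary as the cache. `FailoverClient` and its parameters are hypothetical names; `primary` and `backup` are any callables that take a prompt and return text.

```python
import time
from typing import Callable, Dict


class FailoverClient:
    """Retry the primary backend, fall back to the backup, cache results."""

    def __init__(
        self,
        primary: Callable[[str], str],
        backup: Callable[[str], str],
        max_attempts: int = 3,       # from the guide: maximum of 3 attempts
        retry_delay_s: float = 0.5,  # assumption: fixed pause between attempts
    ) -> None:
        self.primary = primary
        self.backup = backup
        self.max_attempts = max_attempts
        self.retry_delay_s = retry_delay_s
        self._cache: Dict[str, str] = {}

    def generate(self, prompt: str) -> str:
        # Serve repeated prompts from the cache without touching a server.
        if prompt in self._cache:
            return self._cache[prompt]

        for attempt in range(self.max_attempts):
            try:
                result = self.primary(prompt)
                break
            except Exception:
                # Pause before the next attempt, but not after the last one.
                if attempt < self.max_attempts - 1:
                    time.sleep(self.retry_delay_s)
        else:
            # Every primary attempt failed; errors from the backup propagate.
            result = self.backup(prompt)

        self._cache[prompt] = result
        return result
```

A production version would likely bound the cache size and catch only connection or timeout errors rather than every exception.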

Best Practices

Always preload models before peak usage
Monitor VRAM usage continuously
Implement proper error handling
Use batching for multiple requests
Cache common responses
Review performance metrics regularly
Set appropriate timeout limits (30 seconds maximum)
Implement graceful degradation under high load
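
One way to degrade gracefully is to shed load once the server's concurrency limit is reached, returning a short fallback message instead of queueing requests until they time out. The sketch below assumes the limits from the Resource Usage section (10 on the GPU server, 4 on the CPU server); `LoadShedder` and the fallback text are illustrative, and the 30-second timeout would still need to be enforced inside the handler.

```python
import threading
from typing import Callable


class LoadShedder:
    """Reject requests immediately when all concurrency slots are taken."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(
        self,
        handler: Callable[[str], str],
        prompt: str,
        fallback: str = "The service is busy. Please try again shortly.",
    ) -> str:
        # Non-blocking acquire: if no slot is free, degrade instead of waiting.
        if not self._slots.acquire(blocking=False):
            return fallback
        try:
            return handler(prompt)
        finally:
            self._slots.release()


if __name__ == "__main__":
    gpu_limiter = LoadShedder(max_concurrent=10)  # GPU server limit
    print(gpu_limiter.run(lambda p: f"reading for {p}", "The Fool"))
```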

With this configuration, TON Arcana is served from the GPU server within its 3-second response target, Mercury runs its batch workloads on the CPU server, and the CPU server doubles as the failover target.
