
Using Phi-3.5 in Our Infrastructure

Max Kaido
Architect

This guide covers how we run Phi-3.5 in our infrastructure: the available servers, per-project requirements, model configuration, and deployment practices.

Available Servers

GPU Server (arcana-gpu)

Hardware specifications:

  • RTX 3060 with 12GB VRAM
  • Previous-generation CPU
  • 64GB RAM

Primary use: TON Arcana project deployment

CPU Server (mercury)

Hardware specifications:

  • New generation CPU
  • 64GB RAM
  • No GPU

Primary use: Mercury project deployment

Project-Specific Requirements

TON Arcana Requirements

  • Features: AI-powered Tarot readings
  • Performance: Real-time responses (under 3 seconds)
  • Load: Supports multiple concurrent users
  • Personality: Creative, consistent outputs
  • Context Window: Medium (2-3K tokens)

Mercury Requirements

  • Features: Market analysis and reports
  • Performance: High analytical accuracy with structured outputs
  • Batch Processing: Capability required
  • Context Window: Large (4K+ tokens)
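The two requirement lists can be captured as a small configuration structure, which makes it easier to route requests and check limits in code. This is only a sketch: the `ProjectProfile` class, the field names, and the project keys are hypothetical, and the token figures are assumptions (the upper end of "2-3K" for TON Arcana, 4096 as a lower bound for Mercury's "4K+").

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProjectProfile:
    server: str                            # host that serves this project
    context_tokens: int                    # assumed context budget in tokens
    max_latency_seconds: Optional[float]   # None means no real-time target
    batch: bool                            # whether batch processing is required


# Values taken from the requirement lists; exact numbers are assumptions.
PROFILES = {
    "ton-arcana": ProjectProfile("arcana-gpu", 3000, 3.0, False),
    "mercury": ProjectProfile("mercury", 4096, None, True),
}


def fits_context(project: str, token_count: int) -> bool:
    """Return True if token_count stays within the project's context budget."""
    return token_count <= PROFILES[project].context_tokens
```

A caller could use `fits_context("ton-arcana", n)` to reject or trim an oversized prompt before it reaches the model.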

Model Details

Core Architecture

  • Type: Transformer-based small language model (SLM)
  • Variants: Mini, MoE, and Vision (multimodal)
  • Parameters:
    • Mini: 3.8B parameters
    • MoE: 42B parameters (activates 6.6B during inference)
  • Languages: Over 20 languages supported
  • Context Window: Up to 128K tokens (our deployments use far less: 2-3K for TON Arcana, 4K+ for Mercury)

Memory Usage

  • Mini Variant: Approximately 8GB for inference
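  • MoE Variant: All 42B parameters must be loaded even though only 6.6B are active per token: roughly 84GB at FP16 or 42GB at INT8, so it does not fit the RTX 3060's 12GB VRAM
  • CPU Usage: Memory-efficient for inference; LoRA is available if fine-tuning is needed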
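As a rough cross-check on these figures, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes 2 bytes per parameter for FP16 and 1 byte for INT8, and it ignores the KV cache and runtime overhead, so real usage will be somewhat higher. The function name is illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Estimate weight-only memory in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * bytes / 1e9 simplifies to this product.
    return params_billions * BYTES_PER_PARAM[precision]


for name, params in [("mini", 3.8), ("MoE", 42.0)]:
    for precision in ("fp16", "int8"):
        estimate = weight_memory_gb(params, precision)
        print(f"{name} {precision}: ~{estimate:.1f} GB")
```

The mini estimate at FP16 (about 7.6 GB) is in line with the roughly 8GB listed above. The MoE estimate uses the full 42B parameters, since every expert has to be resident in memory regardless of how many are active per token.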

Quantization Support

  • Available options: FP16 (half precision), INT8 (8-bit quantization)
  • Recommended: INT8 for CPU deployments

Key Characteristics

Advantages:

  • Efficient inference with reduced memory usage
  • High versatility across tasks
  • Extended context capabilities

Limitations:

  • Limited GPU scalability for larger models
  • Requires fine-tuning for niche applications

Hardware Performance

RTX 3060 Performance

  • Handles mini variant efficiently
  • Peak VRAM usage: Approximately 10GB
  • Concurrent sessions: 8-10 with batching

CPU Server Performance

  • Efficient with INT8 quantization
  • Lower throughput than GPU
  • Ideal for Mercury's batch processing

Memory Patterns

GPU Server:

  • Steady VRAM usage: Approximately 8GB for mini
  • Temperature: Around 70°C under load
  • Power draw: Approximately 180W

CPU Server:

  • RAM usage: Peaks at 15GB, averages 12GB
  • Temperature: Below 70°C
  • Concurrent requests: 4 with batching

Implementation Guide

TON Arcana Setup

Configuration:

  • Use Phi-3.5-mini variant
  • Run at FP16 precision
  • Optimize for real-time responses

Example tarot reading prompt:

You are a wise Tarot reader. Interpret the following card spread in a mystical and insightful tone: [Card Details].
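One way to fill the `[Card Details]` placeholder is a small template function. This is a minimal sketch: the function name and the comma-separated card format are assumptions, not part of the original setup.

```python
TAROT_TEMPLATE = (
    "You are a wise Tarot reader. Interpret the following card spread "
    "in a mystical and insightful tone: {card_details}."
)


def build_tarot_prompt(cards: list[str]) -> str:
    """Insert the drawn cards into the prompt template."""
    if not cards:
        raise ValueError("at least one card is required")
    return TAROT_TEMPLATE.format(card_details=", ".join(cards))


print(build_tarot_prompt(["The Fool", "The Tower", "The Star"]))
```

Keeping the instruction text fixed and varying only the card list helps with the consistent tone the project requires.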

Implementation example:

const { Ollama } = require('ollama');

// Official Ollama JavaScript client, pointed at the local server.
const client = new Ollama({ host: 'http://localhost:11434' });

async function getTarotReading(cards) {
  const prompt = `Interpret these cards: ${cards}`;
  // 'phi3.5' is the Ollama library tag for Phi-3.5-mini.
  const response = await client.generate({ model: 'phi3.5', prompt });
  return response.response; // generated text
}

Mercury Setup

Configuration:

  • Use Phi-3.5-mini or MoE variant
  • Enable INT8 quantization
  • Optimize for batch processing

Example market analysis prompt:

Analyze the following market data and provide actionable insights: [Market Data].

Implementation example:

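from ollama import Client

# Official Ollama Python client, pointed at the local server.
client = Client(host='http://localhost:11434')

def analyze_market(data):
    prompt = f"Analyze this market data: {data}"
    # 'phi3.5' is the Ollama library tag for Phi-3.5-mini.
    response = client.generate(model='phi3.5', prompt=prompt)
    return response['response']  # generated text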
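Since Mercury needs batch processing, a single-request function like `analyze_market` would typically be wrapped in a bounded worker pool. The sketch below assumes a limit of 4 parallel requests, taken from the CPU server's concurrency figure. `analyze_batch` and `fake_analyze` are hypothetical names, and the stand-in function replaces the real model call so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_CONCURRENT = 4  # assumed limit, from the CPU server's concurrent-request figure


def analyze_batch(datasets: list[str], analyze: Callable[[str], str]) -> list[str]:
    """Run analyze() over every dataset, at most MAX_CONCURRENT at a time.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        return list(pool.map(analyze, datasets))


def fake_analyze(data: str) -> str:
    """Stand-in for a real model call."""
    return f"analysis of {data}"


reports = analyze_batch(["dataset A", "dataset B", "dataset C"], fake_analyze)
print(reports)
```

In a real run, `analyze_market` would be passed in place of `fake_analyze`. Threads are a reasonable fit here because each call spends its time waiting on the HTTP response.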

Performance Metrics

Response Times

  • GPU Server: Under 2.5 seconds per request
  • CPU Server: 4-5 seconds per request

Token Processing

  • GPU: Approximately 100 tokens per second
  • CPU: Approximately 50 tokens per second
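These throughput figures can be turned into a rough output-length budget. The sketch below assumes the whole response time is spent generating tokens at the steady rate, which ignores prompt processing and network latency, so treat the result as an upper bound. The function name is illustrative.

```python
def max_tokens_within(tokens_per_second: float, seconds: float) -> int:
    """Upper bound on tokens generated in the given time at a steady rate."""
    return int(tokens_per_second * seconds)


gpu_budget = max_tokens_within(100, 2.5)  # GPU: 100 tokens/s, 2.5 s
cpu_budget = max_tokens_within(50, 4.0)   # CPU: 50 tokens/s, 4 s
print(gpu_budget, cpu_budget)  # 250 200
```

Capping the response length near these budgets is one way to keep TON Arcana under its 3-second target.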

Resource Usage

  • GPU Memory: Stable at 8GB
  • CPU Memory: 12-15GB
  • Concurrent Requests: Up to 10 (GPU) / Up to 4 (CPU)

Model Comparison

Context Window Size

  • Phi-3.5: 128K tokens (mini)
  • Llama-3.1: 128K tokens
  • Mistral-7B: 8K tokens (v0.1), 32K tokens (v0.2 and later)

Performance Profile

  • Phi-3.5: Balanced performance
  • Llama-3.1: High compute needs
  • Mistral-7B: Moderate requirements

Ease of Integration

  • Phi-3.5: Simple API interface
  • Llama-3.1: Standard complexity
  • Mistral-7B: Standard complexity

Deployment Instructions

Initial Setup

curl -fsSL https://ollama.com/install.sh | sh
ollama serve &   # skip if the installer already started the Ollama service
ollama pull phi3.5

Resource Planning

  • GPU Server: Reserve 8GB VRAM
  • CPU Server: Allocate 4 cores minimum

Monitoring

  • Use Prometheus for metrics
  • Track: VRAM usage, response times, concurrent requests
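The tracked values have to be exposed in a form Prometheus can scrape. In practice the official `prometheus_client` package is probably the better choice. The standard-library sketch below only illustrates what the text exposition format looks like for the three metrics. The metric names and the function signature are hypothetical.

```python
def render_metrics(vram_bytes: float, response_seconds: float, in_flight: int) -> str:
    """Render three gauges in the Prometheus text exposition format."""
    gauges = [
        ("phi_vram_used_bytes", "GPU memory currently in use.", vram_bytes),
        ("phi_response_seconds", "Duration of the most recent request.", response_seconds),
        ("phi_requests_in_flight", "Requests currently being processed.", in_flight),
    ]
    lines = []
    for name, help_text, value in gauges:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_metrics(8e9, 2.1, 3))
```

Serving this string from a `/metrics` endpoint would let a Prometheus server collect the values on its scrape interval.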

Failover Strategy

  • Implement request retry logic
  • Use CPU server as backup
  • Cache frequent responses
  • Maximum retry attempts: 3 times
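The failover points above can be combined into one wrapper. This is a sketch under stated assumptions: "3 times" is read as three attempts against the primary (GPU) server before falling back to the CPU server, the cache is a plain in-memory dict keyed by prompt, and the function and parameter names are hypothetical.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 3  # assumed reading of "maximum retry attempts: 3"


def generate_with_failover(
    prompt: str,
    primary: Callable[[str], str],
    backup: Callable[[str], str],
    cache: dict,
    attempts: int = MAX_ATTEMPTS,
    backoff_seconds: float = 0.5,
) -> str:
    """Serve from cache, else try primary up to `attempts` times, else use backup."""
    if prompt in cache:
        return cache[prompt]
    for attempt in range(attempts):
        try:
            result = primary(prompt)
            break
        except Exception:
            # Wait a little longer after each failure, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(backoff_seconds * (attempt + 1))
    else:
        # Every primary attempt failed: fall back to the backup server.
        result = backup(prompt)
    cache[prompt] = result
    return result
```

Here `primary` and `backup` would be thin wrappers around calls to the GPU and CPU servers. An error raised by the backup is left to propagate so the caller can degrade gracefully.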

Best Practices

  1. Always preload models before peak usage
  2. Monitor VRAM usage continuously
  3. Implement proper error handling
  4. Use batching for multiple requests
  5. Cache common responses
  6. Monitor performance regularly
  7. Set appropriate timeout limits (30 seconds maximum)
  8. Implement graceful degradation under high load
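Items 7 and 8 can be sketched together with `asyncio`: a semaphore caps the number of requests in flight, excess requests are rejected immediately instead of queueing, and each accepted request gets a hard timeout. The limit of 10 is taken from the GPU concurrency figure. `Overloaded` and `guarded_call` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable

MAX_IN_FLIGHT = 10    # assumed cap, from the GPU concurrent-request figure
TIMEOUT_SECONDS = 30  # maximum time allowed per request


class Overloaded(Exception):
    """Raised when the server is at capacity and the request is shed."""


async def guarded_call(
    semaphore: asyncio.Semaphore,
    make_request: Callable[[], Awaitable[str]],
    timeout: float = TIMEOUT_SECONDS,
) -> str:
    """Run make_request() under a concurrency cap and a timeout."""
    if semaphore.locked():
        # No free slot: fail fast so the caller can show a fallback response.
        raise Overloaded("server at capacity, try again shortly")
    async with semaphore:
        return await asyncio.wait_for(make_request(), timeout)


async def demo() -> str:
    async def fake_request() -> str:  # stand-in for a real model call
        await asyncio.sleep(0.01)
        return "reading"

    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await guarded_call(semaphore, fake_request)


print(asyncio.run(demo()))
```

Rejecting early keeps latency predictable for the requests that are accepted. A caller can catch `Overloaded` and return a cached or simplified response instead.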

This configuration is intended to keep both TON Arcana and Mercury within their performance targets while staying reliable and efficient.