Using Phi-3.5 in Our Infrastructure
This guide covers how we run Phi-3.5 in our infrastructure: the available servers, per-project requirements, model configuration, and deployment practices.
Available Servers
GPU Server (arcana-gpu)
Hardware specifications:
- RTX 3060 with 12GB VRAM
- Previous-generation CPU
- 64GB RAM
Primary use: TON Arcana project deployment
CPU Server (mercury)
Hardware specifications:
- New generation CPU
- 64GB RAM
- No GPU
Primary use: Mercury project deployment
Project-Specific Requirements
TON Arcana Requirements
- Features: AI-powered Tarot readings
- Performance: Real-time responses (under 3 seconds)
- Load: Supports multiple concurrent users
- Personality: Creative, consistent outputs
- Context Window: Medium (2-3K tokens)
Mercury Requirements
- Features: Market analysis and reports
- Performance: High analytical accuracy with structured outputs
- Batch Processing: Capability required
- Context Window: Large (4K+ tokens)
Model Details
Core Architecture
- Type: Transformer-based small language model (SLM)
- Variants: Mini, MoE, and Vision (multimodal)
- Parameters:
  - Mini: 3.8B parameters
  - MoE: 42B parameters (activates 6.6B during inference)
- Languages: Over 20 languages supported
- Context Window: Up to 128K tokens (our deployments use far less: 2-3K for TON Arcana, 4K+ for Mercury)
Memory Usage
- Mini Variant: Approximately 8GB for inference
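- MoE Variant: All 42B parameters must be held in memory even though only 6.6B are active per token, so the weights alone need roughly 84GB at FP16 or 42GB at INT8 (does not fit in the RTX 3060's 12GB VRAM)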
- CPU Usage: The mini variant fits comfortably in system RAM; LoRA is available if fine-tuning is needed
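The memory figures above can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count multiplied by the bytes stored per parameter. The sketch below is a weights-only estimate under that assumption; the KV cache and runtime overhead add more on top, and the function name is illustrative rather than part of any library.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    # Parameter counts taken from the Core Architecture list.
    for name, params in [("mini", 3.8), ("MoE", 42.0)]:
        for precision in ("fp16", "int8"):
            gb = weight_memory_gb(params, precision)
            print(f"{name} {precision}: ~{gb:.1f} GB")
```

For a mixture-of-experts model the total parameter count is the one that matters for memory, since every expert has to be loaded even though only some are active per token.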
Quantization Support
- Available options: FP16, INT8
- Recommended: INT8 for CPU deployments
Key Characteristics
Advantages:
- Efficient inference with reduced memory usage
- High versatility across tasks
- Extended context capabilities
Limitations:
- Limited GPU scalability for larger models
- Requires fine-tuning for niche applications
Hardware Performance
RTX 3060 Performance
- Handles mini variant efficiently
- Peak VRAM usage: Approximately 10GB
- Concurrent sessions: 8-10 with batching
CPU Server Performance
- Efficient with INT8 quantization
- Lower throughput than GPU
- Ideal for Mercury's batch processing
Memory Patterns
GPU Server:
- Steady VRAM usage: Approximately 8GB for mini
- Temperature: Around 70°C under load
- Power draw: Approximately 180W
CPU Server:
- RAM usage: Peaks at 15GB, averages 12GB
- Temperature: Below 70°C
- Concurrent requests: 4 with batching
Implementation Guide
TON Arcana Setup
Configuration:
- Use Phi-3.5-mini variant
- Run the model at FP16 (half precision)
- Optimize for real-time responses
Example tarot reading prompt:
You are a wise Tarot reader. Interpret the following card spread in a mystical and insightful tone: [Card Details].
Implementation example:
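// Requires the official client: npm install ollama
const { Ollama } = require('ollama');

// Connect to the local Ollama server (default port 11434).
const client = new Ollama({ host: 'http://localhost:11434' });

async function getTarotReading(cards) {
  const prompt = `Interpret these cards: ${cards}`;
  // The model is named per request; 'phi3.5' is the Ollama tag for Phi-3.5-mini.
  const response = await client.generate({ model: 'phi3.5', prompt });
  // Without streaming, the generated text is in `response.response`.
  return response.response;
}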
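To tie the TON Arcana requirements to concrete request settings, here is a minimal sketch of a request body for Ollama's POST /api/generate endpoint. The `system`, `stream`, and `options` fields (`num_ctx`, `temperature`) are part of that API; the specific values, the `phi3.5` model tag, and the function name are assumptions chosen to match the 2-3K context requirement, not tuned settings.

```python
import json

# System prompt taken from the example tarot reading prompt.
SYSTEM_PROMPT = (
    "You are a wise Tarot reader. Interpret the following card spread "
    "in a mystical and insightful tone."
)


def build_tarot_request(cards: list[str], model: str = "phi3.5") -> dict:
    """Build a JSON-serializable body for Ollama's POST /api/generate."""
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": "Card spread: " + ", ".join(cards),
        "stream": False,  # return one complete response instead of chunks
        "options": {
            "num_ctx": 3072,     # assumed: covers the 2-3K token requirement
            "temperature": 0.8,  # assumed: allows some creativity, not tuned
        },
    }


if __name__ == "__main__":
    body = build_tarot_request(["The Fool", "The Tower", "The Star"])
    print(json.dumps(body, indent=2))
```

Keeping the persona in the system prompt and only the cards in the user prompt is one way to keep the reader's tone consistent across requests.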
Mercury Setup
Configuration:
- Use Phi-3.5-mini or MoE variant
- Enable INT8 quantization
- Optimize for batch processing
Example market analysis prompt:
Analyze the following market data and provide actionable insights: [Market Data].
Implementation example:
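# Requires the official client: pip install ollama
from ollama import Client

# Connect to the local Ollama server (default port 11434).
client = Client(host='http://localhost:11434')

def analyze_market(data):
    prompt = f"Analyze this market data: {data}"
    # 'phi3.5' is the Ollama tag for Phi-3.5-mini.
    response = client.generate(model='phi3.5', prompt=prompt)
    return response['response']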
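Mercury's batch processing requirement could be met by fanning several datasets out to the server at once. The sketch below assumes a limit of 4 in-flight requests (the CPU server figure quoted in this guide) and takes the model call as a `generate` callable, so it does not depend on any particular client; `analyze_batch` and `PROMPT_TEMPLATE` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Prompt text taken from the example market analysis prompt.
PROMPT_TEMPLATE = (
    "Analyze the following market data and provide actionable insights: {data}"
)


def analyze_batch(
    datasets: list[str],
    generate: Callable[[str], str],
    max_concurrent: int = 4,
) -> list[str]:
    """Run one analysis per dataset with at most max_concurrent in flight.

    Results are returned in the same order as the input datasets.
    """
    prompts = [PROMPT_TEMPLATE.format(data=d) for d in datasets]
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map preserves input order even if calls finish out of order.
        return list(pool.map(generate, prompts))


if __name__ == "__main__":
    # Stand-in for a real model call, used only to show the call shape.
    fake_generate = lambda prompt: f"report for: {prompt[-10:]}"
    print(analyze_batch(["Q1 revenue", "Q2 revenue"], fake_generate))
```

Threads are a reasonable fit here because each call spends its time waiting on the server rather than computing locally.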
Performance Metrics
Response Times
- GPU Server: Under 2.5 seconds per request
- CPU Server: 4-5 seconds per request
Token Processing
- GPU: Approximately 100 tokens per second
- CPU: Approximately 50 tokens per second
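These throughput numbers translate directly into a response-length budget: the number of tokens that can be generated within a time limit is roughly throughput multiplied by the time available. The sketch below works that out; the `overhead_seconds` parameter (prompt processing, network) is an assumption and would need to be measured.

```python
def max_output_tokens(
    tokens_per_second: float,
    budget_seconds: float,
    overhead_seconds: float = 0.0,
) -> int:
    """Tokens that can be generated inside a latency budget."""
    usable = max(budget_seconds - overhead_seconds, 0.0)
    return int(tokens_per_second * usable)


if __name__ == "__main__":
    # GPU at ~100 tokens/s against TON Arcana's 3-second target,
    # with an assumed 0.5 s of fixed overhead.
    print(max_output_tokens(100, 3.0, overhead_seconds=0.5))
    # CPU at ~50 tokens/s with a 5-second request time.
    print(max_output_tokens(50, 5.0))
```

Capping the generation length at the resulting figure is one way to keep real-time responses inside the target.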
Resource Usage
- GPU Memory: Stable at 8GB
- CPU Memory: 12-15GB
- Concurrent Requests: Up to 10 (GPU) / Up to 4 (CPU)
Model Comparison
Context Window Size
- Phi-3.5: 128K tokens (mini)
- Llama-3.1: 128K tokens
- Mistral-7B: 32K tokens (v0.2 and later; 8K in v0.1)
Performance Profile
- Phi-3.5: Balanced performance
- Llama-3.1: High compute needs
- Mistral-7B: Moderate requirements
Ease of Integration
- Phi-3.5: Simple API interface
- Llama-3.1: Standard complexity
- Mistral-7B: Standard complexity
Deployment Instructions
Initial Setup
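curl -fsSL https://ollama.com/install.sh | sh
ollama serve &        # start the server (the Linux installer may already run it as a service)
ollama pull phi3.5    # download the Phi-3.5-mini weights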
Resource Planning
- GPU Server: Reserve 8GB VRAM for steady-state use, with headroom for peaks of approximately 10GB
- CPU Server: Allocate 4 cores minimum
Monitoring
- Use Prometheus for metrics
- Track: VRAM usage, response times, concurrent requests
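As an illustration of what Prometheus would scrape, the sketch below renders the three tracked values in the Prometheus text exposition format. The metric names are hypothetical, and in practice the official `prometheus_client` package would normally generate and serve this output; the hand-rolled version only shows the shape of the data.

```python
def render_metrics(
    vram_used_bytes: int,
    concurrent_requests: int,
    last_response_seconds: float,
) -> str:
    """Render gauges in the Prometheus text exposition format."""
    # Metric names below are hypothetical, chosen for this example.
    metrics = [
        ("llm_vram_used_bytes", "GPU memory currently in use.", vram_used_bytes),
        ("llm_concurrent_requests", "Requests currently in flight.", concurrent_requests),
        ("llm_last_response_seconds", "Duration of the most recent request.", last_response_seconds),
    ]
    lines = []
    for name, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics(8_000_000_000, 3, 2.1), end="")
```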
Failover Strategy
- Implement request retry logic
- Use CPU server as backup
- Cache frequent responses
- Maximum retry attempts: 3
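The failover steps above can be sketched as a single function: check the cache, try the primary server up to three times, then fall back to the backup. The `primary` and `backup` callables, the in-memory dict cache, and the exponential backoff delay are all assumptions made for illustration; a production version would likely catch narrower exceptions and bound the cache size.

```python
import time
from typing import Callable


def generate_with_failover(
    prompt: str,
    primary: Callable[[str], str],
    backup: Callable[[str], str],
    cache: dict[str, str],
    max_retries: int = 3,
    backoff_seconds: float = 0.5,
) -> str:
    """Return a cached answer, else try primary with retries, else backup."""
    if prompt in cache:
        return cache[prompt]

    for attempt in range(max_retries):
        try:
            result = primary(prompt)
            break
        except Exception:
            # Wait before the next attempt, doubling the delay each time.
            if attempt < max_retries - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
    else:
        # Every primary attempt failed: use the backup (CPU) server.
        result = backup(prompt)

    cache[prompt] = result
    return result
```

A call might look like `generate_with_failover(prompt, call_gpu, call_cpu, cache)`, where `call_gpu` and `call_cpu` are hypothetical wrappers around the two servers.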
Best Practices
- Always preload models before peak usage
- Monitor VRAM usage continuously
- Implement proper error handling
- Use batching for multiple requests
- Cache common responses
- Review performance metrics regularly
- Set appropriate timeout limits (30 seconds maximum)
- Implement graceful degradation under high load
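Preloading can be done through Ollama's HTTP API: a POST to /api/generate that names a model but carries no prompt loads the model into memory, and `keep_alive` controls how long it stays resident. The sketch below builds such a request with the standard library; the host, the `phi3.5` tag, and the one-hour `keep_alive` are assumptions, and the 30-second timeout mirrors the limit listed above.

```python
import json
import urllib.request


def build_preload_request(
    model: str = "phi3.5",
    host: str = "http://localhost:11434",
    keep_alive: str = "1h",
) -> urllib.request.Request:
    """Build a request that asks Ollama to load a model without generating."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(timeout_seconds: float = 30.0) -> int:
    """Send the preload request and return the HTTP status code."""
    request = build_preload_request()
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.status
```

Running `preload()` from a scheduled job shortly before expected peak traffic is one way to avoid paying the model load time on the first user request.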
This configuration is intended to give both TON Arcana and Mercury reliable, efficient performance on the hardware described above.
