Market Comparison Service Implementation Plan

March 11, 2025 · 12 min read

Architect

########### Claude ############

Current Limitations

The current MarketRankingService has evolved over time through various experiments, resulting in a complex implementation with multiple comparison strategies and supporting methods. This complexity makes it difficult to maintain and extend the service, and it also introduces inconsistencies in the comparison results.

Some key issues with the current implementation:

The service name (MarketRankingService) doesn't accurately reflect its primary purpose
Multiple public methods with overlapping functionality
Inconsistent comparison results between different strategies
Complex implementation with many private helper methods
Lack of clear separation of concerns
Mistakes in market comparison that make tournament results less informative

New Service Design

The new MarketComparisonService will focus exclusively on comparing two markets, with a clean and simple API that hides the complexity of the implementation.

Core API

@Injectable()
export class MarketComparisonService {
  /**
   * Compare two markets based on the provided configuration
   * @param market1 First market to compare
   * @param market2 Second market to compare
   * @param config Configuration for the comparison
   * @returns Comparison result with winner, confidence, and reasoning
   */
  public async compareMarkets(
    market1: string,
    market2: string,
    config: ComparisonConfig,
  ): Promise<ComparisonResult> {
    // Implementation details hidden
  }
}
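The docstring above promises a result with a winner, a confidence, and reasoning, but the plan does not define `ComparisonConfig` or `ComparisonResult`. A minimal sketch of what those types might look like follows. Every field name here is an assumption made for illustration, including the `both_weak` outcome and the `strategy` option; none of it is the actual definition.

```typescript
// Hypothetical shapes for the types referenced by compareMarkets().
// None of these field names come from the plan; they are illustrative only.
type ComparisonWinner = "market1" | "market2" | "tie" | "both_weak";

interface ComparisonConfig {
  // Assumed option: which internal strategy to run.
  strategy: "single_turn" | "multi_turn";
  // Assumed option: hide real market names from the model.
  anonymize: boolean;
}

interface ComparisonResult {
  winner: ComparisonWinner;
  // Confidence in the range [0, 1].
  confidence: number;
  reasoning: string;
}

const DEFAULT_CONFIG: ComparisonConfig = { strategy: "multi_turn", anonymize: true };

const WINNERS: readonly string[] = ["market1", "market2", "tie", "both_weak"];

// Runtime guard for results parsed from untrusted model output.
function isComparisonResult(value: unknown): value is ComparisonResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const winner = v["winner"];
  const confidence = v["confidence"];
  return (
    typeof winner === "string" &&
    WINNERS.includes(winner) &&
    typeof confidence === "number" &&
    confidence >= 0 &&
    confidence <= 1 &&
    typeof v["reasoning"] === "string"
  );
}
```

A runtime guard like this is useful because the result is ultimately parsed from model output, which the type system cannot vouch for.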

Key Design Principles

Single Responsibility: The service will focus exclusively on comparing two markets
Simple API: A single public method with a clear purpose
Encapsulation: All implementation details will be hidden behind the public API
Consistency: The service will provide consistent results regardless of the internal strategy used
Testability: The service will be designed for easy testing

Critical Success Factors

Based on the TYCHE FDD and experience with the current implementation, several factors have been identified as critical for effective market comparisons:

Comparison Flow Optimization
- Tailored to reasoning-focused LLM capabilities
- Structured to maximize model comprehension
- Adapted to specific model strengths/weaknesses
Data Presentation Format
- Technical data presented in digestible format
- Eliminates numerical comparison errors
- Highlights significant differences while maintaining context
Relative Signal Generation
- Presents indicators in relative terms (e.g., "RSI of market A is 10% higher than market B")
- Considers absolute values when relative comparisons would be misleading
- Reduces cognitive load on comparison models
Market Anonymization
- Hides actual market names to prevent bias
- Uses generic identifiers during comparison process
- Reveals actual markets only after comparison is complete
Multi-turn Comparison Process
- Guides through comparison with atomic questions
- Builds analysis through conversational approach
- Improves reasoning quality through step-by-step process
Responsibility Separation
- Separates reasoning generation from JSON data extraction
- Allows optimization of each process independently
- Improves overall system reliability
Weak Market Pre-filtering
- Eliminates obviously weak markets before comparison
- Uses weighted scoring system for pre-selection
- Prevents pollution of final results with poor candidates
Prompt Engineering
- Carefully crafted sorting prompts ("spells")
- Sometimes less structured prompts work better than explicit instructions
- Requires ongoing experimentation and refinement
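
Two of the factors above, market anonymization and relative signal generation, can be sketched in a few lines. The function names, the "Market A" label scheme, and the near-zero fallback threshold are assumptions made for this illustration rather than part of the existing service.

```typescript
// Sketch of two ideas from the list above: anonymized labels and
// relative indicator statements. All names are hypothetical.

interface AnonymizedMarkets {
  labels: Map<string, string>; // real name -> generic label
  reveal: Map<string, string>; // generic label -> real name
}

// Map real market names to generic labels ("Market A", "Market B", ...)
// and keep the reverse mapping so names can be revealed after the comparison.
function anonymize(markets: string[]): AnonymizedMarkets {
  const labels = new Map<string, string>();
  const reveal = new Map<string, string>();
  markets.forEach((name, i) => {
    const label = `Market ${String.fromCharCode(65 + i)}`;
    labels.set(name, label);
    reveal.set(label, name);
  });
  return { labels, reveal };
}

// Describe an indicator in relative terms. Falls back to absolute values
// when the baseline is too close to zero for a percentage to be meaningful.
function describeRelative(
  indicator: string,
  labelA: string,
  a: number,
  labelB: string,
  b: number,
  minBaseline = 1e-9,
): string {
  if (a === b) {
    return `${indicator} of ${labelA} equals ${indicator} of ${labelB} (${a})`;
  }
  if (Math.abs(b) < minBaseline) {
    return `${indicator} of ${labelA} is ${a}; ${indicator} of ${labelB} is ${b}`;
  }
  const pct = ((a - b) / Math.abs(b)) * 100;
  const direction = pct > 0 ? "higher" : "lower";
  return `${indicator} of ${labelA} is ${Math.abs(pct).toFixed(1)}% ${direction} than ${labelB}`;
}
```

For example, `describeRelative("RSI", "Market A", 55, "Market B", 50)` yields a statement that Market A's RSI is 10.0% higher, while a zero baseline produces the two absolute values instead of a misleading percentage.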

########## Max ##########

Raw ideas and observations (Max is brainstorming):

1. Methodology of our process here: reverse engineer the previous solution (market ranking service) into assumptions, turn assumptions into questions, convert the questions into a set of experiments, find the truth, and make a design decision.
2. R1 showed correct (and pretty cool) reasoning but somehow returned a result that was not aligned with its thinking (especially for "both weak"). Possible solution: multi-turn comparison should let us eliminate weak markets once the flow reaches a condition that allows a simple binary answer. We should be able to catch that moment and not allow the flow to move further.
3. Test strong hypotheses first to save time. Which ideas in this list are strong and can contribute significant value?
4. Maybe provide definitions of weak and strong in some special form? For example, a market is considered weak if the opposite action (sell instead of buy) would lead to higher profit. Both strong = not weak AND tied. Then we need to define ties.
5. Can an LLM really make a meaningful comparison when there are contradicting signals? It's pretty hard even for a human. Then what value can an LLM provide in such cases (strong skill)?
6. If Max cannot explain a swing buy, how can Max teach an LLM to do that reliably?
7. Maybe we just need to eliminate stupid mistakes and implement a flow where we can easily improve a specific part and understand whether it worked based on data (not just guessing). That was the problem with the legacy comparison flow: it was an implementation of one idea that was not designed for such experimenting. Now we can try to split the full process into atomic parts and try to understand the weak and strong sides of the LLM within each atom. That will allow us to design experiments with precision. We can also understand how a strong property of the LLM can be converted into a valuable piece of data within that atom (as if we are gathering knowledge about the market space as a byproduct of such a flow).
8. Maybe instead of asking binary questions we should ask for the top 10 explanations supporting some statement, then separately validate each argument and simply count them. It will use the reasoning power of the LLM and generate reasons instead of decisions.
9. Maybe we should allow the LLM to zoom out if needed, and to zoom in for consistency. Letting it move in and out of timeframes will also expand the capabilities of the comparison engine. Will it be very hard to build such a universal system?
10. Try multi-turn with just the Ollama service (we really need to try it to understand how it works within Ollama).
11. Provide as little data as possible and allow the LLM to call tools (this can be a very powerful thing and can definitely be utilized in other parts of the system).
12. If tool calling is possible, can we also explain how to use each tool right in the place where we provide it, so the LLM can discover the needed context on its own? It kind of makes the model learn during the conversation, which may make a small model as smart as a big one, since all necessary knowledge can be discovered.
13. Converting dollar values like volume to word form will increase the quality of comparison (sometimes the model treats it as a 'unit'). This seems a very minor detail that should not have a significant effect on our concept. Even if it is contributing to bad results, it's pretty easy to fix. So skip this idea for now, since it is very limited and is neither the source of the problem nor a solution.
14. Can an LLM compare markets without making a per-market analysis? (Maybe true only for multi-turn. Without multi-turn, I can tell from the thinking section that the LLM does some sort of market analysis first; it just happens implicitly.)
15. A huge structured prompt may be cool because it provides a detailed framework, assuming the LLM will follow it, but at the same time it hides the full structure beneath thinking tokens. So it both makes a really strong assumption about the LLM's capability to follow that structured prompt and hides very important data from us, so we have no idea where the model really fucked up (like NOW).
16. Can specializing a model on ONE market condition make it better, by kind of defining the model's mindset? Why might it be fundamentally beneficial (for humans the value of a mindset is probably the same case)?
17. Why is 7B fundamentally better (it was confirmed by o1 analysis)? Maybe because it guesses better when it does not understand the purpose of the request/question?
18. Can we rely on the LLM being able to compare 2 numbers? We should confirm it reliably because it is a fundamental part of any solution we are going to implement. We just cannot afford not to rely on such a simple assumption.
19. JSON that we provide in one line makes the request a bit shorter, but it may affect quality (should be checked, but feels pretty minor).
20. Quality of data should be fixed with the highest priority, since the cost of such a mistake (compared to how easily it can be avoided) is enormous in terms of the uncertainty it produces when we try to understand what went wrong. If we can provide less data to make a decision, we should do it.
21. "Stronger" is a relative term, but "both weak" is absolute. Does mixing those in one request lead to a natural conflict?
22. The reasoning actually looks very good. It looks like the model is capable, but during the flow something specific goes wrong, which leads to a wrong final result.
23. Interface idea: market comparison takes TA data as input (not market names) unless we allow the LLM to call tools.
24. Should we implement an explicit method to get market analysis via thinking tokens only (to verify correctness of the thought process)?
25. How to make R1 1.5B as smart as 7B and higher? Hypothesis: multi-turn flow with reasoning.
26. Instead of targeting and hunting for weaknesses, we should establish the strong parts that we are 100% sure we can rely on and build the system purely on them. There will be a ton of weaknesses and that's normal. Targeting those is just a waste of time. If we find a solution based on strong parts that solves our problem, will we look for weaknesses that we did not address? Definitely NO. It's kind of a black hole for energy: once we go in that direction it will suck up all our resources.

########## Claude ##########

I'll review your brainstorming list and provide feedback on the ideas for improving the MarketComparisonService. This is a great collection of thoughts that covers many aspects of the problem.

Initial Feedback on Your Ideas

Methodology & Process

Reverse engineering approach (item #1) is excellent - starting with assumptions, converting to questions, then experiments. This scientific method will help identify the real issues.
Testing strong hypotheses first (item #3) is a time-efficient approach. We should prioritize ideas that could provide the most significant value.

LLM Capabilities & Limitations

Reasoning vs. Results mismatch (item #2) - This is a critical observation. The model shows good reasoning but sometimes returns results that don't align with its own analysis. Multi-turn comparison could help catch these inconsistencies.
LLM's ability to handle contradicting signals (item #5) - Valid question about whether LLMs can make meaningful comparisons with contradictory signals. This is challenging even for humans.
Numerical comparison reliability (item #18) - This is a fundamental capability we need to verify. If LLMs struggle with basic numerical comparisons, we need to address this first.

Architecture & Design Approaches

Atomic process decomposition (item #7) - Breaking the process into atomic parts to understand strengths/weaknesses is an excellent approach. This modular design would make experimentation and improvement much easier.
Reasoning generation vs. decision making (item #8) - Having the LLM generate reasons rather than decisions could leverage its reasoning power better. Counting validated arguments is an interesting approach.
Tool calling capabilities (items #11-12) - Allowing the LLM to call tools and discover needed context could be powerful. This could make smaller models perform better by giving them access to necessary information.
Multi-turn comparison with Ollama (item #10) - Testing multi-turn capabilities within Ollama is important to understand its practical limitations.
Timeframe flexibility (item #9) - Allowing the LLM to zoom in/out of timeframes could enhance comparison capabilities, though it might increase complexity.

Data Presentation & Processing

Data quality and quantity (item #20) - Prioritizing data quality and providing only necessary data is crucial. This reduces uncertainty in understanding what went wrong.
Structured prompts vs. visibility (item #15) - Interesting observation about how structured prompts might hide important thinking processes. Finding the right balance is key.
Relative vs. absolute terms (item #21) - The distinction between relative terms ("stronger") and absolute terms ("both weak") could be causing confusion in the model's reasoning.

Implementation Ideas

Interface design (item #23) - Taking TA data as input rather than market names is a clean separation of concerns, unless we implement tool calling.
Explicit reasoning verification (item #24) - Implementing a method to verify the correctness of thought processes could help identify where reasoning breaks down.
Focus on strengths, not weaknesses (item #26) - Building on reliable capabilities rather than trying to fix all weaknesses is a pragmatic approach. This could save significant time and resources.

Key Patterns & Insights

Looking across your ideas, I see several important themes:

Reasoning vs. Results Disconnect: The models show good reasoning but sometimes return inconsistent results.
Process Decomposition: Breaking the comparison into smaller, verifiable steps could improve reliability.
Data Quality & Presentation: How we present data to the model significantly impacts performance.
Multi-turn Potential: Using conversational approaches could improve reasoning quality.
Leveraging Strengths: Building on what works reliably rather than fixing all weaknesses.

Next Steps Recommendation

Based on your brainstorming, I would recommend:

Verify fundamental capabilities: Test if LLMs can reliably compare numerical values and understand market indicators (addressing item #18).
Implement a modular, multi-turn approach: Break the comparison process into atomic steps that can be individually verified and improved (addressing items #7, #10).
Separate reasoning from decision-making: Have the LLM generate reasons, then use a separate process to validate and make decisions (addressing item #8).
Experiment with data presentation: Test different ways of presenting market data to find what works best (addressing items #13, #19, #20).
Build a prototype that focuses on strengths: Implement a solution that relies on the capabilities we're confident in (addressing item #26).
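
The "separate reasoning from decision-making" recommendation could look roughly like the sketch below: the LLM only produces arguments, a separate step marks each one valid or not, and plain code makes the decision by counting. The `ValidatedArgument` shape and the `minMargin` parameter are assumptions introduced here for illustration.

```typescript
// Hypothetical shape: one argument produced by the reasoning step,
// plus the verdict of a separate validation step.
interface ValidatedArgument {
  supports: "A" | "B"; // which market the argument favors
  text: string;
  valid: boolean; // set by the validator, not by the reasoning model
}

type Decision = "A" | "B" | "tie";

// Decide by counting validated arguments only; the reasoning model
// never emits the decision itself.
function decideByArgumentCount(args: ValidatedArgument[], minMargin = 1): Decision {
  let forA = 0;
  let forB = 0;
  for (const arg of args) {
    if (!arg.valid) continue; // rejected arguments carry no weight
    if (arg.supports === "A") forA++;
    else forB++;
  }
  // Require a lead of at least minMargin validated arguments to declare a winner.
  if (Math.abs(forA - forB) < minMargin) return "tie";
  return forA > forB ? "A" : "B";
}
```

Because the decision is a deterministic count, a reasoning/result mismatch of the kind described in item #2 cannot occur at this step; any error has to come from the arguments or their validation, both of which are stored and inspectable.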

########## Max ##########

1. As actionable items, what follows will be a set of simple but extensive experiments.
2. Verify fundamental capabilities. First of all we need to confirm that we can really rely on simple number comparison, because in the gist I've seen pretty stupid mistakes in the reasoning section, which made me doubt the ability to compare 2 numbers (specifically in the context of RSI). So we need to confirm that my observation was wrong. If it was legit, then we don't need to go any further with this service, since we cannot rely on any number-related results. In this experiment we will generate a lot of small artificial prompts where we say something like 'market rsi is 39. if rsi is below 40 its oversold, if rsi above 70 its overbought. is market oversold or overbought'. We need to confirm that both reasoning and results are correct, generate many tests based on a few templates, and verify that the LLM handles all cases reliably. If we find edge cases where that is not true, we need to understand what we should avoid, and finally answer for ourselves whether this simple and fundamental tool is available to us. I expect it to be available at least to the point where we have a clear understanding of what we should avoid in the prompt to make it robust as fuck.
3. If 2 shows us that we are good to go further, then let's experiment with the multi-turn approach. It can be a game changer and solve basically all problems by providing a robust and extremely reliable tool.
4. If 3 is confirmed as true, or at least we have understood the capabilities of multi-turn and realize that it is indeed a foundation we can use to build complex stuff on top of, then the next important experiment would be to check tool calling capabilities, because it opens the way to building a universal solution with atomic blocks. If we succeed with that universal approach, it will open enormous opportunities for complex analysis and will also have use cases far beyond this particular context. So that experiment is very important for Mercury capabilities and is not limited to the Tyche use case. Once we conclude these 3 experiments, we can reassess our next steps within the market service.
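
As an illustration of the template-based generation described in the number comparison experiment, here is a minimal sketch. The thresholds (below 40 oversold, above 70 overbought) are taken from the example prompt above; the template wording, type names, and function names are assumptions.

```typescript
// Ground truth for the RSI rule quoted in the experiment description:
// below 40 is oversold, above 70 is overbought.
type RsiState = "oversold" | "overbought" | "neither";

interface NumberComparisonCase {
  prompt: string;
  expected: RsiState;
}

const OVERSOLD_BELOW = 40;
const OVERBOUGHT_ABOVE = 70;

// Strict comparisons: exactly 40 or exactly 70 counts as "neither".
function classifyRsi(rsi: number): RsiState {
  if (rsi < OVERSOLD_BELOW) return "oversold";
  if (rsi > OVERBOUGHT_ABOVE) return "overbought";
  return "neither";
}

// Hypothetical templates; {rsi} is replaced with the test value.
const TEMPLATES: string[] = [
  "Market RSI is {rsi}. If RSI is below 40 it is oversold, if RSI is above 70 it is overbought. Is the market oversold or overbought?",
  "The RSI of the market equals {rsi}. RSI under 40 means oversold; RSI over 70 means overbought. Which state is the market in?",
];

// Cross every template with every value and attach the expected answer,
// so verification never depends on a human reading the output.
function generateCases(values: number[], templates: string[] = TEMPLATES): NumberComparisonCase[] {
  const cases: NumberComparisonCase[] = [];
  for (const template of templates) {
    for (const rsi of values) {
      cases.push({
        prompt: template.replace("{rsi}", String(rsi)),
        expected: classifyRsi(rsi),
      });
    }
  }
  return cases;
}

const rsiCases = generateCases([39, 40, 41, 69, 70, 71]);
```

Computing the expected answer in code rather than writing it by hand keeps the boundary cases (exactly 40, exactly 70) consistent across all templates.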

####### Claude #########

I agree with your experimental approach. Let me add some specific details for each experiment:

Number comparison experiment:
- Create 20-30 test cases with varying RSI values (35, 39, 40, 41, 69, 70, 71, etc.)
- Include edge cases (exactly 40, exactly 70)
- Test with different phrasings ("below 40" vs "less than 40" vs "under 40")
- Test with different indicator types (RSI, MACD, Volume)
- Document success rate and any patterns in failures
- Expected outcome: Establish reliable patterns for numerical comparisons
Multi-turn approach experiment:
- Start with simple market analysis questions
- Progress to guided comparison with follow-up questions
- Test different conversation structures: a) Analysis → Comparison → Decision b) Question-based exploration → Summary → Decision
- Compare results with single-turn approach using same data
- Expected outcome: Quantify improvement in reasoning consistency
Tool calling capabilities:
- Create simple tools for fetching specific indicator data
- Test if model can appropriately decide when to use tools
- Measure impact on reasoning quality with/without tools
- Expected outcome: Framework for atomic, tool-based market analysis

These experiments should give us clear data on fundamental capabilities we can rely on before implementing the full service.
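
For the multi-turn experiment, a small runner that feeds the growing history back on every turn and records each intermediate answer might look like this. `ChatFn` is a hypothetical stand-in for whatever client ends up talking to the model; it is not the Ollama client's actual interface, and the role/content message shape is simply the one most chat-style APIs use.

```typescript
// Message shape used by most chat-style LLM APIs.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical transport: takes the full history, returns the assistant's reply text.
type ChatFn = (history: ChatMessage[]) => Promise<string>;

interface TurnRecord {
  question: string;
  answer: string;
}

// Ask the questions one at a time, sending the whole history on each turn,
// and keep every intermediate answer so a failing step can be located later.
async function runMultiTurn(
  chat: ChatFn,
  systemPrompt: string,
  questions: string[],
): Promise<TurnRecord[]> {
  const history: ChatMessage[] = [{ role: "system", content: systemPrompt }];
  const transcript: TurnRecord[] = [];
  for (const question of questions) {
    history.push({ role: "user", content: question });
    // Pass a copy so the transport cannot mutate our history.
    const answer = await chat([...history]);
    history.push({ role: "assistant", content: answer });
    transcript.push({ question, answer });
  }
  return transcript;
}
```

Conversation structure (a) would then be three questions in order (analysis, comparison, decision), and comparing the transcript against a single-turn run on the same data gives the consistency measurement the experiment asks for.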

########## Max ##########

on fundamental market number comparison capabilities:
- Let's not waste our time on full test coverage from the beginning. Let's add tests in a few steps. The idea behind this approach is that if we can confirm the problem in the first steps, then we don't need to go any further: this tool is not available to us in this form. So first we eliminate the risk of wasting time on everything else. Basically we want basic proof that it's worth spending time on this. If this tool is unreliable BS, we just need one solid confirmation. What we are going to confirm for basic number comparison: 1. reasoning ok 2. JSON response ok 3. handling common numerical values ok 4. handling scientific representation ok 5. handling equals ok 6. same set for JSON results. Reasoning confirmation is a bit tricky, since we want to run many tests and verification should be automated (maybe save the reasoning in a separate file and then analyze it with a bigger model or another llama3.2). So basically we are just looking for a solid fuckup with proof, or confirmation that we can move forward. In this test there is no point in swapping RSI for MACD, or in writing many tests ourselves besides generating them from a simple template. The JSON response will be much simpler to verify. We are also going to experiment with a BS detection system on this simple task.
- Multi-turn flow: first we need to understand that tool without market context. Just pure experimenting with the Ollama service, gradually adding complexity once we confirm we understand the tool.
- Same for tool calling. We will start with universal stuff, since we are looking to extend our capabilities beyond market context. So the first tools that we provide should not be limited to market context: things like get available tools, get help on a tool, and decide which tool to continue with.
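
The domain-agnostic starting point for tool calling (get available tools, get help on a tool) could be prototyped with a plain registry before any model is involved. The class, the tool names `list_tools` and `get_help`, and the string-in/string-out convention are all assumptions made for this sketch.

```typescript
// A domain-agnostic tool registry with two meta-tools: listing the
// available tools and getting help for one tool. All names are hypothetical.
interface Tool {
  name: string;
  description: string;
  usage: string; // how-to text, delivered right where the tool is offered
  run: (args: Record<string, string>) => string;
}

class ToolRegistry {
  private readonly tools = new Map<string, Tool>();

  register(tool: Tool): void {
    if (this.tools.has(tool.name)) throw new Error(`Duplicate tool: ${tool.name}`);
    this.tools.set(tool.name, tool);
  }

  get(name: string): Tool | undefined {
    return this.tools.get(name);
  }

  list(): Tool[] {
    return [...this.tools.values()];
  }

  // Dispatch a call requested by the model. Unknown names return a readable
  // message instead of throwing, so the model can recover on the next turn.
  call(name: string, args: Record<string, string> = {}): string {
    const tool = this.tools.get(name);
    if (!tool) return `Unknown tool "${name}". Call list_tools to see what is available.`;
    return tool.run(args);
  }
}

function createRegistryWithMetaTools(): ToolRegistry {
  const registry = new ToolRegistry();
  registry.register({
    name: "list_tools",
    description: "List the names and descriptions of all available tools.",
    usage: "list_tools()",
    run: () => registry.list().map((t) => `${t.name}: ${t.description}`).join("\n"),
  });
  registry.register({
    name: "get_help",
    description: "Show usage instructions for one tool.",
    usage: 'get_help(tool="<tool name>")',
    run: (args) => {
      const requested = args["tool"] ?? "";
      const target = registry.get(requested);
      return target
        ? `${target.name}\n${target.description}\nUsage: ${target.usage}`
        : `No such tool: ${requested}`;
    },
  });
  return registry;
}
```

Market-specific tools could later be registered into the same registry without changing the meta-tools, which is what keeps the approach reusable outside the market context.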

########## Claude ##########

Your incremental testing approach makes perfect sense. Let me refine the experimental plan:

Fundamental number comparison test (minimal version):
- Create 5-10 simple test cases focusing on basic numerical comparison
- Test structure: "Value X is Y. If Y is [comparison] Z, then condition is true/false."
- Include one case each for: basic integers, decimals, scientific notation, equality check
- Automate verification by extracting JSON response and checking against expected result
- For reasoning verification: extract key phrases or use a simple classifier model
- Goal: Quick go/no-go decision on whether to proceed further
Multi-turn flow (starting with non-market context):
- Begin with simple logical reasoning tasks (e.g., "I'm thinking of a number...")
- Test basic memory/context retention across 3-5 turns
- Measure consistency of reasoning across turns
- Only after confirming basic capabilities, introduce simple market concepts
- Goal: Understand Ollama's multi-turn limitations before adding domain complexity
Tool calling exploration (domain-agnostic first):
- Start with meta-tools: help, list_tools, tool_description
- Test basic tool selection logic with simple scenarios
- Measure appropriate tool selection rate
- Goal: Establish baseline tool-calling reliability before market-specific implementation

This approach minimizes wasted effort by establishing clear go/no-go criteria at each step. If we encounter fundamental limitations, we can pivot quickly rather than investing in a complex implementation that won't work.
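
The automated verification step for the number comparison test could be sketched as follows. It assumes the model's answer is a flat JSON object (no nested braces), possibly preceded by a reasoning section wrapped in `<think>...</think>` tags; the field name being checked and all function names are hypothetical.

```typescript
// Extract the first flat JSON object from a model response.
// Assumption: any reasoning is wrapped in <think>...</think> and is stripped first,
// so braces inside the reasoning cannot be mistaken for the answer.
function extractJson(response: string): Record<string, unknown> | null {
  const withoutThinking = response.replace(/<think>[\s\S]*?<\/think>/g, "");
  const match = withoutThinking.match(/\{[^{}]*\}/);
  if (!match) return null;
  try {
    const parsed: unknown = JSON.parse(match[0]);
    return typeof parsed === "object" && parsed !== null
      ? (parsed as Record<string, unknown>)
      : null;
  } catch {
    return null; // malformed JSON counts as a failed extraction
  }
}

interface CaseOutcome {
  id: string;
  passed: boolean;
}

// Compare one field of the extracted JSON against the expected value.
function verifyCase(id: string, response: string, field: string, expected: unknown): CaseOutcome {
  const json = extractJson(response);
  return { id, passed: json !== null && json[field] === expected };
}

// Go/no-go summary: "go" only if there is at least one case and none failed.
function summarize(outcomes: CaseOutcome[]): { passRate: number; go: boolean; failures: string[] } {
  const failures = outcomes.filter((o) => !o.passed).map((o) => o.id);
  const passRate =
    outcomes.length === 0 ? 0 : (outcomes.length - failures.length) / outcomes.length;
  return { passRate, go: outcomes.length > 0 && failures.length === 0, failures };
}
```

Keeping the failing case IDs in the summary matters more than the pass rate here, since a single reproducible failure is enough for the no-go decision.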
