AI for Investing: A Failing Grade on the Quant Test Is Why I’m Halting My Daily Market Report
A head-to-head test today of the best AI platforms revealed they can’t be trusted with the numbers. Here’s my audit, my struggle, and why financial literacy needs a reliable AI ally.
Executive Summary: Testing & Results
My comparative test of two leading AI platforms, Google’s Gemini and xAI’s Grok, for generating a professional daily global market pulse report revealed significant deficiencies. Despite being provided with identical, well-defined instructions and a CSV dataset, both models produced materially flawed outputs that are unacceptable for professional use. For this reason, I will not be reporting daily to paid subscribers until this fundamental problem is resolved.
The core failure was quantitative. The Gemini report suffered from a fundamental data parsing or sorting error, causing it to misidentify the top and bottom performing assets in nearly every category. This foundational flaw rendered its performance tables and anomaly detection largely useless. The Grok report, while somewhat better at spotting extreme price movements, displayed critical errors in sorting order and a complete failure to correctly apply the explicit, rule-based logic for generating final “STRONG BUY” or “STRONG SELL” recommendations.
Qualitatively, the results were reversed. The Gemini report provided superior, high-utility narratives that expertly linked data anomalies to specific external market events. The Grok report’s narratives were informative but less specific, often focusing on broader sector trends rather than the precise outliers in the data.
Audit Conclusion: Neither AI platform can be relied upon for automated financial analysis in its current state. The test demonstrates that while one model excels at contextual narrative generation and another shows slightly better initial data identification, both fail at the rigorous, rule-based quantitative processing required by a professional portfolio manager.
The Struggle and Path Forward
My path to this test was fraught. Despite numerous iterations, Grok admitted its analytical errors but could not fix them. I turned to Google One, which, after months of evaluation, stood alongside Grok as the best available by a long measure. Other platforms were non-starters: ChatGPT’s output was so poor I declined an offer to join OpenAI’s internal team to improve it. Claude, as my readers know, was infinitely worse—an outright fraud in its capabilities.
All AI platforms are under rapid development, showing improvements in speed and depth. However, as an investment professional, my needs are basic: the output is either accurate, or it is unacceptable. I decided to pit the best against the best. Unfortunately, both failed.
I am not JP Morgan or Goldman Sachs, and in terms of age, I am not even employable by such firms. Yet, in my career, I built two organizations from scratch that competed against these humongous bank and broker organizations—and won. I criticized their work so frequently that three of the world’s largest financial institutions directed their global heads of legal affairs to threaten me. I retracted nothing and told them I was not the enemy. In reflection, I concluded I had been too easy on them. My personal ethical standards were superior, unburdened by the self-conflicts and internal politics that permeate those institutions.
I make these remarks because I see little change for the better in the industry, and I can envision a future where AI decimates these firms. I am not an activist; my ambition is to foster global financial literacy, empowering individuals to demand greater value for the high costs they pay. It’s an ambitious undertaking, but AI now provides me with the equivalent of an army of employees. I am determined to push on—but first, I must train this army to meet my standards.
Third-Party Audit Report
This is a critical comparison. I have analyzed the prompt, the CSV data, and both the “Gemini Oct 7 Daily Pulse.pdf” and “Grok Oct 7 Market Pulse.pdf” reports to identify discrepancies in calculations, adherence to rules-based recommendations, and the quality of the external narrative information.
My analysis focuses on:
1. Quantitative Accuracy and Rule Adherence (Calculations & Recommendations): Pinpointing specific mathematical and logical errors or deviations from the defined rules in the prompt (e.g., Top/Bottom calculations, anomaly identification, and INSTAT-based recommendation logic).
2. External Narrative Quality: Assessing the contextual narratives (Section 5) based on the instructions for relevance, alignment with CSV data, and overall utility for a sophisticated investor.
Here is the detailed analysis.
Detailed Analysis of AI Reports
The discrepancies between the reports confirm your assessment of deficiency, stemming from fundamental errors in data parsing, sorting, and rule application.
1. Quantitative Accuracy and Rule Adherence (Calculations & Recommendations)
Both reports exhibit significant errors, making them deficient for a professional portfolio manager. The Gemini report appears to suffer from an issue where the sort function failed to place the most extreme positive or negative values at the top/bottom, likely due to faulty data conversion or an incorrect sorting key. The Grok report, while initially more accurate in selecting the extreme values, displays a clear logical error in determining the final INSTAT scores and applying the complex recommendation logic.
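One way a "faulty data conversion or incorrect sorting key" can produce this kind of result is sketched below. It is an illustration only: the three names and 1-day values are taken from the D2 rows of the audit, but how either model actually parsed the CSV is unknown. When the percentage column is left as text, a sort compares characters rather than magnitudes, so 9.86 outranks 14.82. That happens to match the ordering the audit attributes to the Grok report for D2, which is suggestive but not proof of the cause.

```python
# Illustration only: the same rows sorted with a text key vs. a numeric key.
# Values come from the audit's D2 rows; the models' real parsing is unknown.
rows = [
    ("Samsung Electronics Co Ltd", "3.49"),
    ("SK Hynix Inc", "9.86"),
    ("LG Energy Solution Ltd", "14.82"),
]

# Wrong: raw CSV text is compared character by character,
# so "9.86" ranks above "14.82" because "9" > "1".
by_text = sorted(rows, key=lambda r: r[1], reverse=True)

# Right: convert to float first, then sort on the number.
by_value = sorted(rows, key=lambda r: float(r[1]), reverse=True)

print([name for name, _ in by_text])
# ['SK Hynix Inc', 'Samsung Electronics Co Ltd', 'LG Energy Solution Ltd']
print([name for name, _ in by_value])
# ['LG Energy Solution Ltd', 'SK Hynix Inc', 'Samsung Electronics Co Ltd']
```

A quick check of this kind, run on any column before ranking, would expose a text-typed column immediately.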
| Dataset | Metric | CSV Data (Correct) | Gemini Report Error | Grok Report Error |
|---|---|---|---|---|
| D2: Asia-Pacific | Top 3 (1-Day %) | 1. LG Energy Solution Ltd (373220.KS, KS): 14.82% | Incorrectly selects Samsung Electronics Co Ltd (005930.KS, KS): 3.49% (missing the true top 3). | Incorrectly selects SK Hynix Inc (000660.KS, KS): 9.86% as #1 and misses LG Energy Solution Ltd (373220.KS, KS): 14.82% entirely, which is the actual #1. |
| D2: Asia-Pacific | Anomaly: LG Energy Solution Ltd (373220.KS, KS) | 1-Day %: 14.82% | Missing. | Correctly identified. |
| D4: Canadian | Top 3 (1-Day %) | 1. Talon Metals Corp (TLO.TO, TSX): 14.10% | Incorrectly selects Canadian Natural Resources Ltd (CNQ.TO, TSX): 2.50% (misses the true top 3). | Incorrectly selects Canadian Tire Corporation Limited (CTC.TO, TSX): 2.88% as #1 and misses Talon Metals Corp (TLO.TO, TSX): 14.10% entirely. |
| D4: Canadian | Anomaly: Talon Metals Corp (TLO.TO, TSX) | 1-Day %: 14.10% | Missing. | Missing (only identifies G Mining Ventures Corp (GMIN.TO, TSX) for 1-Week). |
| D5: US Markets | Top 3 (1-Week %) | 1. Plug Power Inc (PLUG.O, NASDAQ): 65.88% | Incorrectly selects Advanced Micro Devices Inc (AMD, NASDAQ): 30.63% (misses the true top 3). | Incorrectly lists Plug Power Inc (PLUG.O, NASDAQ) at 65.88% as #3, when it should be #1. |
| D5: US Markets | Anomaly: Plug Power Inc (PLUG.O, NASDAQ) | 1-Week %: 65.88% | Incorrectly selects Humana Inc (HUM, NYQ) 10.56% (1-Day) and Advanced Micro Devices Inc (AMD, NASDAQ) 9.38% (1-Day) as anomalies. | Correctly identified. |
| D5: US Markets | Recommendation: Tapestry Inc (TPR, NYSE) | Returns: Strong. INSTAT: AT (25), ST (25), IN (50), INSTAT (100). | Missing from recommendations. | Missing from recommendations. (Should be STRONG BUY.) |
Conclusion on Quantitative Accuracy
Gemini Report: The most severe failure appears to be a flaw in the core data processing step (Step 2: Performance Tables) preventing the identification and proper selection of true extreme performance values. This cascades, causing almost all Top/Bottom 3 lists and anomaly reports to be incorrect.
Grok Report: The Grok report shows greater accuracy in anomaly identification and selection of some extreme values, but frequently lists them in the wrong order or misses the most extreme one, which is a major sorting error. Furthermore, its application of the detailed INSTAT recommendation logic (e.g., missing an obvious STRONG BUY for Tapestry Inc) suggests a failure in the final rule-based classification step.
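The final classification step can be made mechanical, which is what makes a miss like Tapestry Inc hard to excuse. The sketch below is a minimal illustration under stated assumptions: the cutoff values, the `Scores` class, and the `classify` function are all hypothetical, since the prompt's actual thresholds are not reproduced in this article. Only the Tapestry scores (AT 25, ST 25, IN 50, INSTAT 100, strong returns) come from the audit table, and only the STRONG BUY branch is shown.

```python
from dataclasses import dataclass

# HYPOTHETICAL cutoffs: placeholders, not the thresholds from the real prompt.
HIGH_AT = 20
HIGH_ST = 20
HIGH_IN = 40
HIGH_INSTAT = 80


@dataclass
class Scores:
    name: str
    strong_returns: bool  # assumed to be decided upstream from the return columns
    at: float
    st: float
    inn: float  # the "IN" score; "in" is a reserved word in Python
    instat: float


def classify(s: Scores) -> str:
    """Return STRONG BUY only when every condition of the rule holds."""
    checks = [
        s.strong_returns,
        s.at >= HIGH_AT,
        s.st >= HIGH_ST,
        s.inn >= HIGH_IN,
        s.instat >= HIGH_INSTAT,
    ]
    return "STRONG BUY" if all(checks) else "NO RECOMMENDATION"


# Scores for Tapestry Inc as listed in the audit table.
tapestry = Scores("Tapestry Inc", True, at=25, st=25, inn=50, instat=100)
print(classify(tapestry))  # STRONG BUY
```

Because every condition is an explicit comparison, the result for any company can be re-derived by hand from the CSV row, which is the kind of verifiability a rule-based recommendation needs.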
2. Assessment of External Contextual Narratives
The instruction mandates a ~250-word narrative that highlights market trends supporting or differing from CSV data, references specific companies, and aligns with CSV metrics where possible, drawing from reputable sources (Reuters, Bloomberg, Financial Times, Report On Business).
| Report | Narrative Quality Assessment | Score (0-5) |
|---|---|---|
| Gemini Report | High Utility. The narratives are well-structured, directly reference the performance data, and provide clear external context for the outliers identified in the CSV. For example, the D5 narrative correctly links Advanced Micro Devices Inc (AMD, NASDAQ) performance to an “AI frenzy” and a major OpenAI deal. The D3 narrative is excellent, contrasting the weakness of TotalEnergies SE (TTE.PA, EPA) and Shell PLC (SHEL.K, LON) with external reports of oil price volatility. | 5/5 |
| Grok Report | Moderate to High Utility. The narratives are also informative and align with CSV metrics, frequently citing Reuters and Bloomberg to explain performance (e.g., SK Hynix Inc (000660.KS, KS) on AI chip demand in D2). However, the language is occasionally more generalized, and the narratives tend to focus on the overall sector context rather than providing a deep, company-specific contextualization for the extremes, which is a strength of the Gemini report. | 4/5 |
Conclusion on External Narrative Quality
For a sophisticated professional like yourself, the Gemini Report provided the best and most useful contextual information. Its narratives consistently tied the extreme performance anomalies in the CSV (both positive and negative) to highly specific external market factors (e.g., the OpenAI deal, oil price volatility, property sector woes), which is the most valuable context for a risk-and-opportunity assessment.
Overall Assessment and Summary
Both reports are unacceptable due to severe and systematic errors in data processing and rule application.
Quantitative Deficiency: The Gemini report failed in the foundational task of accurately identifying the top and bottom performers, suggesting a fundamental sorting/data type conversion issue. The Grok report failed the task by providing inconsistent sorting and by incorrectly applying the highly specific, rule-based logic for the recommendations, despite the rules being explicitly provided.
Qualitative Strength (Narrative): The Gemini report provided a superior qualitative analysis. Its contextual narratives (Section 5) directly addressed the outliers and sector trends found in the data, offering more actionable and relevant external context for a sophisticated investor.
Moving forward, the AI platform used for this task must ensure strict, verifiable adherence to all quantitative steps:
1. Correct NaN Handling: Correctly convert ‘-’ to NaN and use na_position for sorting as specified.
2. Accurate Sorting: Ensure a correct float conversion and sort key are used to identify the true Top 3 and Bottom 3 in each metric.
3. Precise Rule Application: Strictly follow the complex multi-variable criteria for the Bill Cara’s Perspective & Recommendations section (e.g., STRONG BUY: Companies with strong returns and high AT (), high ST (), high IN (), and high INSTAT ()).
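The first two steps in the list above can be sketched in a few lines. This is an illustration under assumptions, not the pipeline from the prompt: `to_float` and `top_and_bottom` are hypothetical helper names, the sample rows are invented placeholders rather than CSV data, and excluding NaN rows before ranking stands in for sorting with missing values placed last (the effect `na_position="last"` has in a pandas `sort_values` call). The standard library is used so the example runs anywhere.

```python
import math


def to_float(cell: str) -> float:
    """Convert a CSV cell to a float; the placeholder '-' becomes NaN."""
    text = cell.strip()
    if text == "-":
        return math.nan
    return float(text)


def top_and_bottom(rows, n=3):
    """Return (top n, bottom n) by value; NaN rows appear in neither list."""
    parsed = [(name, to_float(cell)) for name, cell in rows]
    valid = [r for r in parsed if not math.isnan(r[1])]
    ranked = sorted(valid, key=lambda r: r[1], reverse=True)
    # Bottom list is reversed so the worst performer comes first.
    return ranked[:n], ranked[-n:][::-1]


# Invented placeholder rows (NOT from the CSV): name, 1-day % as raw text.
rows = [
    ("A", "2.50"), ("B", "-"), ("C", "14.10"), ("D", "-3.20"),
    ("E", "2.88"), ("F", "-0.75"), ("G", "0.40"),
]

top, bottom = top_and_bottom(rows)
print([name for name, _ in top])     # ['C', 'E', 'A']
print([name for name, _ in bottom])  # ['D', 'F', 'G']
```

Note that "-3.20" parses as a negative number while a bare "-" becomes NaN, so a missing value can never be mistaken for an extreme one.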
I have found #ai to be a good summary tool and directionally useful, but not a precision instrument. The scary part is that almost every doctor is now using an LLM such as OpenEvidence AI when treating patients.