Why Large Language Models Struggle with Financial Analysis
Large language models have revolutionized areas built on text generation, analysis, and interpretation. They perform remarkably well on large volumes of textual data, drawing logical and often insightful inferences from it. But the moment these models are asked to analyze numerical data or the more complex mathematical relationships that are unavoidable in financial analysis, obvious limitations begin to appear.
Let's break it down in simpler terms.
The Problem with Math and Numerical Data
Imagine a very complicated mathematical formula with hundreds of variables. If you asked ChatGPT to solve it, it would not perform a calculation in the truest sense; it would make an educated guess based on the patterns it learned during training.
It might predict, for example, after reading through several thousand tokens, that the most probable digit after the equals sign is 4, based on statistical likelihood rather than any serious mathematical reasoning. This, in short, follows from the point above: LLMs are built to predict patterns in language, not to solve equations or reason logically through problems. Put differently, consider the difference between an English major and a math major: the English major can read and understand text very well, but hand him a complicated derivative problem and he is likely to make an educated guess rather than actually solve it step by step.
That is precisely how ChatGPT and similar models tackle a math problem: they simply have not had the underlying training to reason through numbers the way a mathematics major would.
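To make the contrast concrete, here is a minimal sketch of what actual computation looks like, as opposed to token prediction. It assumes the sympy computer algebra system as the "math major" in the analogy; the expression itself is an invented example.

```python
# A computer algebra system derives the answer deterministically,
# instead of predicting the statistically most likely next token.
import sympy as sp

x = sp.symbols("x")
expr = sp.sin(x) * sp.exp(x**2)  # an invented "complicated derivative problem"

# sympy applies the product and chain rules symbolically; the result
# is provably correct, not merely probable.
derivative = sp.diff(expr, x)
print(derivative)  # exp(x**2)*cos(x) + 2*x*exp(x**2)*sin(x)
```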
Applying This to Financial Analysis
Okay, so why does this matter for financial analysis? Suppose you were analyzing a stock's performance based on two major datasets: 1) a corpus of tweets about the company and 2) the stock's price movements. ChatGPT would be great at running sentiment analysis on the tweets.
It can scan through thousands of tweets and produce a sentiment score, telling you whether public opinion about the company is positive, negative, or neutral. Since understanding text is one of the core strengths of LLMs, this task can be carried out effectively.
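As a rough illustration of that aggregation step, here is a minimal sketch assuming the Hugging Face transformers library and its default sentiment pipeline; the tweets are invented examples.

```python
# Sketch: aggregate a sentiment score over a batch of tweets.
# The default sentiment-analysis pipeline is a stand-in for
# whichever model you would actually use.
from transformers import pipeline

tweets = [
    "Loving the new product line, this company is on fire!",
    "Earnings miss again... management has no plan.",
    "Holding my shares, nothing interesting happening.",
]

sentiment = pipeline("sentiment-analysis")
results = sentiment(tweets)

# Map POSITIVE/NEGATIVE labels to +1/-1 and average into one score.
score = sum(
    r["score"] if r["label"] == "POSITIVE" else -r["score"] for r in results
) / len(results)
print(f"aggregate sentiment: {score:+.2f}")  # > 0 bullish, < 0 bearish
```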
Things get more challenging when you want it to make a decision based on numerical data. For example, you might ask: "Given the above sentiment scores across tweets and additional data on stock prices, should I buy or sell the stock at this point in time?" This is where ChatGPT lets you down. Interpreting raw numbers, such as price data or sentiment-score correlations, simply is not what LLMs were built for.
In this case, ChatGPT cannot reliably estimate the relationship between the sentiment scores and the prices. If it guesses, the answer could be entirely random. Such an unreliable prediction is not merely unhelpful but actively dangerous, given that in financial markets real monetary decisions may be based on it.
Why Causation and Correlation Are Problematic for LLMs
More than a math problem, much of financial analysis is really about figuring out how one set of data relates to another, say, market sentiment versus stock prices. But if A and B move together, that does not automatically mean A causes B, because correlation is not causation. Determining causality requires a kind of logical reasoning that LLMs have so far proven remarkably bad at.
One recent paper asked whether LLMs can separate causation from correlation. The researchers built a dataset of 400,000 samples with known causal relationships injected into it, then tested 17 pre-trained language models, including ChatGPT, on whether they could identify what is cause and what is effect. The results were sobering: the LLMs performed close to random chance at inferring causation, meaning they often could not distinguish mere correlation from true cause-and-effect relationships. Translated back into our stock market example, the problem becomes much clearer. If sentiment toward a stock is bullish and the price does go up, an LLM simply would not understand what the two have to do with each other, let alone whether the stock will continue to rise. The model might as well say "sell the stock"; its answer would be no better than flipping a coin.
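To see how little the correlation itself tells you, here is a minimal sketch using numpy and invented toy numbers, computing the one statistic the data actually gives you:

```python
# Sketch with fabricated toy numbers: daily sentiment scores and stock
# returns can be strongly correlated without telling us whether sentiment
# drives price, price drives sentiment, or a third factor drives both.
import numpy as np

sentiment = np.array([0.2, 0.5, 0.1, -0.3, 0.6, 0.4, -0.1])  # daily scores
returns   = np.array([0.8, 1.5, 0.2, -1.1, 1.9, 1.2, -0.4])  # daily % moves

r = np.corrcoef(sentiment, returns)[0, 1]
print(f"Pearson correlation: {r:.2f}")
# A high value here is equally consistent with sentiment driving the
# price, the price driving sentiment, or an earnings report driving both.
```

A correlation near 1.0 in this toy data settles nothing about direction; that is exactly the inference step the cited paper found LLMs failing at.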
Will Fine-Tuning Be the Answer?
Fine-tuning might offer a partial way out. Retraining the model on the relevant data can make it better at handling such datasets. A model fine-tuned on paired tweet sentiment and stock prices should, in principle, pick up the relationship between those two features.
However, there's a catch.
The same research supports this point, with a caveat: the refined capability holds only for data similar to what the models were fine-tuned on. Confronted with completely new data, whether different sentiment sources or new market conditions, the model's performance drops.
In other words, even fine-tuned models do not generalize: they can handle data much like what they have already seen, but they cannot adapt to new or evolving datasets.
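For concreteness, here is a sketch of what training records for such a fine-tune might look like. The OpenAI chat-style JSONL layout is used as one plausible target format, and every field value is hypothetical.

```python
# Sketch: building fine-tuning examples that pair sentiment and price
# features with a trading label. The records themselves are hypothetical.
import json

samples = [
    {"sentiment": 0.62, "price_change_5d": 0.031, "label": "buy"},
    {"sentiment": -0.41, "price_change_5d": -0.022, "label": "sell"},
]

with open("finetune.jsonl", "w") as f:
    for s in samples:
        record = {
            "messages": [
                {"role": "system",
                 "content": "Decide buy or sell from the features given."},
                {"role": "user",
                 "content": f"sentiment={s['sentiment']}, "
                            f"5-day price change={s['price_change_5d']}"},
                {"role": "assistant", "content": s["label"]},
            ]
        }
        f.write(json.dumps(record) + "\n")

# Caveat from the research above: a model tuned on this distribution
# tends to break down when sentiment sources or market regimes change.
```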
Plug-ins and External Tools: One Potential Answer
One way to overcome this weakness is to integrate LLMs with domain-specific tooling, much as ChatGPT now integrates Wolfram Alpha for math problems. Since ChatGPT cannot solve a math problem itself, it forwards the problem to Wolfram Alpha, a system built specifically for complex calculations, and then relays the answer back to the user.
The same approach could be replicated for financial analysis: once the LLM recognizes that it is working with numerical data, or that causality has to be inferred, that part of the problem can be outsourced to models or algorithms developed for those particular tasks. Once these analyses are done, the LLM can synthesize the results and provide a better-grounded recommendation or insight. Such a hybrid approach, combining LLMs with specialized analytical tools, holds the key to better performance in financial decision-making contexts; a minimal sketch of the routing idea follows.
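Here is one way such routing could look in code. The functions `llm_sentiment` and `quant_signal` are hypothetical stand-ins, not any real library's API.

```python
# Sketch of the hybrid routing idea: the LLM handles text, while numeric
# decisions are dispatched to specialized code.
import numpy as np

def llm_sentiment(tweets: list[str]) -> float:
    """Stand-in for an LLM call that scores tweets in [-1, 1]."""
    return 0.4  # placeholder value

def quant_signal(sentiment: float, returns: np.ndarray) -> str:
    """Stand-in for a purpose-built quantitative model."""
    momentum = returns[-3:].mean()
    return "buy" if sentiment > 0.2 and momentum > 0 else "hold"

def answer(question: str, tweets: list[str], returns: np.ndarray) -> str:
    # Route numeric decision-making away from the LLM.
    if "buy" in question or "sell" in question:
        score = llm_sentiment(tweets)        # the LLM does what it is good at
        return quant_signal(score, returns)  # the quant tool decides
    return "text-only question: let the LLM answer directly"

print(answer("should I buy?", ["great quarter!"], np.array([0.1, 0.3, 0.2])))
```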
What Does That Mean for Financial Analysts and Traders?
If you plan to use ChatGPT or other LLMs in your financial analysis workflow, these limitations should not be left unattended. Powerful as the models may be for sentiment analysis, news analysis, or any other kind of textual analysis, they should not be relied on for numerical analysis, nor for correlational or causal inference, at least not without additional tools or techniques. If you want to use LLMs for quantitative analysis or trading strategies, be prepared to carry out a good deal of fine-tuning and to integrate third-party tools that can actually process numerical data and perform more sophisticated logical reasoning. That said, one of the most exciting prospects is that as research continues to sharpen these models' handling of numbers, causality, and correlation, the ability to use LLMs robustly within financial analysis may well improve.