Not Another Investment Podcast

Big Data's Limits in Financial Markets (S2, E8)

Edward Finley Season 2 Episode 8


Data drives nearly every aspect of modern life, from the algorithms suggesting what you should watch tonight to the autonomous vehicles navigating city streets. Yet in the world of finance—where you might expect data to reign supreme—the relationship between information and decision-making is surprisingly complicated (and relatively new).

Professor Mike Gallmeyer pulls back the curtain on this fascinating paradox, revealing why financial markets present unique challenges for data-driven approaches. While Tesla collects millions of data points daily to perfect self-driving technology, investors working with a century of stock market returns have barely over a thousand data points to analyze. This fundamental limitation—what Gallmeyer calls the difference between "big data" and finance's "small data" reality—creates profound implications for how we should think about investment decisions.

The conversation delves into the historical evolution of financial data, from the pre-1960s era when decisions relied heavily on intuition and "soft information," through the development of the CRSP database at the University of Chicago, to today's sophisticated algorithmic trading systems. Gallmeyer explains how market participants continuously adapt to new information sources, creating an ever-evolving landscape where yesterday's winning strategy becomes tomorrow's conventional wisdom. This endogenous change within financial markets makes them fundamentally different from systems where data collection leads to steady, predictable improvement.

For anyone fascinated by markets, data science, or the intersection of human judgment and quantitative analysis, this episode offers valuable perspective on the promises and limitations of data-driven decision making. Whether you're managing your retirement portfolio or simply curious about how markets function, you'll gain insights into why certain problems remain resistant to even our most sophisticated analytical tools—and where human judgment still provides irreplaceable value.


Show Notes:

Thanks for listening! Please be sure to review the podcast or send your comments to me by email at info@not-another-investment-podcast.com. And tell your friends!

Speaker 1:

Hi, I'm Edward Finley, a sometime professor at the University of Virginia and a veteran Wall Street investor, and you're listening to Not Another Investment Podcast. Here we explore topics in markets and investing that every educated person should understand to be a good citizen. Welcome to the podcast. I'm Edward Finley.

Speaker 1:

These days, there's a lot of skepticism about, say, the past 80 years of leadership that's been driven by data science and an assumption that policymaking can be really precise. The thing is, the track record isn't so great. Since the 1960s we've evolved an immigration policy that suddenly created huge problems for us politically. Today, globalization, which was meant to make sure Americans got good jobs while we shipped the really crappy jobs overseas, has meant that we have a whole cast of people in this country who can't find good work. And financial liberalization in the late 80s and early 90s, some would say, led directly to the great financial crisis. Can we make a case for data-driven policymaking or data-driven decision-making?

Speaker 1:

Well, here to examine the strengths and maybe the limits of data-driven decision-making is Mike Gallmeyer. Mike is a professor of finance at the McIntire School of Commerce at the University of Virginia. His research centers around macrofinance, he teaches classes in markets, and Mike and I co-taught for some number of years a class together on how to run an endowment portfolio. Mike, welcome to the podcast. Thank you, Ed, glad to be here. Let's set the stage for listeners a little, because that intro maybe sounds a little, you know, ephemeral. What's the kind of traditional role of data in financial decision-making?

Speaker 2:

Great, thanks, Ed. So when we think about data in financial decision-making, so many times it's viewed as a basic input to us, where we don't spend time thinking about the quality or the precision of the data. And so you think about some of the most basic things we do in finance: we do valuations, we compute net present values. We ask, is this a good project or is this a good investment? And at the end of the day, all our inputs, in particular projected cash flows, are all driven by data. But many times we rarely think about, well, how great is this data? Where are the sources coming from? How are we doing this forecasting? And you and I both saw this many times with our students through the years: they would always have this false sense of security. They'd see a number written down, it was computed from some model, and then they were always convinced, this is a solid number, let's run with it, it has infinite precision. Look at all the things we can do with it.

Speaker 1:

Our listeners might be surprised to learn that really a lot of data, especially as it relates to, say, market prices, is rather new. When did the University of Chicago develop CRSP? Maybe talk to listeners a little about what that is.

Speaker 2:

Right. So in the 1960s, the University of Chicago took on an endeavor to build a high-quality data set about financial markets: CRSP, the Center for Research in Security Prices. Now, the interesting thing about this data set was it initially only revolved around stocks, and stocks in the United States. And so if we think about this data, it basically starts in the mid-1920s and runs to current times, but the collection itself didn't start back then.

Speaker 1:

You said, I think, that it began in the 1960s. That's correct. So when they're coming together, what did financial decision-makers use for data, or did they, prior to the invention of CRSP?

Speaker 2:

It was very ad hoc, and in some cases they were relying on very little data, right. And so you can think about the conventional Graham and Dodd type approaches to value investing, which many times were done on very sparse data. Now, maybe that wasn't such a big deal back then, in the following sense: at the time, stock ownership wasn't very broad in the United States. It was really a small slice of the population that owned stocks, and a lot of the ownership was closely held by executives who probably had a great deal of soft information about these firms anyway. They were in the middle, the thick, of all this, and so having good-quality financial data would just augment the high-quality soft information they had.

Speaker 1:

So when a guy like John Maynard Keynes is writing in the 1930s and he pens that famous anecdote (or infamous, depending on your point of view) that the stock market is a bit like a newspaper beauty contest, where each participant isn't really trying to discern the most beautiful face but is trying to discern what everybody else will think is the most beautiful face, he was writing at a time when data was really rather limited. That's obviously, you know, the 1930s or immediately post-war, so there was just a lot of intuition, what you're calling soft data, maybe rumor. Does that qualify as the same thing?

Speaker 2:

Not necessarily the same thing, but it's part of it. When I say soft data, I'm thinking in terms of, suppose you're a CEO of a corporation at the time. You're seeing what your suppliers are doing. You're seeing what your customers are doing. You get a feel for what's happening on the production side. You know, if you're running a steel mill in Pittsburgh in this time period, right, you're seeing what supply and demand look like.

Speaker 1:

So right before CRSP is invented, there's really very little data that goes into decision-making. Yeah. And then CRSP is developed. Is that the only data set that gets built, or does that kick off a process in finance of people now wanting to gather data?

Speaker 2:

Yeah, it really kicked off a process, and it also was related to bringing statistics to finance. We can go back to the work of Harry Markowitz in the 1950s, which he won a Nobel Prize for, what we call modern portfolio theory, even though it was derived in the 1950s.

Speaker 2:

Listeners will remember an episode in season one where we go through some of the basics of modern portfolio theory. Right, right. And the interesting thing about Harry's work at the time was he was bringing statistics to bear on an investing problem, and this was one of the earlier applications. And that was novel. That was very novel. Yes, very novel, because we thought about stock returns as random variables now, and random variables that could be modeled. So we could talk about, well, what are average returns on a stock? We could talk about how noisy or how volatile stock returns are. And earlier work largely set this kind of analysis aside, and so this was a great, great departure point.

Speaker 1:

So, in a way, it's kind of interesting to consider which begot which: whether people started collecting data in order to feed models because we began applying statistical methods to financial markets, or the other way around, that we wanted to apply statistical methods to financial markets, but to do it we needed data, so let's go out and get some data.

Speaker 2:

Very much so, very much so. And, of course, in the background of all this is, where did all this data come from originally? It came from the classic places you'd think: it came from the financial newspapers at the time.

Speaker 1:

I see. So then, I think I'm kind of getting the idea of this development. Then you get, I think it was the 1970s, Michael Bloomberg inventing his terminals. I mean, we're getting to a point now where I guess data is very much a part of how markets work. Is that the impetus behind something like that, which we think of today as simply ubiquitous? Yes, exactly. But it wasn't, right?

Speaker 2:

It wasn't.

Speaker 1:

So really, all of this use of data in financial markets is relatively new, new on the scale I was talking about a minute ago, the last 80 years. Precisely, precisely. Okay, so that's the kind of setup. That's the way we thought about data, the way we think about it in finance traditionally. We get a lot of offshoots from that kind of thinking. I suppose one of the offshoots is the efficient markets hypothesis.

Speaker 2:

That's right.

Speaker 1:

How does that play into this development? Now we've got lots and lots of data and we're trying to put it to work.

Speaker 2:

That's right. That's right. And in fact, the inventor of the efficient markets hypothesis, Gene Fama, who won a Nobel Prize for some of this work, interestingly enough, was at the beginnings of the development of the CRSP data. In fact, his dissertation was one of the early uses of the CRSP data in the 1960s. And so one of the things that came along with the efficient markets hypothesis was thinking about how information ends up inside stock prices. In other words, we observe the economy around us and there's all this information being created, and the efficient markets hypothesis was all about thinking about how information gets into market prices. It ends up in prices very quickly, and it's very, very difficult to have an informational edge, to trade on things that you don't think are in prices yet. I see.

Speaker 1:

And so we can only get to that place, to the proposition in the first place of an efficient markets hypothesis, if we've got the kind of data that allows us to examine market prices and changes in market prices statistically and make some assumptions about them. Or is it the other way around: the hypothesis comes after observing the data and trying to draw conclusions out of it?

Speaker 2:

I'd view it more as a two-way street. It's really more of a two-way street, in the sense that, in the background of all this, you have players in markets who are producing information and data themselves, right, and they're producing this to be able to analyze these markets. They're asking themselves: are things fairly priced? Is something out of whack here? Is there something I disagree with here? And they're using data, plus a variety of models they might have, to think about, well, what's the right valuation for, we'll use an example from maybe the 70s, IBM? Is the valuation of IBM sensible right now, or is the market missing something? And when I say the market, I mean everybody else, right?

Speaker 1:

All the other traders. I see, I see. So what you get is: traditional uses of data in finance in part inform the development of financial markets theory. It also helps financial markets theorists prove or disprove propositions. It gets people starting to think about the right price of a security. We start thinking in sort of model terms about what you should be willing to pay for a share of, using your example, IBM, with the data that we have. So data is relatively new, but now we live in a world where big data is the topic. Right, big data. Talk to listeners a little bit about what we mean. First of all, just set the stage. How is that different from what you just talked about, or the same? And what are some non-finance uses for big data that have got people very excited?

Speaker 2:

Right, right. And so this is a very interesting area in the following sense: we've just talked about the fact that finance has a long tradition of using data.

Speaker 1:

Well, long, but still pretty recent as things go. I'm an historian, so when I say pretty recent, I suppose I mean post-Civil War, but we're talking about the 1960s.

Speaker 2:

True, we're talking about the 1960s. But in all fairness, we also have to think maybe a little bit about the history of statistics: many of the tools that we're talking about did not come into play until the 20th century. So many of the modern-day statistical tools that we teach our undergraduates, that we take for granted, really didn't come into play until the 20th century anyway.

Speaker 1:

Okay, all right, fair enough.

Speaker 2:

So this is a broad effect.

Speaker 1:

Fair enough, all right, but I cut you off. So then, big data is different, or not? In what ways?

Speaker 2:

Right. So we have this big emergence of large data sets now. And let's think about some very simple examples of this for a second. We'll think about EVs, autonomous vehicles, and in particular, one of the things that, for example, Tesla works very, very hard at is how they refine the software on a Tesla. And so they're using a variety of advanced algorithms, a lot of machine learning algorithms, in trying to build the software for getting to the place where we have full autonomous driving, right. This requires a massive amount of data. This is big data, right. And what's nice in Tesla's case is, every time Tesla sells a car, they have an opportunity to collect more and more data. Sure, because they have more and more cars on the road, more data to collect, because it's always recording.

Speaker 1:

It's always recording, always collecting data.

Speaker 2:

And another example of it is very simple. Just think about how many people use Waze for navigation nowadays. Waze is great because you're collecting all this data on where everyone's going, right, and so you're building these very, very large data sets.

Speaker 1:

Interesting. I suppose you could say the same about social media.

Speaker 2:

Social media is this, right.

Speaker 1:

The currency, really, I suppose, if you're Meta, is massive, massive amounts of data on people and their preferences.

Speaker 2:

Exactly, and this is where all the value is, right. This is where all the value is. And social media, in particular, was crucially important in the advancements we've made in facial recognition, because the way machine learning algorithms work is they need test data, and they need a massive amount of test data. And when I say test data, what I mean is you need a data set. Let's say we're looking at image recognition. We want to figure out what's a cat, what's not a cat, and we need a big database of photos where someone has correctly identified cats and someone has correctly identified not-cats. And machine learning algorithms use this massive amount of data to train themselves, so that when they come up against a new picture, the algorithm can look at it and say: ah, you've got a possum, not a cat, right?
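The labeled-data idea Gallmeyer describes can be sketched with a toy classifier. This is a minimal illustration with made-up numeric "features" and a simple nearest-neighbor rule, not anything from the episode and not a real image pipeline:

```python
# Toy sketch of learning from hand-labeled examples (illustrative only).
# Each "photo" is reduced to a made-up 2-number feature vector.
def nearest_label(example, labeled_data):
    """Return the label of the closest labeled example (1-nearest-neighbor)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labeled_data, key=lambda pair: dist(pair[0], example))[1]

# The hand-labeled "training" set: (features, label) pairs,
# like the cat / not-cat photo databases discussed above.
labeled = [
    ((0.9, 0.8), "cat"),
    ((0.8, 0.9), "cat"),
    ((0.1, 0.2), "not cat"),
    ((0.2, 0.1), "not cat"),
]

print(nearest_label((0.85, 0.85), labeled))  # cat
print(nearest_label((0.15, 0.15), labeled))  # not cat
```

The point of the sketch is the dependency, not the algorithm: with only four labeled examples the classifier is fragile, and it only gets reliable as the labeled set grows, which is exactly why massive labeled data sets matter.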

Speaker 1:

Right. I mean, a lot of listeners might not realize that when they sign into certain websites and are prompted that the website wants to make sure they're a person, they're asked to select the cells that are the bicycles or the traffic lights. This is CAPTCHA, which really operates to collect exactly this sort of data. Which means the data is not just out there; the data in some respects has to be aided by human collection. We are doing it when we do the CAPTCHA.

Speaker 2:

Yeah, we're doing it. Very interesting.

Speaker 1:

All right. So you've got these massive, massive data sets. Are they just different in size when we compare it to the traditional data sets we use in finance, or are they different in kind?

Speaker 2:

So this is really where the big distinction happens, and it's one of the things I face as a finance professor all the time, because I interact with professors in other disciplines, and in particular people in data science now. And people in data science, if they're outsiders, they're always like: look, we're really, really clever with machine learning algorithms, we've got all these great AI tools, we're going to kick finance's ass, because we've got so many cool tools. Right? And it's this amazing sort of naive view. And it's naive in a few dimensions. One is, everyone tends to not think about the fact that the finance profession, the financial services industry, has been hiring top-notch data people for years, for years.

Speaker 2:

In the early 1990s I worked for JP Morgan briefly. My MD had a PhD in physics; that was not out of the norm at the time. Across the trading floor, the number of PhDs in the early 90s was already big, and they were already working on very, very hard problems. So point one is, Wall Street's always been doing this kind of thing. They just, of course, don't advertise this. This is part of their secret sauce. I see. But secondly, it comes back to the big data issue.

Speaker 2:

In so many parts of finance, it simply isn't big data. And it's simply not big data for a very, very simple reason. Let's think about financial market returns for a second, let's think about the S&P 500. How many realizations of returns do we really have? Let's suppose we can get decent data on US equity markets back to the mid-1920s. We're sitting here in 2025 today, so let's say that's roughly 100 years' worth of data. And, depending on what you want to use, most people like to use monthly data.

Speaker 1:

So we have monthly data: 12 months, 100 years' worth of data. We have 1,200 data points. That's it. That doesn't seem like a lot. It's not a lot, especially when you compare it to the things you were describing a minute ago, like a Tesla. I don't know how many Teslas are on the road today, but 24/7 they're collecting data points, they're building massively large databases. Whereas in finance, many times the things you're really interested in zeroing in on are stock market returns, bond market returns.
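The back-of-the-envelope count can be written out explicitly. The monthly figures come straight from the conversation; the per-second logging rate for a car is an illustrative assumption, not a figure from the episode:

```python
# A century of monthly stock returns versus one day of per-second
# sensor logging from a single car (illustrative comparison).
years_of_data = 100                    # roughly mid-1920s to 2025
monthly_observations = years_of_data * 12
print(monthly_observations)            # 1200 -- finance's "small data"

# Assumed logging rate: one reading per second from one car.
seconds_per_day = 24 * 60 * 60
print(seconds_per_day)                 # 86400 readings per day, per car
```

One car logging once a second outstrips a century of monthly market returns in under a quarter of an hour, which is the asymmetry Gallmeyer is pointing at.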

Speaker 2:

The point is, if I want more data, the only thing I can do is wait. I've got to hang around and just wait. Or you might push the envelope and say, well, the stock market data in France has always been a little bit sketchy; maybe we can go back in time and try to rebuild some of that data.

Speaker 1:

So I've read a bunch of the literature out there, for example Dimson, Marsh and Staunton, that purports to have data about stock market returns in every market in which there was a stock market, going back to sort of the 18th century. Am I taking your point fairly when you say: maybe, but that's still not big? Exactly.

Speaker 2:

It's still not big. It seems big. Okay, so, great. If we go back to the work that Dimson, Marsh and Staunton have done: basically, their data usually starts about 1900, and what they try to do is look at parts of the world that had stock markets, so it's largely dominated by Europe. And so you start to think about, well, how many countries do you have in that data set? You have 40, 50 countries, most likely, roughly around there. And so now, you know, take your 40, 50 number and multiply by 1,200, if we have monthly data points again. We still don't have an incredibly large amount of data.

Speaker 1:

Certainly not as many data points as you would have collecting every right turn and every stop sign of a Tesla driving on the road. Precisely. Interesting. Precisely so. The data in finance is small, and maybe that's a feature of just how long markets have been functioning, or is there another reason why it's small? Why can't we have more of this Tesla-like data, where people are measuring minute by minute, second by second, everything that happens in markets?

Speaker 2:

Well, we have small examples of this, right. So we have small examples. We have it in credit data, individual household credit data. If you think about the credit bureaus, Experian, TransUnion, think about the credit agencies for a second: they have gigantic cross-sections of households. These are places where you do have bigger data sets. But again, once it comes back to market prices and we're looking at trading of securities, we still don't have a ton of data. Yes, there are some circumstances where we can look at, for example, stock market return data at very high frequencies, with the arrival of high-frequency traders starting almost 20 years ago now. These are very high frequency in the sense that you have data down to milliseconds. But the problem is, it's not so useful for most applications, because it's very noisy, because it's all driven by these trading processes.

Speaker 1:

Well, this brings us to the second idea. The first was that, in finance, data is going to be smaller than what we're accustomed to seeing in the world of big data. That's right. The second is what is sometimes referred to as the signal-to-noise ratio. First set the listeners' baseline of what we mean when we say that, and then tell us how financial data maybe differs from big data in this respect.

Speaker 2:

Right, right. So when we think about data, it's common that people use the terminology of a signal-to-noise ratio in the data, and what they mean by that is: what can I learn? So we'll take stock returns as our example here for a second. Suppose I'm investing in US large caps, and what I'd really like to know is, well, what's the average return that I can expect by investing in US large caps? And that's actually not so easy to figure out. Basic Statistics 101 says the following: well, collect some data, collect returns, say monthly returns, and then just compute a simple average. That's an estimate of the mean, the average return we would expect.

Speaker 2:

The issue is, because financial markets are driven by an immense amount of noise, that mean estimate we come up with is incredibly noisy.
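Just how noisy can be made concrete with the standard error of the sample mean. The 16% volatility figure below is an illustrative assumption for US large caps, not a number from the episode:

```python
import math

# Illustrative assumption: ~16% annualized volatility for US large caps.
annual_vol = 0.16
years = 100                      # a full century of annual observations

# Standard error of the estimated mean annual return: sigma / sqrt(n).
se = annual_vol / math.sqrt(years)
print(se)                        # ~0.016 -> the mean is pinned down only
                                 # to about +/- 1.6% per year

# A rough 95% confidence interval is +/- 2 standard errors, i.e. about
# 3 percentage points either side of the sample average -- enormous when
# average equity returns are themselves in the single digits.
```

This is why "just wait for more data" is such a slow fix: halving that uncertainty requires quadrupling the sample, i.e. another three centuries of returns.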

Speaker 1:

And by noise, let me see if I've got the sense of what you're talking about. By noise, you mean the participants in markets are, minute by minute, second by second even, trying to discern what each stock is worth, and that's part prediction but also part intuition. And the degree to which those views are moving around, second to second, minute to minute, isn't telling you what the average return will be. What you're really observing is people trying to figure out what the average return will be. Is that fair? That's noise.

Speaker 2:

So there's price movement.

Speaker 1:

That's really the activity of people trying to find what the right price is. But there's also, in all of that data, the average, what the thing is actually going to be worth.

Speaker 2:

That's signal.

Speaker 1:

Exactly, that's signal. I understand, that's signal, all right. So, signal to noise. And in big data, what does that look like?

Speaker 2:

So in many, many big data applications, the signal-to-noise ratio is much, much higher. You can have engineering applications where signal-to-noise ratios are close to one. Even in things like facial recognition, signal-to-noise ratios are super high in this data. Or, back to our example of trying to figure out whether there's a cat in the picture or not. These are applications where signal to noise is quite high.
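For a rough sense of scale, the per-observation signal-to-noise ratio for equity returns can be approximated as the mean divided by the volatility. The 8% and 16% annual figures are illustrative assumptions, not numbers from the episode:

```python
import math

# Illustrative assumptions for US equities: ~8% annual mean return and
# ~16% annual volatility, scaled down to a single monthly observation.
monthly_mean = 0.08 / 12
monthly_vol = 0.16 / math.sqrt(12)   # volatility scales with sqrt(time)

snr = monthly_mean / monthly_vol
print(round(snr, 2))                 # 0.14 -- each monthly observation is
                                     # overwhelmingly noise, nowhere near
                                     # the near-1 ratios of engineering data
```

Under these assumptions, any single month's return tells you almost nothing about the true average, which is why averaging over decades is unavoidable in finance.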

Speaker 1:

But in financial markets? Wait, I just want to touch on that for a second, dig in a little. Is that because the nature of the signal is more objectively identifiable, objectively provable? Or is it because of how much data there is, or some other, third reason? Why is the signal-to-noise ratio higher?

Speaker 2:

The signal-to-noise ratio in a lot of big data applications is much higher because it's easy to discern. Coming back again to our example: it's objectively identifiable. Exactly. It is a cat or it isn't. Right, right, okay, gotcha.

Speaker 2:

You and I can sit there and look at the pictures, we can flip through them and go: cat, cat, dog, cat, cat. Right, sure, sure. And with, like we were talking about, the Tesla data, it's a stop sign or it's not a stop sign.

Speaker 1:

It's a traffic cone or it's not a traffic cone. Exactly, uh-huh, exactly.

Speaker 2:

Financial market data though.

Speaker 1:

Right.

Speaker 2:

We sit there and we look at returns, and who knows? We look at them and it's like: I don't know if that came from a stock, I don't know if that came from a bond. Right, it's very hard.

Speaker 1:

Well, or, to the point I was making earlier, the elements that contribute to the price are part fact, objectively identifiable, and part speculation, trying to figure out what the future will look like. Exactly. It seems you don't have the objectively identifiable thing to confirm. I see, that's right. So that's why the signal-to-noise ratio is so different: it's in part due to what it is financial markets are doing. We're not identifying cats and stop signs. Exactly. We're trying to predict the future.

Speaker 2:

Exactly. We are trying to discern what are good investments, what are not good investments, and what does that mean?

Speaker 1:

Concretely, in terms of applications that try to use big data in financial markets. Well, first, maybe can I just pin that for a second and ask kind of a footnotey question. Many listeners may even work with large financial institutions, where those folks will map out for them an asset allocation that will be suitable for their goals. Would we say that those are applications of big data, or not so much?

Speaker 2:

Those aren't really applications of big data at the end of the day.

Speaker 1:

What do you give as an example for people who want to put their finger on a big data application in finance to get a sense for what that means?

Speaker 2:

So really, the big success stories in big data in finance to date have revolved around (and this is more in the vein of institutional traders, so large institutions) trade execution. Much of the really good work has been there: I need to sell a large quantity of shares, what's the optimal way to do this? And this is where big data applications have been useful, in the high-frequency domain.
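For a flavor of the execution problem, one classic and much-simplified idea is to slice a large parent order evenly over the trading day (a TWAP-style schedule). This sketch is purely illustrative and not a description of any firm's actual algorithm:

```python
# Minimal TWAP-style sketch: split one large parent order into
# near-equal child orders spread across the day (illustrative only;
# real execution algorithms adapt to order-book data in real time).
def twap_slices(total_shares, n_slices):
    """Split total_shares into n_slices child orders differing by at most 1."""
    base, remainder = divmod(total_shares, n_slices)
    return [base + (1 if i < remainder else 0) for i in range(n_slices)]

slices = twap_slices(100_000, 13)   # e.g. one child order per 30 minutes
print(sum(slices))                  # 100000 -- nothing lost to rounding
```

The point of slicing is to avoid dumping the whole position at once and moving the price against yourself; where the big data comes in is deciding, from market conditions, when and how to deviate from this naive even schedule.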

Speaker 1:

And so, is that the likes of Citadel and so on, who make massive markets over the counter? Exactly.

Speaker 2:

Yes, exactly. So Citadel Securities would be sort of a big player in this. And Citadel Securities is actually a very interesting example. So if we take the trading platform Robinhood for a second. Sure. Robinhood has plenty of our students on it.

Speaker 1:

Okay, for better or for worse? Better or for worse.

Speaker 2:

No slight to Robinhood out there.

Speaker 2:

Right, no slight to Robinhood at all. But there are plenty of people who are thinking about markets in this gamified way, and they're drawn to platforms like this. The interesting thing is that the way Robinhood makes money is this idea called payment for order flow. And payment for order flow basically means they route their trades to specific firms, and the firms pay for these orders. So, for example, Citadel Securities will pay Robinhood for these orders. They will pay large amounts for these orders.

Speaker 1:

So let me see if I've got this right. Citadel will pay for that trade flow because they've got this big data and an algorithm that helps them very precisely figure out the timing and trading, and they make money on the trade. Is that why they pay for this order flow?

Speaker 2:

Exactly. They pay for this order flow because they have this technology that they can use to say: ah, these are the orders coming in. And the other reason, which we haven't emphasized at all, that they're very willing to pay a lot for these orders is the fact that, ex ante, they have a pretty strong view that most of the traders coming from Robinhood are pretty uninformed, meaning they don't have some informational edge over Citadel. These are largely retail traders. They're trading in ways that probably aren't really well grounded in thinking hard about the fundamentals of these securities. More noise, less signal.

Speaker 2:

And this is beautiful for Citadel. Who Citadel doesn't want to trade against are informed traders. I see.

Speaker 1:

Traders with a lot of information. So is it fair to say, then, that in finance, big data has some really potent application at the micro scale? Exactly. Yeah. But that's different than having some application in the things that I gave as an illustration a minute ago. So what should your asset allocation be? Well, that's going to be a function of predicting each asset class's average returns, average volatility and average correlations, and at that scale, not micro, there's much more noise, much less signal. Exactly. And so limited application. Exactly.

Speaker 1:

One of the other things you alluded to a little bit ago was the efficient markets hypothesis, and how this attempt to understand how market pricing really works for securities depended a fair bit on data. There needed to be data in order to prove out the proposition. But it seems to me that the efficient markets hypothesis also contains within it a very interesting nodule: that the market is constantly changing and evolving. Can you talk a little bit about that evolution of markets and how it plays into the use of data?

Speaker 2:

Right. Especially as a professor teaching this material, one of the traps that students always fall into is that they treat the efficient markets hypothesis as some monolithic thing. I don't know what you mean. Yeah, you don't know what I mean, right.

Speaker 1:

They treat it as this monolithic thing, any more or less than the CAPM. Yeah, yeah.

Speaker 2:

They get this false sense of security where they feel like it's a physics class and this is a law, a law of motion about markets. But the key thing about the efficient markets hypothesis is that we are not talking about a static level of market efficiency. Markets are always evolving, in the sense that industries are always changing. Think about the dynamism of the US economy. Go back to the 1970s: the largest firms on US stock markets were largely firms built on tangible capital. Think manufacturing firms, firms like 3M, DuPont, ExxonMobil. These were the big firms. Now fast forward to today: the big firms on US stock markets are all tech firms, and they have very little physical capital. It's all intangible capital now. So we've had this giant morph in what successful large corporations even are, and that's just one theme of many in how markets dynamically change all the time. We've also had what I'll call a big bang in finance in the following sense: trading costs have gotten very, very small.

Speaker 2:

Right. Go back to the 1960s: a round-trip trade for 100 shares in equity markets back then would have cost you $200 in commission. It would have been a small fortune. Yeah, it's material. Very material, and we can do these trades at much, much lower cost today. Now, is it free? No, it's not. We come back to that payment for order flow again. It isn't free; there are still trading costs there.

Speaker 1:

It just may be that the trader on Robinhood isn't paying for it, but someone's paying for it.

Speaker 2:

Someone's paying for it. Yes, someone is paying for it. So markets continue to evolve, and they also continue to evolve in the sense of what type of information is important and where we are in economic regimes. You can just think about the turmoil in markets right now: we've done this full tilt, and now we're thinking hard about what the implications of tariffs really are for all these firms.

Speaker 1:

That's a great point. It's too bad we didn't talk about it earlier, but it's just as good to talk about it here. We're recording on April 24th, toward the end of April, and we've gone through a really peculiar period in which policy is very unclear, and when policy is very unclear, the market's ability to predict the future is even less clear than usual. So there's a lot of noise in the way markets are behaving. So I get you the efficient markets hypothesis.

Speaker 1:

By its very nature (why I said nodule; maybe that's the wrong word to use), it imagines that market participants are engaged in trying to collect as much information as they can in order to predict what the true value of a security is. Embedded in that statement is the idea that, as time goes on, the things that will tell you what a security is worth change. That's right. And therefore it isn't some static notion. Maybe the hypothesis is static, maybe the notion of trying to incorporate all relevant information into price remains the same, but what that information is changes. What, then, is the impact on data? If that's the case, does data become less potent or more potent in financial markets when we've got that kind of evolution?

Speaker 2:

It's a continued vicious cycle in the following sense: you're always trying to figure things out. Suppose you're a hedge fund and your strategy is based on having some informational edge, and that informational edge comes from trying to use data in interesting ways. In some sense, you can think about this as a bit of an arms race, because if I can find data, or use data in a way that no one else is doing yet, this could be super useful for me. We can think about social media examples for a second. Already, five-plus years ago, some of the large multi-strategy hedge funds were engaged in looking at places like Instagram. They were looking at influencers and what products the influencers have in their pictures, and asking: does this have an impact on trading LVMH?

Speaker 1:

Yeah, sure, what are people buying? Yes, what are people buying.

Speaker 2:

Learning about things in nontraditional ways. I gotcha.

Speaker 1:

Okay, I like that illustration: there are ways in which big data can play a role in an evolving market. It's just another source of information, and here I guess we're really talking about properly big data. You used Instagram as an example. But I'm not hearing you say, and let's just see if the negative is true, I'm not hearing you say that big data is finding its way into more traditional finance applications like portfolio construction. Is that fair?

Speaker 2:

That is very fair. We're still faced with the plague that it's still largely a realm of small, or even medium, data. And, coming back to the idea of how markets have evolved, one area where markets have evolved greatly is in alternative investments, in private capital markets: the rise of firms like Apollo and Carlyle and KKR. They're all involved in strategies where you don't see daily prices. It's not like I can dive into, say, Blackstone's portfolio, look at their real estate private equity vehicles, and see a day-by-day valuation on these things. I can't go find those prices. Those assets, however, have become much more important to investors, and in some sense it's a world where we're still trying to figure out even what average returns look like in those spaces.

Speaker 1:

Well, yeah. Listeners will remember the couple of episodes we did in season one in which we unpacked the ideas behind the value of investing in private equity, and one of them, at least, is that those assets are not traded regularly, not priced regularly. Instead, there are quarterly valuations on a one-quarter lag. So I hear you that this is an innovation that makes data in financial markets even less tractable, let's call it.

Speaker 1:

Is it the case, though, that maybe big data (I'm going to use that phrase, as you've been using it) is something that KKR is using in terms of evaluating its potential portfolio companies and which companies they should buy and not buy, or is that maybe a bridge too far?

Speaker 2:

Everyone's experimenting right. Everyone's experimenting right now, and they're trying a variety of things. Some things will be useful, some things won't, and we're still feeling our way through this. So, yes, there are some circumstances. We could take even a simpler example.

Speaker 2:

One thing that came about sort of very early on in the attempts to use sort of some of the big data tools was just analyzing earnings calls.

Speaker 2:

And so, early on, and this very much comes back to how markets evolve, people started to do very mundane things, and the mundane thing you would do is just word counts.

Speaker 2:

So you would take an earnings call and you would just do word counts. Suppose we wanted to try to analyze tariff uncertainty for firms. We could take earnings calls, pick out references to tariffs, maybe negative terms about tariffs, and build simple word counts on this. Then the tools got more advanced. People started to run machine learning algorithms across this data to try to ask: can I learn something out of this that the CEO or the CFO might not be conveying in these calls? But then, of course, you take the flip side of this: the firms themselves are realizing that everybody's doing this. And so now there are consulting companies that consult to firms On the words to use.
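The word-count idea can be sketched in a few lines of Python. The snippet below uses an invented earnings-call excerpt and hypothetical term lists; real studies use curated financial dictionaries and full transcripts.

```python
import re
from collections import Counter

# Invented earnings-call excerpt (not a real transcript).
call = """
Thanks, everyone. This quarter, tariff headwinds raised our input costs.
We expect tariffs to remain a source of uncertainty, and we are
adjusting our supply chain to reduce tariff exposure.
"""

# Hypothetical term lists; real dictionaries are much larger.
TARIFF_TERMS = {"tariff", "tariffs"}
NEGATIVE_TERMS = {"headwinds", "uncertainty", "costs"}

# Lowercase, split into words, and count term occurrences.
counts = Counter(re.findall(r"[a-z]+", call.lower()))
tariff_mentions = sum(counts[w] for w in TARIFF_TERMS)
negative_mentions = sum(counts[w] for w in NEGATIVE_TERMS)

print(tariff_mentions, negative_mentions)  # prints: 3 3
```

The two counts would then feed a simple "tariff uncertainty" score per firm per quarter, which is exactly the kind of mundane signal the early papers built.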

Speaker 1:

On the words to use. Well, I'm not surprised. There are a couple of papers that I read a few years back that I'll drop in the notes for listeners who are interested, in which the authors sought to use large language models to interpret quarterly financials and listen to earnings calls, and then predict what they thought the stock price would be, and they compared that over time with analyst predictions. Right, and they measured the overs and unders. The authors made a very big deal about how much better the AI large language model was than the average analyst. I don't know. Looking at the same data, I drew a different conclusion; I didn't think it was that notable. But it's worth saying at least that it's out there.

Speaker 1:

Yes, it's out there and people are trying to apply it, and I would say AI and large language models fall into that category of big data.

Speaker 2:

So I'm not talking about big data, but the data that we have in markets may be sufficient; it's the interpretation that's hard. Let's think about trying to figure out what GDP growth is for this quarter. This is a great place to think about this. The US government collects, through many parts of the executive branch, the Bureau of Labor Statistics or even the Fed, an immense amount of data about the economy that comes in all the time: jobs reports and so on. And if you just think about when these reports come in, they all come in with different timing and different horizons. Some of the jobs data we see is down to weekly numbers, whereas things like industrial production numbers are, at best, monthly data; typically we think about them more like quarterly. This is all data that, when it comes in, we're all sifting around trying to figure out.

Speaker 2:

Like, what does it say? When job numbers come out, one number talks about, say, listings of new positions, and then you also have unemployment data that comes out, and the two just don't seem to match each other. And so one of the struggles we always face is trying to take each one of these bits of data, and I like to think of them as puzzle pieces, each one perhaps handing you a puzzle piece. And Lord knows what you got. You might have a corner, you might have something from the middle of the puzzle. Please, God, it's not sky.

Speaker 1:

Yeah, yeah, yeah, exactly Damn sky Cause I don't know where.

Speaker 2:

I don't know where this goes. Or is that ocean? I don't know, maybe it's ocean. And so there actually has been some recent work done. It started before COVID and it's being looked at more again: some of the central banks are engaged in exercises they call nowcasting. This nowcasting idea is really super interesting because, instead of forecasting, we're just trying to figure out where we are right now in the economy. In some sense, think of it as a barometer on the economy right now.

Speaker 2:

Here's what economic growth looks like, this is what GDP growth looks like. It's an attempt to use all of these puzzle pieces, all these different pieces of macro data that were released, and ask: how informative are they about filling in the puzzle? And what's fascinating about this stuff is the techniques used, I'll call them medium data techniques. They all revolve around the fact that data sets aren't infinitely large, they're small. But all of this macroeconomic data, whether it's about jobs, industrial production, or surveys of businesses and consumers, consists of puzzle pieces about the overall economy, and it's about using all of that data in a sensible way to say: I get this puzzle piece, I'm looking at it, and I can try to figure out that it's probably sky, not ocean.
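One textbook way to combine noisy "puzzle pieces" is to weight each signal by its precision, the inverse of its noise variance. This is a hand-rolled illustration with invented numbers, not the Fed's actual nowcasting machinery, which uses richer dynamic factor methods:

```python
import numpy as np

# Invented current-quarter GDP growth readings (annualized %), each from
# a different indicator with its own noise level (variance).
estimates = np.array([2.5, 1.8, 3.1])  # e.g. jobs, industrial production, surveys
variances = np.array([0.4, 0.9, 1.6])

# Precision weighting: trust the low-noise signals more.
precision = 1.0 / variances
weights = precision / precision.sum()
nowcast = float(np.dot(weights, estimates))
print(round(nowcast, 2))  # the jobs reading dominates because it is least noisy
```

The combined estimate lands nearest the least-noisy signal, which is the intuition behind treating some puzzle pieces as more informative than others.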

Speaker 1:

Is that because, in any one application, the data is small, but when you've got lots and lots of different applications, consumer sentiment, business surveys, labor surveys, hiring and firing, in the aggregate, that's what makes it medium?

Speaker 2:

Exactly, that's what makes it medium, and there's an interesting part that central bankers have found on this. A great example is that the New York Fed has a model, which they call the Nowcast model, and what it's meant to do is give you a real-time view of what GDP growth is for this quarter. One thing to remember about this macro data is that if I want to know the first quarter's GDP growth, that number is not going to get announced until after the quarter ends. And then, nine times out of 10, by the way, it's going to be revised later on, either up or down. So in this work by some of the Fed banks, the New York Fed has a model, the Atlanta Fed has a model, all they're trying to do is ask: where are we right now in the economy? And the interesting part is that I know-

Speaker 1:

Is the noise to signal or the signal to noise ratio any better when you're doing now casting as opposed to forecasting?

Speaker 2:

I think the way to think about it is slightly different, and I'll set the signal-to-noise ratio stuff aside for a second. What you tend to find is this: suppose I have around 100 or 200 macro measures, and these are the sorts of numbers they have in these types of models. So you have 100 to 200 signals. The interesting thing is that when you crunch this data, what you find is that there's a handful of underlying factors that seem to be driving GDP growth. Now, they're dynamically changing; they're not static through time.

Speaker 2:

For example, we've been coming off of this elevated inflation regime lately, and so over the last few years it's been all about inflation, and things that are correlated with, or driven by, these inflation concerns became more and more important in these models. But the key thing is that a handful of latent, or unknown, factors are the drivers of these hundreds of macro variables. And so what people at some of the central banks have been able to do is figure out how to take all of these different signals and distill out of them the handful of informative pieces that teach us about where we are in the economy right now. And you're saying that changes.
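The latent-factor idea can be illustrated with a simulated panel: if a few common factors drive many macro series, principal components will recover most of the variation. Everything below is simulated for illustration; real nowcast models use dynamic factor models, not plain PCA.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 200 quarters of 150 macro series driven by 3 latent factors,
# plus idiosyncratic noise (all numbers invented for illustration).
T, N, K = 200, 150, 3
factors = rng.normal(size=(T, K))
loadings = rng.normal(size=(K, N))
panel = factors @ loadings + 0.5 * rng.normal(size=(T, N))

# Principal components via SVD of the demeaned panel: a handful of
# components should explain most of the variance across all 150 series.
demeaned = panel - panel.mean(axis=0)
_, s, _ = np.linalg.svd(demeaned, full_matrices=False)
explained = s**2 / np.sum(s**2)
print(f"Top {K} components explain {explained[:K].sum():.0%} of total variance")
```

With 150 noisy series but only 3 true drivers, the top 3 components capture the bulk of the variance, which is the "hundreds of signals, handful of factors" point.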

Speaker 1:

And that changes. And it changes because they change their model, or that changes because the model is adaptive.

Speaker 2:

It changes because the model is adaptive.

Speaker 1:

It's flexible enough, yes. We've left out of this discussion, and it's kind of a neat point to raise now, that whether you're talking about financial markets or the economy, they're both complex systems, which by definition means they have endogenous change: change from within, not only change from without. And when you've got a complex system where change can be fomented from within, it's very, very different from the kinds of things we were talking about earlier where big data has such strong application, because when things can change from within, it's super, super hard to predict the factors that are going to drive that change. It's almost, dare I quote Donald Rumsfeld, the unknown unknowns.

Speaker 2:

Exactly.

Speaker 1:

Okay, mike, that's really, really useful and, I thought, super interesting. Can I end with one last question, and this is a question that I ask all of my guests and that we hadn't talked about before. But share with listeners your pearl of wisdom. It can be something related to markets and finance, it can be personal, it can be yogi-like, whatever happens to be on your mind, but what would you share with listeners as your pearl of wisdom?

Speaker 2:

So part of my career, of course, is educating, or poisoning, minds.

Speaker 1:

I'm not sure which. It's probably poisoning minds more than anything else.

Speaker 2:

Better than poisoning pigeons in the park. Exactly, exactly. But you know, there's a common theme that I find so important, and it hasn't changed with students. I've been a professor now since 1998, and people are always asking me, well, how have students changed? There's one thing about students that is still super, super strong: they're all intellectually curious. And if you don't stay intellectually curious, it's just absolutely no fun.

Speaker 2:

And so one of the things that I see over and over again with my students, which I'm still very impressed with: there's a set of them out there who are still reading, and reading interesting things about a broad number of topics. Don't pigeonhole yourself into "I'm just going to read about finance and economics." Learn more broadly, and you can take so many of those lessons to the task at hand, or just for your own knowledge and gratification, to learn something new all the time.

Speaker 2:

I'm always fascinated by the fact that there are people who really aren't interested in doing this, and it just feels somewhat soulless to me. So pick up that book, pick up that article, a Substack, whatever, but make sure you're reading interesting things. Don't get stuck in your own lane, not thinking about alternative views and what others are talking about. Now, I'm spoiled. I'm a university professor. I get to be surrounded by this all the time, and I know lots of people work in areas where they can't get this exposure. And what's beautiful is that one thing that has changed is we have podcasts, we have Substacks nowadays, where we can learn things we never used to be able to. Or you can try your luck and put it into a large language model and hope it's not hallucinating with you, and if it is, hopefully it's a good hallucination.

Speaker 1:

Mike, that's terrific. Thanks so much, this was great.

Speaker 2:

Thank you, Ed Greatly enjoyed it.

Speaker 1:

You've been listening to Not Another Investment Podcast, hosted by me, Edward Finley. You can find research links and charts at notanotherinvestmentpodcast.com. And don't forget to follow us on your favorite platform and leave comments. Thanks for listening.
