Training data is the input that teaches a model what to predict (mentioned as the most valuable data source at 1:04:02 of the Invest Like the Best podcast). At CircleUp we predict the future revenue growth of early-stage private companies, so our training data is ground-truth financial data from such brands. This is a tremendous barrier to entry. For this training data to be effective, we need (1) enough volume and (2) a representative, variable sample. Volume allows for better predictive performance, while a representative, variable sample makes the model more generalizable and robust across different scenarios.
Now bear with me for a brief tangent (shoutout to Ashlee Bennett for the inspiration). Imagine you and your friend (an algorithm) are walking through the jungle. You turn a corner and see a tiger. What do you do? You have never seen a tiger in person, but instinct kicks in, so you run as fast as you can. You survive. The algorithm has seen a handful of tigers but has no reason to believe they are dangerous, so it doesn't move. The tiger attacks.
You clearly outsmarted the algorithm. But why?
For the algorithm to recognize danger in this situation, it would have to have been trained on thousands of instances of someone being attacked and mauled by a tiger (that is the volume point). But there are fewer than 85 tiger attacks per year. On top of that, the algorithm would have to have been exposed to maulings at different times of day, with different backdrops, and with different-sized and different-colored cats to account for all the possible variances (that is the representative-sample point).
In short, algorithms aren’t all that great at predicting very unlikely events because there typically isn’t enough representative training data to generate a reliable prediction.
Now to bring it back to private-market investing: the perfect training data would be a clean and robust history of company exits. But, much like tiger attacks, there isn't enough training data (across size, industry, time, deal performance, etc.) to train such a model rigorously. Imagine we observe two successful popcorn exits in one year, and both companies happen to have founders born in South Dakota. Suddenly the model only surfaces popcorn companies originating in South Dakota, because 100% of the previous year's popcorn exits matched that spec. It's a clear spurious correlation. To solve the sample-size problem, we instead use revenue growth as our objective function: it is the metric most correlated with exits, and it has a direct relationship with enterprise value in CPG.
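You can see how easily noise masquerades as signal at small sample sizes with a quick simulation. This is an illustrative sketch, not anything from our actual models: it measures how often a purely random binary trait (think "founder born in South Dakota") happens to match the outcome on every observed company.

```python
import random

random.seed(42)

def spurious_match_rate(n_samples: int, n_trials: int = 10_000) -> float:
    """Fraction of trials in which a random, irrelevant binary feature
    happens to hold for every one of the n_samples observed successes."""
    hits = 0
    for _ in range(n_trials):
        # The feature is pure noise, independent of the outcome (p = 0.5).
        feature = [random.random() < 0.5 for _ in range(n_samples)]
        if all(feature):  # the noise feature "explains" 100% of the exits
            hits += 1
    return hits / n_trials

# With only 2 observed exits, noise perfectly "predicts" the outcome
# about a quarter of the time; with 1,000 observations it never does.
print(spurious_match_rate(2))
print(spurious_match_rate(1000))
```

With two observations, any coin-flip trait matches both about 25% of the time, so a model has no way to distinguish South Dakota founders from real drivers of exits; with thousands of observations, that coincidence essentially vanishes.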
We have accumulated (and continue to collect) training data from tens of thousands of companies across industries and stages to test for meaningful signal. That is not to say it is an easy task: a typical consumer fund only looks at hundreds of deals (and financial statements) per year. To capture this information at scale, we have built a standardized way to validate and ingest such training data so it is ready for the data science team.
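To make the "validate and ingest" step concrete, here is a minimal sketch of the kind of check such a pipeline might run before a record reaches a data science team. The field names and rules (`company_id`, `period`, non-negative `revenue`) are hypothetical, not CircleUp's actual schema.

```python
# Hypothetical ingestion check: field names and rules are illustrative only.
REQUIRED_FIELDS = {"company_id", "period", "revenue"}

def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record
    is clean enough to ingest."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append("missing fields: %s" % sorted(missing))
    revenue = record.get("revenue")
    if revenue is not None and (
        not isinstance(revenue, (int, float)) or revenue < 0
    ):
        errors.append("revenue must be a non-negative number, got %r" % revenue)
    return errors

clean = {"company_id": "c1", "period": "2020-Q1", "revenue": 125_000}
dirty = {"company_id": "c2", "revenue": -50}

print(validate_record(clean))   # no errors
print(validate_record(dirty))   # missing period, negative revenue
```

Standardizing records this way is what lets thousands of heterogeneous financial statements feed a single training pipeline instead of requiring per-deal manual review.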