In 1886, Atlanta pharmacist John Pemberton combined a wide range of ingredients to create the now world-famous Coca-Cola. Coca-Cola has kept its soft drink formula a secret ever since, and today the recipe sits locked away in a vault in Atlanta. While countless other soft drink manufacturers have tried to create their own versions of Coke, none has managed to replicate Pemberton’s original formula.
CircleUp is an investment platform fueled by Helio, but unlike Coke we don’t need to lock Helio’s ingredients in a secret vault to make it defensible. Helio comprises billions of data points and proprietary models that identify and evaluate breakout consumer brands as investment opportunities. Helio continues to evolve and improve, and we are confident that it lets us look at the consumer industry today in a way that no one else can.
Although we’ve been fairly transparent about the data components that go into Helio, we also want to talk about some of the complexity that goes into building it and why we think this makes Helio defensible.
Scale and complexity of the data sources
Helio systematically collects data from hundreds of sources to identify and continually evaluate almost every consumer and retail company in North America: more than 1 million companies, and growing every day. We believe this holistic, bottom-up understanding of the consumer landscape is critical to identifying patterns and predicting performance at a scale that would not be possible if we relied on each company’s proprietary financial information.
The massive scope of these data sources requires a significant investment in our data pipeline to extract and transform large, unstructured — and often messy — data sets. We process many terabytes of data every single month. For perspective, the amount of data we collect and transform every month is the equivalent of thousands of copies of the Encyclopedia Britannica.
Additionally, much of the data in our models was collected years ago and is no longer available. Because others cannot go back and collect many of these data sets retroactively, this creates a significant barrier to entry.
The data also has network effects, another meaningful barrier to entry. The more data we add to our models, the more accurate they become. As an example, the chart below shows the accuracy of our revenue estimation model as a function of the percentage of data used in the model.
We anticipate that this model will become even more accurate as we ingest new data sources. We’re confident that our technology will continue to improve, and at an ever-increasing pace.
Proprietary training data
The mix of public data, partnership data, and practitioner data utilized by Helio allows us to evaluate companies and predict outcomes at scale while still being grounded in empirical evidence. This drives a feedback loop that continuously improves model performance as we scale our business offerings and add more data to Helio.
Although financial data about public companies is readily available, private company data is more elusive. At CircleUp, we’ve had the privilege of working with many companies over the past several years, which has given us a glimpse into private financials. We never disclose this information publicly or to third parties, but it can be used as labeled training data to improve the accuracy of our models’ predictions. We call this our practitioner data, and it’s a critical element of how we predict and prescribe outcomes.
We’ve analyzed the accuracy of Helio’s revenue estimation model when trained on public and partnership data alone compared with its accuracy when we add practitioner training data, which you can see illustrated below.
The x-axis of the graphs represents the actual revenue for a sample of companies while the y-axis represents our model’s predicted revenue for those companies. As you can see, the predictions are much more accurate with the addition of the practitioner data. Additionally, the model trained on the practitioner and non-practitioner data is much more generalizable, meaning that insights can be extrapolated to companies beyond this dataset. For this model, 80% of the predictive power comes from the variables that are the most difficult to get.
Built for an omnichannel world
With the rise of the internet, customers now have more purchase channels available than ever before. Consider a product like shampoo. Today’s consumer can buy shampoo at their local beauty store, order shampoo through grocery delivery apps, buy it through online retailers, purchase it directly from a brand’s website, or several other channels.
A wealth of important data on a brand is still created in the “real world.” The stores where a product is sold, the number of products (or SKUs) a brand produces, the price points of those SKUs, the packaging and product positioning, the ingredients, and the sales velocity are all vitally important indicators for understanding a brand’s performance and potential.
But recently, online data about a brand has become equally important. A brand’s positioning, its presence on social media, the way consumers talk about and review it, and the presence or absence of online sales channels are all meaningful indicators as well.
Linking these offline and online data points is not an easy task, but it is critical for a holistic understanding of the consumer landscape. If two different sources refer to the same product in different ways, it is difficult but essential to recognize that both references point to the same product. We’ve devoted significant resources to solving this problem, known as entity resolution, at scale, and have spent a great deal of time normalizing data so that all the information on a brand that we think is meaningful is linked seamlessly and lends itself to analysis.
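To make the entity resolution idea concrete, here is a minimal sketch of one common approach: normalize each name, score pairs of names for similarity, and group the records that match. This is an illustration only, not Helio’s actual method. The brand names, the noise-token list, the 0.85 threshold, and the function names are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical filler tokens to ignore; a real list would be tuned to the data.
NOISE_TOKENS = {"the", "inc", "llc", "co", "company", "brand"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, drop noise tokens, and sort what remains."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(t for t in tokens if t not in NOISE_TOKENS))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the normalized forms of two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve(records, threshold=0.85):
    """Group records whose names are similar enough, using union-find."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i]["name"], records[j]["name"]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


# Made-up records from two hypothetical sources.
records = [
    {"source": "retailer_feed", "name": "Acme Shampoo, Inc."},
    {"source": "social", "name": "ACME shampoo"},
    {"source": "retailer_feed", "name": "Bolt Energy Bars"},
]
groups = resolve(records)
```

In this toy example the first two records end up in one group and the third stands alone. Comparing every pair of records is only workable for small inputs; at the scale described above, a system would presumably need to narrow the candidate pairs first and draw on more signals than the name alone.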
How Helio helps CircleUp
We track and evaluate almost every consumer and retail company in North America. Helio proactively finds new companies every day across an array of sources. We don’t use “lists” of companies; the algorithms find new ones. The data that Helio collects on each of those companies feeds into our predictive, interpretable models. These models evaluate companies on dimensions that align with how investors evaluate consumer and retail businesses, but they are driven by facts instead of opinion. Rather than only developing investment themes and seeking companies aligned with those themes, we apply empirical data science to identify companies and segments of the market that have significant growth opportunity. We’ve found that most investment firms use hypotheses that haven’t been tested empirically, but with Helio we are able to use empirical evidence and statistically identified patterns.
Helio allows us to identify a company’s strengths and opportunity areas in a way that is easily understandable by humans. The interpretability of the algorithms allows us to expand into prescriptive analytics and prescribe actions that can increase a company’s probability of success across a range of different areas.
Helio has countless use cases – it can be used by investors, retailers, branding firms, data providers, and, of course, entrepreneurs themselves. Internally at CircleUp, Helio powers our marketplace, our credit platform, and our equity funds. Helio isn’t perfect, but perfection isn’t our goal. Our goal is to build a technology that helps entrepreneurs thrive by providing the capital and resources they need. We are enormously proud of the progress we’ve made on Helio and excited by the direction we’re heading.