CircleUp’s machine learning platform, Helio, tracks 1.4M companies and assigns each company a parent category (e.g. Food) and a subcategory (e.g. Cheese Alternatives). The taxonomy is hierarchical and is built from two conditionally dependent multi-class text classifiers: one predicts the parent category, and the other predicts the subcategory given that parent. Companies are assigned to 13 parent categories and >100 subcategories. These category assignments are hugely important for sourcing and evaluation. They allow us to run evaluative models at the category level and define competitive sets of similar companies at similar sizes. Imagine comparing the brand resonance of an infant food to that of a shampoo or granola bar. Not helpful. Instead, we make evaluative comparisons within a single category.
The best ways to optimize these models are to collect a wide range of features from diverse sources and to refresh the training sample regularly to stay on top of drifting signals. Black-box architectures such as support vector machines, gradient-boosted decision trees, and deep learning models work well for this use case. Take the descriptive text shown as an example. We use NLP techniques to assign the parent category (or multiple categories), in this case, Food.
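To make the two-stage idea concrete, here is a minimal sketch of a hierarchical text classifier in which the subcategory model is chosen by the predicted parent category. It is not Helio's implementation: a tiny Naive Bayes model stands in for the black-box models mentioned above, and the class names, the training rows, and every category other than Food and Cheese Alternatives are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Tiny multinomial Naive Bayes with add-one smoothing (stand-in model)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, text):
        # Words never seen in training carry no signal, so drop them.
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.label_counts.values())
        log_scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            log_scores[label] = score
        # Turn log scores into probabilities; subtract the max for stability.
        top = max(log_scores.values())
        exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        return {k: v / norm for k, v in exp_scores.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)


class HierarchicalClassifier:
    """One parent-level model plus one subcategory model per parent."""

    def fit(self, texts, parents, subs):
        self.parent_model = NaiveBayesText().fit(texts, parents)
        self.sub_models = {}
        for parent in set(parents):
            rows = [(t, s) for t, p, s in zip(texts, parents, subs) if p == parent]
            sub_texts, sub_labels = zip(*rows)
            self.sub_models[parent] = NaiveBayesText().fit(sub_texts, sub_labels)
        return self

    def predict(self, text):
        # The subcategory prediction is conditional on the predicted parent.
        parent = self.parent_model.predict(text)
        return parent, self.sub_models[parent].predict(text)


# Hypothetical training rows: (description, parent category, subcategory).
TRAIN = [
    ("cashew based vegan cheese alternative", "Food", "Cheese Alternatives"),
    ("dairy free almond cheese slices", "Food", "Cheese Alternatives"),
    ("crunchy oat granola bar with honey", "Food", "Granola Bars"),
    ("chewy granola bar with nuts", "Food", "Granola Bars"),
    ("sulfate free shampoo for curly hair", "Personal Care", "Shampoo"),
    ("moisturizing shampoo with argan oil", "Personal Care", "Shampoo"),
    ("natural deodorant stick with coconut oil", "Personal Care", "Deodorant"),
    ("aluminum free deodorant spray", "Personal Care", "Deodorant"),
]

texts, parents, subs = zip(*TRAIN)
model = HierarchicalClassifier().fit(texts, parents, subs)
print(model.predict("vegan cheese made from cashew"))
```

Because the subcategory model only ever sees companies from its own parent, a mistake at the parent level cannot be recovered at the subcategory level. That is the practical meaning of the conditional dependence between the two classifiers.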
The classification is probabilistic, and therefore not perfect, but we can quantify the error rate and reduce it over time.
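One common way to quantify an error rate is to compare predictions against a hand-labeled holdout sample and report an interval rather than a single number. The sketch below assumes such a sample exists. The function names and the toy labels are hypothetical, and the Wilson score interval is our choice for illustration, not necessarily what Helio uses.

```python
import math


def error_rate(y_true, y_pred):
    """Share of holdout companies whose predicted category is wrong."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    return errors / len(y_true)


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical holdout sample: true labels vs. model predictions.
y_true = ["Food"] * 6 + ["Personal Care"] * 4
y_pred = ["Food"] * 5 + ["Personal Care"] * 4 + ["Food"]

rate = error_rate(y_true, y_pred)
low, high = wilson_interval(rate, len(y_true))
print(f"error rate {rate:.2f}, interval ({low:.2f}, {high:.2f})")
```

With only ten labeled examples the interval is very wide, which is the point of reporting it: the estimate becomes useful for tracking improvement only as the labeled sample grows.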