Take five brands that share an identical name – e.g. Hum – spanning skincare, electronics, restaurant supply, spirits, and beauty supplements. How do we, as humans, distinguish between them and determine whether they are distinct entities or duplicates?
More interestingly, how would we recreate this process to scale across millions of companies? The challenge is massive – there is no standardized or unique identifier for private companies or the brands that roll up into those companies. The public markets benefit from unique tickers for trading. At CircleUp we have taken on the task of assigning unique identifiers in the private markets. The difference is, there are 6,000,000 private companies in the US alone – quite a few more than the 4,000 public companies. Additionally, we are dealing with messy data compared to the standardized metrics on public companies.
The implications of this task are tremendous. Only with a complete picture of the private company landscape can we build out a true competitive landscape and identify specific points of product differentiation. Most data sets are not representative of the broader landscape because they exclude emerging brands and skew towards larger, more established names and the corresponding big retailers. We are changing that.
CircleUp’s Growth Partners saw reasons to lead an investment in HUM Nutrition. This wouldn’t have happened without Helio.
Without standardized identifiers (e.g. tickers in the public markets, social security numbers for individuals), we are left with brand identity ambiguity, which we tackle head on.
We use unstructured text to link brands and split brands (e.g. if one Hum is associated with a higher frequency of “wine” and another with “face,” we split them), then assign each a unique identifier. Perfection is aspirational in these ambiguous areas. Today we’re using TF-IDF, taking the dot product of two TF-IDF matrices to generate “similarity” scores. We are moving towards a deep learning (DL) recurrent neural network (RNN) architecture that uses an attention block to read and classify text. This approach can learn sequential patterns in text – TF-IDF alone will not capture the subtlety that “chocolate color paint” is “paint” and not “chocolate.”
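The link/split decision above boils down to comparing TF-IDF vectors. Here is a minimal, self-contained sketch of that idea – the brand descriptions are invented, and the real system works over sparse matrices rather than dicts – showing that two “wine” Hums score closer to each other than to a skincare Hum:

```python
import math
from collections import Counter

# Hypothetical text associated with three brand records named "hum".
docs = {
    "hum_a": "hum wine tasting wine bar wine club",
    "hum_b": "hum face serum face cream skin",
    "hum_c": "hum wine cellar vineyard wine",
}

def tfidf_vectors(corpus):
    """Compute L2-normalized TF-IDF vectors for each document."""
    tokenized = {name: text.split() for name, text in corpus.items()}
    n_docs = len(tokenized)
    df = Counter()                      # document frequency of each term
    for tokens in tokenized.values():
        df.update(set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vec = {t: (c / len(tokens)) * math.log(n_docs / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors[name] = {t: v / norm for t, v in vec.items()}
    return vectors

def similarity(u, v):
    """Dot product of two normalized sparse vectors (cosine similarity)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

vecs = tfidf_vectors(docs)
# "hum" appears everywhere, so its IDF weight is zero; "wine" is what
# links hum_a and hum_c, so they score higher together than with hum_b.
print(similarity(vecs["hum_a"], vecs["hum_c"]) >
      similarity(vecs["hum_a"], vecs["hum_b"]))   # → True
```

A threshold on these scores then decides whether two records merge under one identifier or split into separate ones.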
We normalize all of our data, which lets us ingest data from a new source without having to manually resolve it against hundreds of other sources – we literally “plug in” new sources to our system then re-calibrate the weightings of each data set.
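The “plug in a new source” idea can be sketched as a shared schema plus one adapter per source; every field name, source, and record below is hypothetical, and the real pipeline is far richer than this:

```python
# Shared schema every source is normalized into; downstream code only
# ever sees these fields, so adding a source never touches it.
COMMON_FIELDS = ("brand_name", "category", "description")

def normalize(record, adapter):
    """Map a raw source record into the common schema via its adapter."""
    return {field: adapter(record, field) for field in COMMON_FIELDS}

# Adapter for a hypothetical retailer feed whose keys differ from ours.
def retailer_adapter(record, field):
    mapping = {"brand_name": "vendor", "category": "dept", "description": "desc"}
    return record.get(mapping[field], "").strip().lower()

raw = {"vendor": " HUM Nutrition ", "dept": "Supplements", "desc": "Beauty vitamins"}
print(normalize(raw, retailer_adapter))
# → {'brand_name': 'hum nutrition', 'category': 'supplements', 'description': 'beauty vitamins'}
```

Because each source only needs its own adapter, onboarding a new data set is additive – nothing downstream has to be manually resolved against the hundreds of existing sources.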
The system makes heavy use of distributed computation via Apache Spark, and the model trains in 5–10 minutes from scratch. By utilizing a custom version of distributed randomized hyperparameter search across 40 machines, we save hours (and sometimes days) per run. This speed of iteration lets the engineering and data science teams refine the algorithm far faster.
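At its core, randomized hyperparameter search just samples configurations and keeps the best-scoring one – each trial is independent, which is why it distributes so well over Spark. A minimal single-machine sketch, with an invented objective and invented parameter names standing in for the real model:

```python
import random

def evaluate(params):
    """Toy stand-in for the real validation score of a trained model.
    Here, a similarity threshold near 0.35 and small n-grams score best."""
    return (-((params["sim_threshold"] - 0.35) ** 2)
            - 0.01 * abs(params["max_ngram"] - 2))

def sample_params(rng):
    """Draw one random configuration from the search space."""
    return {
        "sim_threshold": rng.uniform(0.1, 0.9),
        "max_ngram": rng.randint(1, 4),
    }

def randomized_search(n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best.
    Each trial is independent ("embarrassingly parallel"), so in a Spark
    setting the configs could be farmed out across the cluster instead
    of scored in this local loop."""
    rng = random.Random(seed)
    configs = [sample_params(rng) for _ in range(n_trials)]
    scored = [(evaluate(p), p) for p in configs]
    return max(scored, key=lambda t: t[0])

best_score, best_params = randomized_search(200)
print(best_params)
```

With 200 trials the best threshold lands close to the optimum; the point is that throwing 40 machines at independent trials shrinks wall-clock time roughly linearly, which is where the hours-to-minutes savings come from.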