Data gives AI startups a defensive moat: The more data the startup collects to train an AI model, the better that model will perform, making it difficult for a new entrant to catch up. That data does not come for free, however, and many AI startups see their margins eroded by this additional cost. You might hope to spend less on data as your models improve over time, but it’s unclear how to predict when that will happen and to what degree, making it difficult to model your future growth.
Unlike software startups, where product development is buried under research and development (R&D) costs in the P&L, AI startups should account for data costs as part of the cost of goods sold (COGS). Thinking about data as COGS rather than as R&D will help you identify opportunities to scale up and drive costs down, increasing your margins.
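To make the difference concrete, here is a minimal sketch with hypothetical figures; the numbers are illustrative, not benchmarks:

```python
revenue = 10_000_000              # hypothetical annual revenue ($)
hosting_and_support = 1_500_000   # classic software COGS
data_costs = 2_500_000            # acquisition + storage + annotation

# Data buried in R&D: gross margin looks like a software business.
margin_rd = (revenue - hosting_and_support) / revenue                  # 0.85

# Data counted as COGS: the true per-unit economics surface.
margin_cogs = (revenue - hosting_and_support - data_costs) / revenue   # 0.60

print(f"data as R&D:  {margin_rd:.0%}")    # 85%
print(f"data as COGS: {margin_cogs:.0%}")  # 60%
```

The business is the same in both views; only the second one tells you how much each incremental customer actually costs to serve.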
The Data Value Chain flow chart below shows how most AI startups acquire and use data. First, you record snippets of ground truth as raw data. You store that raw data somewhere and establish processes or pipelines to maintain and access it. Before you can use it in an AI model, you need to annotate the data so the model knows what to do with each data point. The trained model then takes in the data and returns a recommendation, which you can use to take an action that drives some kind of outcome for the end user. This process breaks down into three distinct steps, each of which incurs a cost: acquiring the data, storing the data, and annotating the data to train the model.
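For readers who think in code, here is a schematic sketch of those three steps; the stage names and types are illustrative, not a prescribed implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RawRecord:
    payload: dict        # a recorded snippet of ground truth

@dataclass
class AnnotatedRecord:
    payload: dict
    label: str           # what the model should learn from this data point

def acquire(sensor: Callable[[], dict]) -> RawRecord:
    """Step 1: a sensor (device or human) captures an observation."""
    return RawRecord(payload=sensor())                  # cost: acquisition

def store(record: RawRecord, warehouse: List[RawRecord]) -> None:
    """Step 2: persist the raw data for maintenance and access."""
    warehouse.append(record)                            # cost: storage and management

def annotate(record: RawRecord, labeler: Callable[[dict], str]) -> AnnotatedRecord:
    """Step 3: label the data so a model can be trained on it."""
    return AnnotatedRecord(record.payload, labeler(record.payload))  # cost: annotation
```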
Cost of data acquisition
In all data value chains, some kind of sensor (either a physical device or a human being) first collects raw data by capturing observations of reality. Data acquisition costs therefore come from creating, distributing, and operating that sensor. If the sensor is a piece of hardware, you must consider the cost of materials and manufacturing; if the sensor is a human, the costs come from recruiting people and equipping them with the tools to make and record observations. Depending on how broad your coverage needs to be, distributing the sensors may cost a significant amount, and use cases that demand high-frequency collection drive up labor and maintenance costs as well. Audience measurement company Nielsen, for example, faces all of these costs: it both provides the measurement boxes and pays participants to report what they watch on TV. Economies of scale work in its favor, though: per-unit acquisition costs fall as the panel grows, while the data becomes more valuable the more comprehensive its coverage gets.
In some use cases, you may be able to transfer the work and cost of data acquisition to the end user by offering them a tool to manage their workflow (an automatic email response generator, for example) and then storing the data they capture in their work, or by observing their interactions with the tool and recording those as data. If you distribute these tools for free, the cost of data acquisition becomes the cost of customer acquisition. Alternatively, you might charge for the workflow tool; depending on how you price it, this offsets your data acquisition costs but could slow and limit customer adoption, and with it, data collection.
One of my firm’s portfolio companies, InsideSales, for example, offers a platform for sales reps to dial their leads. As the reps use the platform, it records the time, mode, and other metadata about each interaction, as well as whether the lead progresses in the sales pipeline. That data trains an AI model to recommend the best time and mode of communication for contacting similar leads. Here, network effects may increase the usefulness of the tool as more users join the platform, which in turn may drive down user acquisition costs.
Alternatively, securing a strategic partnership with an entity that has already established data collection pipelines may further drive down costs. Another of our companies, Tractable, which applies computer vision to automate the work of an auto insurance adjuster, partners with several leading auto insurers to access images of damaged cars, so it does not have to invest in distributing an app to individual car owners.
Cost of storage and management
Storing and accessing data presents the next set of costs. Beyond the data you collect yourself, you may need customers to provide additional contextual data to enrich your model. Many sectors have only recently begun to digitize, so even if a potential customer has the data you need, don’t assume it will be readily accessible. Before you can use it, you may have to devote significant manpower to low-margin data preparation.
Furthermore, if that data is spread across different systems and silos, you may have to spend a significant amount of time building each integration before the model can be fully functional. Some industries are built around monolithic and idiosyncratic tech stacks, making integrations difficult to reuse across customers. If integration service providers are not available, your AI startup may find itself mired in building custom integrations for every new customer before it can deploy its AI system. The way data is structured might also vary from one customer to the next, requiring AI engineers to spend additional hours normalizing the data or converting it to a standardized schema so the AI model can be applied. Building up a library of common integrations will drive down costs as you reuse them with new customers.
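As a rough illustration, here is what such an integration library might look like in Python; the customer names and field mappings are hypothetical:

```python
# Canonical schema the AI model expects, regardless of customer.
CANONICAL_FIELDS = ("claim_id", "vehicle_make", "damage_photos")

# Per-customer field mappings: built once, reused on every new deployment.
FIELD_MAPS = {
    "customer_a": {"ClaimNo": "claim_id", "Make": "vehicle_make", "Imgs": "damage_photos"},
    "customer_b": {"claim_ref": "claim_id", "mfr": "vehicle_make", "photo_urls": "damage_photos"},
}

def normalize(record: dict, customer: str) -> dict:
    """Convert one customer's idiosyncratic record into the canonical schema."""
    mapping = FIELD_MAPS[customer]
    return {canonical: record[source] for source, canonical in mapping.items()}

# normalize({"claim_ref": "C-17", "mfr": "Toyota", "photo_urls": []}, "customer_b")
# -> {"claim_id": "C-17", "vehicle_make": "Toyota", "damage_photos": []}
```

Each new mapping is cheap to add, and the engineering cost of the `normalize` machinery is paid only once.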
Cost of training
Most approaches to AI model building require that you tag and annotate data, which presents one of the biggest and most variable costs AI startups face. If the examples are straightforward enough that a layperson could perform the annotation (for example, drawing a box around all the apples in a picture), you can use an outsourced labeling service such as Mechanical Turk or Figure8.
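For concreteness, here is an illustrative record of what a completed bounding-box task might return; the format is hypothetical, not any particular vendor’s API:

```python
annotation = {
    "image_id": "orchard_0042.jpg",
    "labels": [
        {"class": "apple", "bbox": [34, 120, 56, 58]},   # [x, y, width, height]
        {"class": "apple", "bbox": [210, 88, 61, 60]},
    ],
}
```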
Sometimes, however, the annotation requires more specialized knowledge and experience, such as judging the quality and ripeness of an apple from visual cues alone, or deciding whether a patch of rust on an oil rig is dangerous. For this more specialized labor, you may have to build an internal expert annotation team and pay them higher wages. Depending on how you do the annotation, you may also have to build your own annotation workflow tools, although companies such as Labelbox are now emerging to offer them.
In some AI applications, the end user is the most effective annotator, and you can offload annotation costs by designing the product so that users label the data as they interact with it. Constructor, a portfolio company of ours that offers AI-powered site search for e-commerce, observes which products users actually click on and purchase for each search term, enabling it to optimize search results for higher sales. This kind of annotation would be impossible to replicate with outsourced or expert annotators, and it saves Constructor what might otherwise be significant annotation costs.
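A minimal sketch of this implicit-annotation pattern, with hypothetical event fields, might look like this:

```python
def impressions_to_examples(events):
    """events: an iterable of dicts logged as users search and shop."""
    for event in events:
        yield {
            "query": event["query"],
            "product_id": event["product_id"],
            # Implicit annotation: the user's own action supplies the label.
            "label": 1 if (event["purchased"] or event["clicked"]) else 0,
        }

log = [
    {"query": "running shoes", "product_id": "sku-81", "clicked": True,  "purchased": False},
    {"query": "running shoes", "product_id": "sku-12", "clicked": False, "purchased": False},
]
training_data = list(impressions_to_examples(log))  # ready for a ranking model
```

The labels arrive as a free byproduct of usage, so the marginal cost of annotation approaches zero as traffic grows.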
Even after you’ve trained your model to high accuracy, you will occasionally need humans to intervene when the model is uncertain about how to interpret a new input. Depending on how the model delivers value to the end user, the user herself may make the correction or annotation, or your startup can handle the exceptions by employing a quality-control “AI babysitter.” In cases where the environment you’re modeling is volatile and changes quickly and regularly, you may want to retain a standing team of annotators to update the model with new data as needed.
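One common way to implement this exception handling is a confidence threshold that routes uncertain predictions to a human reviewer; the sketch below assumes a hypothetical model interface that returns a label and a confidence score:

```python
CONFIDENCE_THRESHOLD = 0.90   # assumed cutoff; tune per application

def predict_with_fallback(model, example, review_queue):
    # `model.predict` is assumed here to return (label, confidence score).
    label, confidence = model.predict(example)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                     # the model handles the common case
    # Uncertain input: route it to the human reviewer (the "AI babysitter"),
    # whose correction can also become new training data.
    review_queue.append(example)
    return None                          # caller falls back to manual handling
```

The size of that review queue over time is a useful proxy for how much steady-state human labor your margins must absorb.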
Scaling AI businesses
The first successful AI businesses came to market offering AI-free workflow tools that captured data, which eventually trained AI models and enhanced the tools’ value. These startups were able to achieve software margins early on because the data and AI were secondary to their value proposition. As we move to more specialized applications of AI, however, the next wave of AI startups will face higher upfront costs and will require more human labor to deliver initial value to customers, making them resemble lower-margin services businesses.
Getting to a critical mass of customers and data will eventually improve the unit economics and build that crucial compounding defensibility, but many startups don’t know exactly how far off that point is or what they need to do to get there faster. The best AI startups will understand which levers optimize that pathway and pull them deliberately to make the right investments and scale quickly.