If you’ve ever searched Google for a celebrity, a famous landmark, or a product, you’ve likely encountered the infoboxes that sometimes sit to the right of the results. They’re filled with information from Google’s Knowledge Graph, an entity database used to enhance search results on the web and in smart speakers like Google Home. Most of the Knowledge Graph’s more than 1.6 billion facts are crowdsourced from human teams, who regularly comb through millions of websites for answers to common questions about people, places, and things.
But if you ask Mike Tung, there’s a better way to do it.
He’s the founder of Diffbot, a Mountain View, California-based startup whose mission is to convert the web’s unstructured data into structured data — or, as Tung put it, “extracting knowledge in an automated way from documents.” Diffbot is publicly launching this week after a years-long private pilot program.
“We’re trying to build the first comprehensive map of human knowledge … by analyzing every page on the internet,” Tung told VentureBeat in a phone interview.
It’s a lofty goal, but Diffbot, which grew out of Tung’s artificial intelligence (AI) work at Stanford, spent five years building the tools necessary to accomplish it. Leveraging a combination of computer vision and natural language processing, Diffbot’s web crawler can parse the layout and structure of virtually any webpage — about 90 percent of the web and 20 or so page types, Tung claims — for facts, figures, and abstract relationships between objects. (Typical examples include a product page on Amazon.com or an executive bio on a company’s webpage.)
“We call it knowledge-as-a-service,” Tung said. “Right now, 30 percent of a knowledge worker’s job is data gathering. There’s a big opportunity in the market for a horizontal knowledge graph: a database of information about people, businesses, and things.”
Data extracted by Diffbot’s crawler feeds into an enormous database called the Diffbot Knowledge Graph, or DKG, comprising more than a trillion facts and 10 billion entities. (Tung said it’s adding facts at a rate of 130 million per month.) Core categories include people (skills, employment history, education, social profile), companies, locations (mapping data, addresses, business types, zoning information), articles (every news article, dateline, byline from anywhere on the web, in any language), discussions (chats, social sharing, and conversations), and images (organized using image recognition and metadata collection).
All of this is accessible via API calls and manipulable with Diffbot DQL, the company’s custom query syntax. Clients can view results from the DKG in a list, map, or table layout in Diffbot’s web-based UI, or from within third-party content management systems or analytics platforms.
Among those clients are Microsoft, eBay, Yandex, and DuckDuckGo, which are using it to enhance the quality of their search results. Other customers include Cisco, Salesforce, Crunchbase, Hubspot, Adobe, Instapaper, and Onswipe.
“Simply put, Diffbot is using the power of AI on a scale we’ve never seen before,” said Aydin Senkut, founder and managing director of Felicis Ventures, one of Diffbot’s investors. “It’s the first profitable AI company on record; they are the ‘secret ingredient’ powering applications from many of the largest companies in tech.”
In a demo, Tung showed me how it worked. Say you wanted to perform a one-off search for a brand of shoe. In Diffbot’s web dashboard, you’d type the sneaker brand into a Google-like search bar and hit enter; within milliseconds you’d get a product profile synthesized from sources around the web.
Looking for news articles instead? Same process: Typing in an author’s name yields every article they’ve ever published online (across languages, too). Searching for a person, on the other hand, pulls up a CV-like work history pieced together from dozens (or hundreds) of bios, articles, and publicly available profiles.
One of Diffbot’s unique strengths is its ability to quickly drill down by entity, Tung explained. That helps in tasks like job recruitment: the appropriate DQL string (e.g., “type:Person employments.employer.name:'Diffbot'”) can collate every employee at a given company, along with their job titles, skills, educational backgrounds, and social media profiles, all in one place.
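To make the shape of that query concrete, here is a minimal Python sketch that assembles a DQL-style string from a type and a set of field filters, then URL-encodes it so it could travel as a request parameter. The helper name `build_dql` is hypothetical, and only the `field:'value'` pattern comes from the example quoted above. No Diffbot endpoint or client library is assumed, and values containing quote characters are not escaped.

```python
from urllib.parse import quote


def build_dql(entity_type, filters):
    """Assemble a DQL-style query string (hypothetical helper).

    entity_type: the entity type to match, e.g. "Person".
    filters: dict mapping a dotted field path to the value it must equal.
    """
    parts = [f"type:{entity_type}"]
    for path, value in filters.items():
        # Mirror the field:'value' pattern from the quoted example.
        parts.append(f"{path}:'{value}'")
    return " ".join(parts)


query = build_dql("Person", {"employments.employer.name": "Diffbot"})
print(query)

# Percent-encode the query so it is safe to place in a URL parameter.
encoded = quote(query)
print(encoded)
```

Keeping the filters in a dictionary means a second condition can be added without touching the string-building logic.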
“This is the holy grail of machine learning — capturing all the world’s knowledge in one place,” Tung said.
Google’s Knowledge Graph has historically faced criticism for lacking attribution and omitting sources of conflicting information, but Tung said that Diffbot’s automated approach kills two birds with one stone. Not only is Diffbot more comprehensive than manually curated databases like Google’s Knowledge Graph, but it’s more accurate, too — Diffbot’s crawler regularly refreshes the DKG with new information and its machine learning algorithms are smart enough to pass over sites with histories of producing “logically inconsistent” facts.
“That’s one of the reasons why we fuse information together from different sources,” Tung said. “Our scale is such that there’s minimal potential for errors. We’d bet the business on it.”
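One way to picture that kind of fusion is reliability-weighted voting, where each source’s claim counts in proportion to how trustworthy the source has been. The Python sketch below illustrates the general idea only. The article does not describe Diffbot’s actual algorithm, and the function name, source names, weights, and facts here are all invented for the example.

```python
from collections import defaultdict


def fuse_facts(claims, reliability, default_weight=0.5):
    """Pick one value per (entity, attribute) by weighted voting.

    claims: iterable of (source, entity, attribute, value) tuples.
    reliability: dict mapping source -> weight between 0 and 1
                 (hypothetical scores, not a real metric).
    """
    votes = defaultdict(lambda: defaultdict(float))
    for source, entity, attribute, value in claims:
        # Sources with no score fall back to a neutral default weight.
        votes[(entity, attribute)][value] += reliability.get(source, default_weight)
    # For each fact, keep the value with the highest total weight.
    return {key: max(values, key=values.get) for key, values in votes.items()}


# Made-up claims about a made-up company.
claims = [
    ("site-a.example", "ExampleCo", "headquarters", "Mountain View"),
    ("site-b.example", "ExampleCo", "headquarters", "Mountain View"),
    ("site-c.example", "ExampleCo", "headquarters", "Palo Alto"),
]
reliability = {"site-a.example": 0.9, "site-b.example": 0.8, "site-c.example": 0.2}

fused = fuse_facts(claims, reliability)
print(fused)
```

In this toy run, the two higher-weighted sources agree, so their value wins over the low-weighted outlier. Setting a source’s weight to zero approximates passing over that site entirely.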
Diffbot launched in 2008 and counts 28 employees among its core staff of engineers and data scientists. It previously raised $10 million in a funding round led by Tencent, Felicis Ventures, and Amplify Ventures.