Summarizing text is a task at which machine learning algorithms are improving, as evidenced by a recent paper published by Microsoft. That’s good news — automatic summarization systems promise to cut down on the amount of message-reading enterprise workers do, which one survey estimates amounts to 2.6 hours each day.
Not to be outdone, a Google Brain and Imperial College London team built a system — Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence, or Pegasus — that leverages Google’s Transformers architecture combined with pretraining objectives tailored for abstractive text generation. They say it achieves state-of-the-art results in 12 summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills, and that it shows “surprising” performance on low-resource summarization, surpassing previous top results on six data sets with only 1,000 examples.
As the researchers point out, text summarization aims to generate accurate and concise summaries from input documents, in contrast to executive techniques. Rather than merely copy fragments from the input, abstractive summarization might produce novel words or cover principal information such that the output remains linguistically fluent.
Transformers are a type of neural architecture introduced in a paper by researchers at Google Brain, Google’s AI research division. As do all deep neural networks, they contain functions (neurons) arranged in interconnected layers that transmit signals from input data and slowly adjust the synaptic strength (weights) of each connection — that’s how all AI models extract features and learn to make predictions. But Transformers uniquely have attention. Every output element is connected to every input element, and the weightings between them are calculated dynamically.
The team devised a training task in which whole, and putatively important, sentences within documents were masked. The AI had to fill in the gaps by drawing on web and news articles, including those contained within a new corpus (HugeNews) the researchers compiled.
In experiments, the team selected their best-performing Pegasus model — one with 568 million parameters, or variables learned from historical data — trained on either 750GB of text extracted from 350 million web pages (Common Crawl) or on HugeNews, which spans 1.5 billion articles totaling 3.8TB collected from news and news-like websites. (The researchers say that in the case of HugeNews, a whitelist of domains ranging from high-quality news publishers to lower-quality sites was used to seed a web-crawling tool.)
Pegasus achieved high linguistic quality in terms of fluency and coherence, according to the researchers, and it didn’t require countermeasures to mitigate disfluencies. Moreover, in a low-resource setting with just 100 example articles, it generated summaries at a quality comparable to a model that had been trained on a full data set ranging from 20,000 to 200,000 articles.