Summarization is useful whenever you need to condense a large number of documents into shorter texts. Anyone who has browsed scientific papers knows the value of abstracts. Unfortunately, documents in general don’t share this structure. This article is an overview of some text summarization methods in Python.
What are the types of automatic text summarization?
The primary distinction between text summarization methods is whether they reuse parts of the text itself or can generate new words and sentences. This splits the methods into two groups: extractive and abstractive. Extractive methods create summaries from passages selected from the input text (think of highlighting text in lecture notes), whereas abstractive methods can create sentences that don’t necessarily appear in the summarized text.
Both approaches have their own pros and cons. Extractive methods may run into problems with text that doesn’t contain sentences that represent its meaning well. On the other hand, abstractive summarization is in itself a much harder task because it needs trained models: most of the methods use Sequence to Sequence models, like the ones used in Machine Translation. Those methods also need supervised data, in fact lots of it, and such datasets are scarce. For other ways of grouping summarization methods, see Sparck Jones’ review article.
Approaches for extractive summarization
Position-based methods were historically the first approach. They come from the observation that in scientific articles the most informative sentences tend to be at either the start or the end of the document. Some of these methods are also augmented with word scores calculated from word frequency. For these methods see sumy, specifically the Edmundson and Luhn methods.
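To make the idea of combining sentence position with word frequency concrete, here is a minimal sketch in plain Python. It is not sumy's implementation of the Luhn or Edmundson methods: the stopword list, the naive sentence splitter, the `position_bonus` parameter and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real tools ship larger ones).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "on", "it", "for"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def frequency_position_summary(text, n_sentences=2, position_bonus=0.5):
    """Score sentences by average word frequency, plus a bonus for the
    first and last sentence; return the best ones in document order."""
    sentences = split_sentences(text)
    word_freq = Counter(w for s in sentences for w in tokenize(s))
    last = len(sentences) - 1
    scores = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        score = sum(word_freq[w] for w in words) / len(words) if words else 0.0
        if i in (0, last):
            score += position_bonus  # hypothetical weighting, tune as needed
        scores.append(score)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]
```

Setting `position_bonus=0` reduces this to pure frequency scoring, which makes it easy to see how much the positional prior changes the selection.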
Graph-based methods are the most popular approach to summarization. They consist of building a graph over document units (most methods use sentences as the base units) and then selecting nodes with PageRank. The PageRank algorithm calculates node ‘centrality’ in the graph, which turns out to be useful for measuring the relative information content of sentences. The graph tends to be constructed using Bag of Words features of sentences (typically tf-idf), with edge weights corresponding to the cosine similarity of the sentence representations.
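Gensim’s summarization module uses this approach (note that the module was removed in Gensim 4.0, so it is only available in older versions).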
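The graph construction and ranking described above can be sketched in a few lines of plain Python. This is only an illustration and not Gensim's code: it uses raw word counts rather than tf-idf, a naive sentence splitter and a fixed number of PageRank iterations, and all function names are made up for the example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Raw term counts (a simplification; tf-idf is the more typical choice)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by simple iteration over a dense weight matrix.
    Nodes without outgoing weight pass on no score (kept simple on purpose)."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * ranks[j] for j in range(n) if out_sum[j] > 0
            )
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks
    return ranks


def graph_summary(text, n_sentences=2):
    """Rank sentences by centrality in the similarity graph; keep document order."""
    sentences = split_sentences(text)
    vectors = [bag_of_words(s) for s in sentences]
    n = len(sentences)
    weights = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    ranks = pagerank(weights)
    top = sorted(range(n), key=lambda i: ranks[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the rest of the document ends up with only the baseline score, which is why off-topic sentences rarely make it into the summary.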
Centroid-based methods assume we have a way of representing sentences and documents in some feature space. The centroid is defined as the vector of the whole document. Summary sentences are selected by taking the sentences whose vectors are most similar to the centroid vector.
The cool thing is that this can work with different representations. In fact, this approach was originally suggested for the Bag of Words model (though it used some preprocessing on the centroid), but it was recently revived using word embeddings in the paper Centroid-based Text Summarization through Compositionality of Word Embeddings.
The paper uses word embeddings to represent sentences in the following way: the centroid vector is calculated as the mean of the word embeddings of the most important words (these are selected using tf-idf scores), and sentence embeddings are likewise calculated as the means of the embeddings of the words they contain.
The text-summarizer repository and package contain methods for centroid-based summarization. They support loading word embeddings from gensim.
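As a rough sketch of the embedding-based centroid method, the snippet below averages word vectors and ranks sentences by cosine similarity to the centroid. It is not the text-summarizer package's API. The `embeddings` argument is assumed to be a plain dict mapping words to vectors (in practice these would come from a pretrained model), and the tf-idf step is simplified by treating each sentence as a document and keeping the `top_k` highest-scoring words.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding, else None."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def important_words(sentences, top_k=3):
    """Top-k words by a simplified tf-idf (each sentence counts as a document)."""
    tokenized = [tokenize(s) for s in sentences]
    tf = Counter(w for tokens in tokenized for w in tokens)
    df = Counter(w for tokens in tokenized for w in set(tokens))
    n = len(sentences)
    scores = {w: tf[w] * math.log(1 + n / df[w]) for w in tf}
    # Ties are broken alphabetically so the selection is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def centroid_summary(sentences, embeddings, n_sentences=2, top_k=3):
    """Pick the sentences whose mean embedding is closest to the centroid."""
    centroid = mean_vector(important_words(sentences, top_k), embeddings)
    if centroid is None:
        return sentences[:n_sentences]  # no usable embeddings: fall back to the lead
    scores = []
    for sentence in sentences:
        vector = mean_vector(tokenize(sentence), embeddings)
        scores.append(cosine(centroid, vector) if vector is not None else 0.0)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


# Hypothetical 2-dimensional embeddings, made up purely for this example.
toy_embeddings = {
    "cats": [1.0, 0.0], "purr": [0.9, 0.1], "chase": [0.8, 0.2],
    "mice": [0.9, 0.1], "stocks": [0.0, 1.0], "fell": [0.1, 0.9],
}
sentences = ["Cats purr.", "Cats chase mice.", "Stocks fell."]
summary = centroid_summary(sentences, toy_embeddings, n_sentences=2, top_k=1)
```

Because the representation is just a function from words to vectors, the same code would work with Bag of Words vectors in place of embeddings.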
A few words on abstractive approaches
Unfortunately, there is no easy-to-use abstractive method at the time of writing this article. In fact, there are very few pretrained models for abstractive summarization. Another problem is that these methods tend to use Recurrent Neural Networks or other sequence models and generate predictions word by word using beam search, so they need a lot of computation.
Interestingly, Pointer-Generator Networks, one of the most cited seq2seq-based approaches to summarization, is a hybrid between abstractive and extractive methods: it uses a clever modification of the attention mechanism that enables copying input words. Incidentally, this also helps with out-of-vocabulary words, which are problematic for sequence models.
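The copying idea can be illustrated without any neural network. In a pointer-generator model the output distribution is a mixture: with probability p_gen the model generates a word from its vocabulary distribution, otherwise it copies a source token according to the attention weights. The sketch below only shows that mixing step with hand-written numbers. In a real model the vocabulary distribution, the attention weights and p_gen are all produced by the network, and the function name here is made up.

```python
from collections import defaultdict


def final_distribution(vocab_dist, source_tokens, attention, p_gen):
    """Mix the generation and copy distributions for one decoding step.

    vocab_dist    -- dict word -> probability over the fixed vocabulary
    source_tokens -- tokens of the input text
    attention     -- attention weight for each source token (sums to 1)
    p_gen         -- probability of generating rather than copying
    """
    final = defaultdict(float)
    for word, prob in vocab_dist.items():
        final[word] += p_gen * prob
    # Copying adds attention mass to source tokens, even out-of-vocabulary ones.
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# "zorp" is not in the vocabulary, but it can still be produced by copying.
dist = final_distribution(
    vocab_dist={"the": 0.5, "cat": 0.5},
    source_tokens=["zorp", "cat"],
    attention=[0.75, 0.25],
    p_gen=0.5,
)
```

The out-of-vocabulary token receives probability only through the copy term, which is how the mechanism helps with rare words.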
Text summarization, whether extractive or abstractive, tends to be evaluated using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric. ROUGE relates to the BLEU metric as recall relates to precision: formally, ROUGE-n is the recall of the reference summary’s n-grams among the candidate summary’s n-grams.
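For a single reference summary, ROUGE-n recall can be computed directly from n-gram counts. The following is a minimal sketch under simplifying assumptions (whitespace tokenization, lowercasing, no stemming, one reference); established ROUGE packages handle more details, and the function names below are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate.
    Counts are clipped, so a repeated n-gram is matched at most as many
    times as it occurs in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
rouge_1 = rouge_n_recall(candidate, reference, n=1)  # 5 of 6 unigrams matched
rouge_2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 5 bigrams matched
```

Dividing the same overlap by the number of candidate n-grams instead would give the precision-style quantity that BLEU builds on.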
Interestingly, under the ROUGE metric, positional approaches (such as the Lead method, which just extracts several sentences from the beginning of the document) tend to beat other approaches. This suggests that the ROUGE metric is inadequate, but unfortunately there is no consensus on a metric that would more realistically reflect summary usefulness as judged by humans.
There has been some progress in machine-learning approaches to summarization since 2017, as can be seen by comparing the methods available now with those in Text Summarization in Python: Extractive vs. Abstractive techniques revisited from 2017 (there were no pretrained abstractive models available then). Unfortunately, abstractive methods still run into big problems with generalization, but that might change given the recent NLP trend toward multitask models. For now, extractive methods are the most useful, and if you want to summarize your text in Python, then sumy, gensim, and text-summarizer are the first tools you should check. On the other hand, if you want to take a deeper dive, a more comprehensive list of text summarization methods can be found in Dragomir Radev’s Lectures.
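The Lead baseline is simple enough to write down in full, which is part of why its strong ROUGE scores are notable. A minimal sketch, assuming a naive regex-based sentence splitter and a hypothetical function name:

```python
import re


def lead_summary(text, n_sentences=3):
    """Return the first n sentences of the text as the summary."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return sentences[:n_sentences]
```

A baseline like this is worth running alongside any other method before drawing conclusions from ROUGE scores.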
Written by: Jakub Bartczuk, Senior Data Scientist at Semantive.