Summarization

References and useful resources

Methods

  1. Extractive - directly copies salient sentences from the source document and combines them to form the output (a minimal sketch follows this list).
  2. Abstractive - imitates a human reader: it comprehends the source document and writes a summary based on its salient concepts.
  3. Hybrid - attempts to combine the best of both approaches by rewriting a summary based on a subset of salient content extracted from the source document.
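
To make the distinction concrete, here is a minimal extractive sketch that scores sentences by word frequency and keeps the top ones; an abstractive method would instead generate new sentences (e.g. with an LLM). This is purely illustrative, not a production scorer.

```python
import re
from collections import Counter

def extractive_summary(text: str, k: int = 3) -> str:
    # Split into sentences and score each one by the corpus frequency of its words
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    top = set(sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True,
    )[:k])
    # Keep the selected sentences in their original order
    return " ".join(s for s in sentences if s in top)
```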

We don't handle short and long documents the same way. As the amount of text grows, its essence becomes harder to capture, and making sure the summary is a good representation of the larger text (books, a large number of small texts merged into one document, ...) becomes challenging.

Algorithms

Map Reduce

summarization_map_reduce.png

  1. split into chunks of a given size
  2. summarize each chunk
  3. merge several chunk summaries (depending on the setting)
  4. summarize the new "super-chunk"
  5. apply recursively until there is one “root” node

⚠️ Splitting the text into chunks with no regard for the logical and structural flow of the text can hurt the summary: some key points span more text than others, so important information may be cut off and lost. On the other hand, the chunk summaries are independent of each other, so the map step can be parallelised to speed up the process.
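
A minimal sketch of the Map Reduce flow in plain Python, assuming a hypothetical `llm_summarize` function that wraps a single LLM summarization call (LangChain offers a ready-made equivalent via `load_summarize_chain(llm, chain_type="map_reduce")`).

```python
from concurrent.futures import ThreadPoolExecutor

def llm_summarize(text: str) -> str:
    raise NotImplementedError  # hypothetical: plug in a single LLM summarization call

def split_into_chunks(text: str, size: int = 4000) -> list[str]:
    # Naive fixed-size split; a structure-aware splitter avoids cutting key points in half
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce_summary(text: str, size: int = 4000, group: int = 5) -> str:
    summaries = split_into_chunks(text, size)
    # Map step: chunk summaries are independent, so run them in parallel
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(llm_summarize, summaries))
    # Reduce step: merge groups of summaries into "super-chunks" and recurse until one root remains
    while len(summaries) > 1:
        merged = [" ".join(summaries[i:i + group]) for i in range(0, len(summaries), group)]
        with ThreadPoolExecutor() as pool:
            summaries = list(pool.map(llm_summarize, merged))
    return summaries[0]
```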

Refine

summarization_refine.png

  1. split into chunks (not necessarily of the same size)
  2. summarize the 1st non-summarized chunk
  3. merge the summary with the next non-summarized chunk
  4. summarize the new "super-chunk"
  5. apply sequentially until the final chunk so we end-up with one “root” node

⚠️ The sequential nature of this approach means that it cannot be parallelised and takes far longer than the recursive Map Reduce method. Also, research suggests that meaning from the initial parts may be over-represented in the final summary.
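
A minimal sketch of the Refine loop, assuming a hypothetical `llm_refine` call that rewrites the running summary in light of the next chunk (LangChain exposes the same pattern via `chain_type="refine"`).

```python
def llm_refine(current_summary: str, next_chunk: str) -> str:
    raise NotImplementedError  # hypothetical: one LLM call that updates the summary with the new chunk

def refine_summary(chunks: list[str]) -> str:
    summary = ""  # the first call effectively just summarizes the first chunk
    for chunk in chunks:
        # Strictly sequential: each step depends on the previous summary
        summary = llm_refine(summary, chunk)
    return summary
```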

Approaches

Summarizing long docs (with connecting ideas)

  1. split into chunks (not necessarily of the same size)
  2. get Titles and Summaries for each chunk
  3. get embeddings for chunks
  4. Topic Modelling: group similar embedding together
  5. Topic Modelling: detect Topics from the chunks
  6. get Titles and Summaries for each Topic using their associated chunks
  7. optionally apply Map Reduce or Refine methods on the Topic summaries so we end-up with one "root" node

This approach produces a hierarchical summary whose topic level captures the semantics of the document and retains the essential information.
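
A sketch of the embedding and topic-grouping steps, assuming a hypothetical `embed` function and reusing the `llm_summarize` helper from the Map Reduce sketch; KMeans is just one simple clustering choice (HDBSCAN or BERTopic are common alternatives).

```python
from sklearn.cluster import KMeans

def embed(texts: list[str]) -> list[list[float]]:
    raise NotImplementedError  # hypothetical: return one embedding vector per chunk

def topic_summaries(chunks: list[str], n_topics: int = 5) -> list[str]:
    # Group chunks whose embeddings are similar, one cluster per topic
    labels = KMeans(n_clusters=n_topics).fit_predict(embed(chunks))
    summaries = []
    for topic in range(n_topics):
        topic_chunks = [c for c, label in zip(chunks, labels) if label == topic]
        # Per-topic titles/summaries reuse the hypothetical llm_summarize call
        summaries.append(llm_summarize(" ".join(topic_chunks)))
    return summaries  # optionally feed these into Map Reduce or Refine for a single root summary
```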

Summarizing a large number of small docs (reviews, posts, product descriptions, news articles)

These documents have no hierarchy or structural flow, so we need to carefully define what we want to capture as the "essence" of the collection and what strategy to apply to reach this goal.

summarization_many_small_docs.png

  1. Sample by keeping highly weighted texts (based on indicators such as "X people found this review helpful" or "X people shared this article") or by taking strata amongst the texts (e.g. good, bad, neutral...)
  2. then apply Map Reduce or Refine...
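
A sketch of the sampling step, assuming each doc is a dict with hypothetical "text", "helpful_votes" and "sentiment" fields; the selected texts are then concatenated and fed to Map Reduce or Refine.

```python
def sample_by_weight(docs: list[dict], top_k: int = 50) -> list[str]:
    # Keep the texts that other users found most useful or shared the most
    ranked = sorted(docs, key=lambda d: d.get("helpful_votes", 0), reverse=True)
    return [d["text"] for d in ranked[:top_k]]

def sample_by_strata(docs: list[dict], per_stratum: int = 20) -> list[str]:
    # Take an equal slice from each sentiment bucket (good, bad, neutral, ...)
    strata: dict[str, list[str]] = {}
    for d in docs:
        strata.setdefault(d.get("sentiment", "neutral"), []).append(d["text"])
    return [text for texts in strata.values() for text in texts[:per_stratum]]
```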

Summarizing a small number of docs when it fits in the context window

If the text fits into the context window, the best strategy might be to send everything at once. No need for fancy approaches, unless we want to remove titles, sub-titles, etc., or cut costs.

summarization_small_number_of_docs.png

  1. split into chunks or not (it depends on whether the chunks are needed for other use-cases)
  2. summarize the whole text (the sum of chunks, if chunked)

⚠️ In this case we can use StuffDocumentsChain to summarize everything in a single call (as long as it fits within the context size). It is very fast compared to Map Reduce or Refine.
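
A minimal sketch of the stuff strategy via LangChain's summarize chain, which wraps StuffDocumentsChain; import paths and call style vary across LangChain versions, and `llm` is assumed to be any LLM or chat model object.

```python
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

def stuff_summary(llm, texts: list[str]) -> str:
    docs = [Document(page_content=t) for t in texts]
    # "stuff" puts everything into a single prompt, so the concatenated
    # docs must fit in the model's context window
    chain = load_summarize_chain(llm, chain_type="stuff")
    return chain.run(docs)
```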

Summarizing a small number of docs when it DOESN'T fit in the context window

summarization_few_large_docs.png

  1. split into chunks (not necessarily of the same size)
  2. summarize each document (directly if it fits the context window, or with Map Reduce, Refine or a similar algorithm otherwise)
  3. merge the summaries
  4. summarize the new "super-chunk"
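
A sketch tying these steps together, reusing the hypothetical `llm_summarize` and `map_reduce_summary` helpers from the earlier sketches; the character-count check stands in for a proper token count.

```python
def summarize_collection(docs: list[str], context_limit: int = 12000) -> str:
    # Summarize each document directly if it fits, otherwise via Map Reduce
    per_doc = [
        llm_summarize(d) if len(d) <= context_limit else map_reduce_summary(d)
        for d in docs
    ]
    # Merge the per-document summaries into one "super-chunk" and summarize it
    return llm_summarize("\n\n".join(per_doc))
```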