Tutorial

Understanding the BART Model for Accurate Text Summarization

Updated on April 25, 2025

Understanding the BART Model for Accurate Text Summarization

Introduction

Self-supervised learning has driven major progress in natural language processing (NLP), allowing models to learn useful representations from large amounts of unlabelled text. Among these approaches, denoising autoencoders—which train models to reconstruct original text after masking out random words—have shown particularly strong results.

By learning to predict missing parts of a sentence, these models develop a deep understanding of grammar, context, and meaning. Recent research has further improved these methods by experimenting with how words are masked, the order in which predictions are made, and the context provided during training. While these improvements have pushed performance even further, many of the resulting models tend to be limited to specific tasks like span prediction or generation, restricting their broader usefulness. The BART model was introduced to address this limitation—offering a more general, flexible approach to self-supervised training that can handle a wide range of NLP tasks with high performance.

Prerequisites

In order to follow along with this article, you will need experience with Python code, and a beginners understanding of Deep Learning. We will operate under the assumption that all readers have access to sufficiently powerful machines, so they can run the code provided.

If you do not have access to a GPU, we suggest using DigitalOcean GPU Droplets.

To help you get started with Python code, we recommend following this beginner’s guide to set up your system, which will prepare you to run beginner tutorials.

What is the BART Transformer Model in NLP?

The BART research paper presents a pre-training technique that integrates Bidirectional and Auto-Regressive Transformers. As a denoising autoencoder employing a sequence-to-sequence framework, BART proves valuable across diverse applications. Its pretraining process involves two stages: first, text is corrupted through a chosen noising function; second, a sequence-to-sequence model is trained to recover the original text.

BART’s Transformer-based neural machine translation architecture can be seen as a generalization of BERT (due to the bidirectional encoder), GPT (With the left-to-right decoder), and many other contemporary pre-training approaches.

Source

In addition to its strength in comprehension tasks, BART’s effectiveness increases with fine-tuning for text generation. It generates new state-of-the-art results on various abstractive conversation, question answering, and summarization tasks, matching the performance of RoBERTa with comparable training resources on GLUE and SQuAD.

Architecture

Except changing the ReLU activation functions to GeLUs and initializing parameters from (0, 0.02), BART follows the general sequence-to-sequence Transformer design (Vaswani et al., 2017). There are six layers in the encoder and decoder for the base model and twelve layers in each for the large model.

Similar to the architecture used in BERT, the two main differences are that (1) in BERT, each layer of the decoder additionally performs cross-attention over the final hidden layer of the encoder (as in the transformer sequence-to-sequence model); and (2) in BERT, an additional feed-forward network is used before word prediction, whereas in BART, there isn’t.

Pre-training BART

To train BART, we first corrupt documents and then optimize a reconstruction loss, which is the cross-entropy between the decoder’s output and the original document. In contrast to conventional denoising autoencoders, BART may be used for any type of document corruption.

The worst-case scenario for BART is when all source information is lost, which becomes analogous to a language model. The researchers try out several new and old transformations, but they also believe there is much room for creating even more unique alternatives.
In the following, we will outline the transformations they performed and provide some examples. Below is a summary of the transformations they used, and an illustration of some of the results is provided in the figure.

Token Masking: Following BERT, random tokens are sampled and replaced with MASK elements.
Token Deletion: Random tokens are deleted from the input. In contrast to token masking, the model must predict which positions are missing inputs.
Text Infilling: Several text spans are sampled, with span lengths drawn from a Poisson distribution (λ = 3). Each span is replaced with a single MASK token. Text infilling teaches the model to predict how many tokens are missing from a span.
Sentence Permutation: A document is divided into sentences based on full stops, and these sentences are shuffled in random order.
Document Rotation: A token is chosen uniformly at random, and the document is rotated to begin with that token. This task trains the model to identify the start of the document.

Source

Fine-tuning BART

Several potential uses for the representations BART generates in subsequent processing steps exist:

Sequence Classification Tasks: For sequence classification problems, the same input is supplied to the encoder and decoder, and the final hidden states of the last decoder token are fed into the new multi-class linear classifier.

Source

Token Classification Tasks: Both the encoder and decoder take the entire document as input, and from the decoder’s top hidden state, a representation of each word is derived. The token’s classification relies on its representation.
Sequence Generation Tasks: For sequence-generating tasks like answering abstract questions and summarizing text, BART’s autoregressive decoder allows for direct fine-tuning. Both of these tasks are related to the pre-training goal of denoising since they involve the copying and subsequent manipulation of input data. Here, the input sequence serves as input to the encoder, while the decoder generates outputs in an autoregressive manner.
Machine Translation: The researchers investigate the feasibility of using BART to enhance machine translation decoders for translating into English. Using pre-trained encoders has been proven to improve models, while the benefits of incorporating pre-trained language models into decoders have been more limited. Using a set of encoder parameters learned from bitext, they demonstrate that the entire BART model can be used as a single pretrained decoder for machine translation.

More specifically, they swap out the embedding layer of BART’s encoder with a brand new encoder using random initialization. When the model is trained from start to end, the new encoder is trained to map foreign words into an input that BART can then translate into English. In both stages of training, the cross-entropy loss is backpropagated from the BART model’s output to train the source encoder.

In the first stage, they fix most of BART’s parameters and only update the randomly initialized source encoder, the BART positional embeddings, and the self-attention input projection matrix of BART’s encoder first layer. Second, they perform a limited number of training iterations on all model parameters.

Source

BART Model for Text Summarization

It takes much time for a researcher or journalist to sift through all the long-form information on the internet and find what they need. You can save time and energy by skimming the highlights of lengthy literature using a summary or paraphrase synopsis.

The NLP task of summarizing texts may be automated with the help of transformer models. Extractive and abstractive techniques exist to achieve this goal. Summarizing a document extractively involves finding the most critical statements in the text and writing them down.

One may classify this as a type of information retrieval. More challenging than literal summarizing is abstract summarization, which seeks to grasp the whole material and provide a paraphrased text to sum up the key points. The second type of summary is carried out by transformer models such as BART.

HuggingFace gives us quick and easy access to thousands of pre-trained and fine-tuned weights for Transformer models, including BART. You can choose a tailored BART model for the text summarization assignment from the HuggingFace model explorer. Each submitted model includes a detailed description of its configuration and training.

The beginner-friendly bart-large-cnn model deserves a look, so let’s look at it. Either use the HuggingFace Installation page or run pip install transformers to get started. Next, we’ll follow these three easy steps to create our summary:

from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

The Transformers model pipeline should be loaded first. A module in the pipeline is defined by naming the task and the model. The term “summarization” is used, and the model is referred to as “facebook/bart-large-xsum.” If we want to attempt something different than the standard news dataset, we can use the Extreme Summary (XSum) dataset. The model was trained to generate one-sentence summaries exclusively.

The last step is constructing an input sequence and putting it through its paces using the summarizer() pipeline. In terms of tokens, the summary length can also be adjusted using the function’s optional max_length and min_length arguments.

from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

Output:

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]

Another option is to use BartTokenizer to generate tokens from text sequences and BartForConditionalGeneration for summarizing.

# Importing the model
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

# As a pre-trained model, " bart-large-cnn" is optimized for the summary job.  
# The **from_pretrained()** function is used to load the model, as seen below.

# Tokenizer and model loading for bart-large-cnn  tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

Assume you have to summarize the same text as in the example above. You can make use of the tokenizer’s batch_encode_plus() feature for this purpose. When called, this method produces a dictionary that stores the encoded sequence or sequence pair and any other information provided.
How can we restrict the shortest possible sequence that can be returned?

In batch_encode_plus(), set the value of the max_length parameter. To get the IDs of the summary output, we feed the input_ids into the model.generate() function.

# Transmitting the encoded inputs to the model.generate() function
inputs = tokenizer.batch_encode_plus([ARTICLE],return_tensors='pt')
summary_ids =  model.generate(inputs['input_ids'], num_beams=4, max_length=150, early_stopping=True)

The summary of the original text has been generated as a sequence of IDs by the model.generate() method. The function model.generate() has many parameters, among which are:

input_ids: The sequence used as a prompt for the generation.
max_length: The max length of the sequence to be generated. Between min_length and infinity. Default to 20.
min_length: The minimum length of the sequence to be generated. Between 0 and infinity. Default to 0.
num_beams: Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.
early_stopping: if set to True, beam search is stopped when at least num_beams sentences are finished per batch.

The decode() function can be used to transform the ID sequence into plain text.

# Decoding and printing the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

The decode() converts a list of token IDs into a list of strings. It accepts several parameters, among which we will mention two of them:

token_ids: List of tokenized input IDs.
skip_special_tokens: Whether or not to remove special tokens in the decoding.

As a result, we get this:

Liana Barrientos, 39, is charged with two counts of offering a false instrument for filing in the first degree. In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. At one time, she was married to eight men at once, prosecutors say.

Summarizing Documents with BART using ktrain

Ktrain is a Python package that reduces the amount of code required to implement machine learning. Wrapping TensorFlow and other libraries, it aims to make cutting-edge ML models accessible to non-experts while satisfying the needs of experts in the field. With ktrain’s streamlined interface, you can handle a wide variety of problems with as little as three or four “commands” or lines of code, regardless of whether the data being worked with is textual, visual, graphical, or tabular.

Using a pretrained BART model from the transformers library, ktrain can summarize text. First, we’ll create a TransformerSummarizer instance to perform the actual summarizing. (Please note that the installation of PyTorch is necessary to use this function.)

from ktrain.text.summarization import TransformerSummarizer
ts = TransformerSummarizer()

Let’s go ahead and write up an article:

article = """ Saturn orbiter and Titan atmosphere probe. Cassini is a joint NASA/ESA project designed to accomplish an exploration of the Saturnian system with its Cassini Saturn Orbiter and Huygens Titan Probe. Cassini is scheduled for launch aboard a Titan IV/Centaur in October of 1997.After gravity assists of Venus, Earth and Jupiter in a VVEJGA trajectory, the spacecraft will arrive at Saturn in June of 2004. Upon arrival, the Cassini spacecraft performs several maneuvers to achieve an orbit around Saturn. Near the end of this initial orbit, the Huygens Probe separates from the Orbiter and descends through the atmosphere of
Titan. The Orbiter relays the Probe data to Earth for about 3 hours while the Probe enters and traverses the cloudy atmosphere to the surface. After the completion of the Probe mission, the Orbiter continues touring the Saturnian system for three and a half years. Titan synchronous orbit trajectories will allow about 35 flybys of Titan and targeted flybys of Iapetus, Dione and Enceladus. The objectives of the mission are threefold: conduct detailed studies of Saturn's atmosphere, rings and magnetosphere; conduct close-up studies of Saturn's satellites, and characterize Titan's atmosphere and surface."""

We can now summarize this article by using TransformerSummarizer instance:

ts.summarize(article)

Conclusion

In this article, we explored the BART (Bidirectional and Auto-Regressive Transformers) model. By examining both theoretical insights and practical code examples, we demonstrated how BART’s powerful seq2seq capabilities can be leveraged for tasks like text generation, summarization, and translation.

With the growing demand for AI/ML solutions and scalable infrastructure, DigitalOcean offers the perfect environment to host and scale your machine learning workloads. With their powerful GPU Droplets and managed Kubernetes solutions, you can efficiently deploy and manage Transformer models like BART for real-time inference and training tasks. Whether you’re running intensive NLP models or handling large-scale AI projects, DigitalOcean provides the resources you need to scale quickly, ensuring both cost-efficiency and performance.