Investigating the Self-Attention Mechanism Behind BERT-Based Architectures

posted by Fnord666 on Friday September 20 2019, @06:23AM
from the what-about-ernie? dept.

Submitted via IRC for SoyCow2718

Investigating the self-attention mechanism behind BERT-based architectures

BERT, a transformer-based model characterized by a unique self-attention mechanism, has so far proved to be a valid alternative to recurrent neural networks (RNNs) for tackling natural language processing (NLP) tasks. Despite their advantages, very few researchers have studied these BERT-based architectures in depth, or tried to understand the reasons behind the effectiveness of their self-attention mechanism.
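
For readers wondering what "self-attention" actually computes, here is a minimal sketch (Python/NumPy, illustrative only and not the researchers' code) of the scaled dot-product attention a single head performs. Each token is projected into query, key, and value vectors, and the normalized query-key products form an "attention map" describing how much each word looks at every other word.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        """Scaled dot-product self-attention for a single head.
        X: (seq_len, d_model) token embeddings; the W_* matrices
        project them down to the head dimension."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_head = Q.shape[-1]
        # Row i of `weights` says how strongly token i attends to each token.
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        return weights @ V, weights

    # Toy sizes chosen for illustration: 4 tokens, 8-dim embeddings, 4-dim head.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
    output, weights = self_attention(X, W_q, W_k, W_v)
    print(weights.round(2))  # each row of the attention map sums to 1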

Aware of this gap in the literature, researchers at the University of Massachusetts Lowell's Text Machine Lab for Natural Language Processing have recently carried out a study investigating the interpretation of self-attention, the most vital component of BERT models. Olga Kovaleva was the lead investigator of the study and Anna Rumshisky its senior author. Their paper, pre-published on arXiv and set to be presented at the EMNLP 2019 conference, suggests that a limited set of attention patterns is repeated across different BERT sub-components, hinting at over-parameterization.

"BERT is a recent model that made a breakthrough in the NLP community, taking over the leaderboards across multiple tasks. Inspired by this recent trend, we were curious to investigate how and why it works," the team of researchers told TechXplore via email. "We hoped to find a correlation between self-attention, the BERT's main underlying mechanism, and linguistically interpretable relations within the given input text."

BERT-based architectures have a layered structure, and each layer consists of so-called "heads." For the model to function, each of these heads is trained to encode a specific type of information, thus contributing to the overall model in its own way. In their study, the researchers analyzed the information encoded by these individual heads, focusing on both its quantity and quality.
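
To make that structure concrete: the widely used bert-base configuration stacks 12 layers of 12 heads each, 144 heads in total. Assuming Hugging Face's transformers library (a common tool for this kind of inspection, not necessarily what the study used), every head's attention map can be pulled out directly:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

    inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
    with torch.no_grad():
        attentions = model(**inputs).attentions

    # One tensor per layer, each holding every head's map for this sentence:
    # 12 layers x (batch=1, heads=12, seq_len, seq_len).
    print(len(attentions), attentions[0].shape)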

"Our methodology focused on examining individual heads and the patterns of attention they produced," the researchers explained. "Essentially, we were trying to answer the question: "When BERT encodes a single word of a sentence, does it pay attention to the other words in a way meaningful to humans?"

The researchers carried out a series of experiments using both basic pretrained and fine-tuned BERT models. This allowed them to gather numerous interesting observations related to the self-attention mechanism that lies at the core of BERT-based architectures. For instance, they observed that a limited set of attention patterns is often repeated across different heads, which suggests that BERT models are over-parameterized.
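
The paper's own analysis is more involved, but a rough way to see such redundancy for yourself is to flatten all 144 head maps for one sentence and measure how similar they are pairwise (again assuming the transformers library; this is an illustration, not the authors' methodology):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

    enc = tokenizer("The quick brown fox jumps over the lazy dog.",
                    return_tensors="pt")
    with torch.no_grad():
        atts = model(**enc).attentions  # 12 layers x (1, 12, seq, seq)

    # Stack all 12 x 12 = 144 heads and flatten each map into a vector.
    maps = torch.cat([a.squeeze(0) for a in atts]).flatten(1)  # (144, seq*seq)
    maps = torch.nn.functional.normalize(maps, dim=1)
    sim = maps @ maps.T  # (144, 144) pairwise cosine similarities

    off_diag = sim[~torch.eye(144, dtype=torch.bool)]
    print(f"mean similarity between different heads: {off_diag.mean():.2f}")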

"We found that BERT tends to be over-parameterized, and there is a lot of redundancy in the information it encodes," the researchers said. "This means that the computational footprint of training such a large model is not well justified."


Original Submission

Related Stories

Hugging Face Raises $15 Million to Build "Open Source Community for Conversational AI"

Hugging Face raises $15 million to build open source community for cutting-edge conversational AI

Hugging Face has announced the close of a $15 million series A funding round led by Lux Capital, with participation from Salesforce chief scientist Richard Socher and OpenAI CTO Greg Brockman, as well as Betaworks and A.Capital.

New York-based Hugging Face started as a chatbot company, but then began to use Transformers, an approach to conversational AI that's become a foundation for state-of-the-art algorithms. The startup expands access to conversational AI by creating abstraction layers for developers and manufacturers to quickly adopt cutting-edge conversational AI, such as Google's BERT and XLNet, OpenAI's GPT-2, or AI for edge devices. More than 1,000 companies use Hugging Face solutions today, including Microsoft's Bing.
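
To give a sense of the abstraction layers being described: with Hugging Face's transformers library, loading a pretrained model for a task takes only a few lines (the model name below is an illustrative choice, not something from the article):

    from transformers import pipeline

    # One line pulls down a pretrained BERT and wraps it for masked-word
    # prediction; "bert-base-uncased" is an example model choice.
    fill = pipeline("fill-mask", model="bert-base-uncased")

    for candidate in fill("Conversational AI is becoming [MASK] to build."):
        print(candidate["token_str"], round(candidate["score"], 3))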

The funding will be used to grow the Hugging Face team and continue development of an open source community for conversational AI. Efforts will include making it easier for contributors to add models to Hugging Face libraries and releasing additional open source tech, such as a tokenizer.

Also at TechCrunch.

Related: Facebook Open sources PyText NLP Framework
Mozilla Expands Common Voice Database to 18 Languages, With More on the Way
Investigating the Self-Attention Mechanism Behind BERT-Based Architectures


Original Submission

This discussion has been archived. No new comments can be posted.
  • (Score: 2) by FatPhil (863) on Friday September 20 2019, @08:47AM (#896431) (3 children)
    > Our manually constructed list of negation words consisted of the following words:
    > neither, nor, not, never, none, don’t, won’t, didn’t, hadn’t, haven’t, can’t, isn’t, wasn’t, shouldn’t, couldn’t, nothing, nowhere.

    Tell me it ain't so; except for a few weirdos we shan't mention, surely nobody could think that without "wouldn't" there would be zero problems, no?

    But apart from that, after reading the summary, the article, and the paper, I still haven't got a fricken clue what is doing what with what and how - let alone why. What does this thing *do*? For me, or the man on the street, I don't care about your abstract academic corpus of sentences and test suites. Can it pass a 7-year-old's reading comprehension test yet? If not, shut up, and go back to your labs.
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 0) by Anonymous Coward on Friday September 20 2019, @10:06AM (#896445)

      I wouldn't've thought so.

    • (Score: 2) by Rupert Pupnick (7277) on Friday September 20 2019, @12:02PM (#896467) (1 child)

      Anyone know what BERT is without reading TFA? I bet it's not Bit Error Rate Tester.

      • (Score: 2) by hendrikboom (1125) on Friday September 20 2019, @03:12PM (#896523)

        Having read the summary, the article, and the linked "for further information" article, I still don't know what BERT is or what self-attention is. But I do get the impression that it has something to do with linguistic connexions between various parts of sentences.

  • (Score: 2) by NotSanguine (285) on Friday September 20 2019, @08:22PM (#896637)

    From TFS:

    Aware of this gap in the literature, researchers at the University of Massachusetts Lowell's Text Machine Lab for Natural Language Processing have recently carried out a study investigating the interpretation of self-attention, the most vital component of BERT models.

    Upon skimming TFS, they mention architectures and leaderboards, which made me think it was about gaming. But then they mention NLP, which I interpreted as Neuro-Linguistic Programming [wikipedia.org], which turns out not to be what they mean (Natural Language Processing).

    The sentence from TFS above gives me the most clues (I think), that they're working on one or more of the following:
    1. Applying Semantics [wikipedia.org]/Semiotics [wikipedia.org], Epistemology [wikipedia.org], or both to (computer) programmatic constructs;
    2. Applying Semantics/Semiotics, Epistemology, or both to human use of language;
    3. Interpreting poorly scanned/handwritten documents to improve OCR mechanisms (which isn't necessarily mutually exclusive with (1) above).

    Then again, they may well be doing something completely different. There's really no way to tell without RTFA.

    Perhaps some kind soul will do so and explain all of this.

    --
    No, no, you're not thinking; you're just being logical. --Niels Bohr