AI Weekly: The promise and shortcomings of OpenAI’s GPT-3


I typically think of the dog days of summer as a time when news slows down. It’s when a lot of people take time off work, and the lull leads local news stations to cover inconsequential things like cat shows or a little baby squirrel on a little baby Jet Ski. But these are not typical times.

Fallout surrounding bias continues at Facebook, as multiple news outlets reported that Instagram’s content moderation algorithm was 50% more likely to flag and disable the accounts of Black users than White users. Facebook and Instagram are now creating teams to examine how algorithms impact the experiences of Black, Latinx, and other specific groups of users.

Also this week: Executives from Amazon, Google, and Microsoft gave more than 30 recommendations to leaders in Washington for the U.S. to maintain an edge over other nations in AI. Recommendations include the idea of recruiting AI practitioners into a reserve corps for part-time government work and creating an accredited academy for the U.S. government to train AI talent.

But arguably the biggest story this week was the beta release of GPT-3, a language model capable of a wide range of tasks like summarization, text generation to write articles, and translation. Tests designed specifically to analyze GPT-3 found it can also complete a range of other tasks, like unscrambling words and using a word in a sentence after seeing it defined only once.

In recent weeks, OpenAI extended access to an API and the language model, which has 175 billion parameters and was trained on a corpus of text from the web that includes about a trillion words. Apps like a layout generator that creates code from natural language descriptions got a lot of attention, as did apps for answering people’s questions or creating American history test questions and answers. A generator that identifies the relationship between objects in the world offered a potential application to help robots or other forms of AI better understand the world. One early GPT-3 user had a chat about God, existence, and the universe that he felt was so profound that “you will become another person after reading it.” A particularly gushing Bloomberg story titled “Artificial intelligence is the hope 2020 needs” suggested that GPT-3 could end up becoming one of the biggest news stories of 2020.

Some discussion around the release of GPT-3 also raised the question of why OpenAI seems less concerned about sharing the much larger GPT-3 than it was about GPT-2, a model that OpenAI controversially chose at first not to share publicly due to its potential negative impact on things like the spread of fake news.

The timing of the release of large language models has been in line with OpenAI’s broader business plan. For context, the GPT-2 release came a month before OpenAI changed its business structure and created a for-profit company. GPT-3 was released less than two weeks before the introduction of the OpenAI API to commercialize its AI.

Emily Bender is a professor, a linguist, and a member of the University of Washington’s NLP group. Last month, a paper she coauthored about large language models like GPT-3 won an award at the Association for Computational Linguistics (ACL) conference. The paper argues that hype around large language models shouldn’t mislead people into believing the models are capable of understanding or meaning.

“While large neural language models may well end up being important components of an eventual full-scale solution to human-analogous natural language understanding, they are not nearly-there solutions to this grand challenge,” the paper reads.

Bender hasn’t tested GPT-3 personally but said that, from what she’s seen, it is impressive yet roughly the same in architecture as GPT-2. The big difference is that it’s massive.

“It’s shiny and big and flashy and it’s not different in kind, either in the overall approach or in the risks that it brings along,” she said. “I think that there’s a fundamental problem in an approach to what gets called artificial intelligence that relies on data sets that are larger than humans can actually manually verify.”

Circulating among the free publicity that early access users are generating for OpenAI are some examples that demonstrate GPT-3’s predictable bias. Facebook AI head Jerome Pesenti found a rash of negative statements about Black people, Jewish people, and women from an AI created to generate humanlike tweets. Of course, that’s not a surprise. Tests included in the paper released in late May found that GPT-3 demonstrates gender bias and is most likely to give Asian people a high sentiment analysis score and Black people a low one, particularly among smaller versions of the model. OpenAI’s analysis also demonstrated shortcomings in specific tasks like word-in-context analysis (WiC) and RACE, a set of middle school and high school exam questions.

Tests earlier this year found that many popular language models trained with a large corpus of data, like Google’s BERT and OpenAI’s GPT-2, demonstrate several forms of bias. Bender, who teaches an NLP ethics course at the University of Washington, said there’s no such thing as an unbiased data set or a bias-free model, and that even carefully created language data sets can carry subtler forms of bias, but some best practices could reduce bias in large data sets.

OpenAI is implementing testing in beta as a safeguard, which may help unearth issues, a spokesperson said, adding that the company is applying toxicity filters to GPT-3. The spokesperson declined to share additional information about what the filters accomplish but said more details will be shared in the weeks ahead.

It’s understandable that the promise GPT-3 represents inspires wonder in some people and brings them closer to the idea of a general model that can do virtually anything with just a few samples of training data. OpenAI CEO Sam Altman tweeted that a 10-year-old boy he showed GPT-3 to said within a matter of seconds that he wanted to enter the AI field.

Altman also said in a tweet Sunday that “The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.”

The OpenAI paper said the approach taken to characterize some attributes of the model was inspired by the “model cards for model reporting” method created by Google AI ethics researchers.

Alongside the need to adopt data sheets or data statements to better understand the contents of data sets, Bender emphasized that the NLP field needs more testing to really tell when models are demonstrating understanding or meeting other grand challenges.

“What’s happened culturally recently … within NLP in the last maybe 10-15 years, there’s been a lot of emphasis on valuing models and model building, and the only value assigned to work around evaluation metrics and task design and annotation is as subsidiary to the model building to allow the model builders to show how good their models are,” she said. “And that’s an imbalanced situation where we can’t do good science. I hope that we’re going to see an increased value placed on the other parts of the science, which isn’t to say that we’re done building models. I’m sure there’s more research to be done there, but we can’t make meaningful progress in model building if we can’t do meaningful testing of the models, and we can’t do meaningful testing of the models if it’s not valued.”

For AI coverage, send news tips to Khari Johnson and Kyle Wiggers and AI editor Seth Colaner — and be sure to subscribe to the AI Weekly newsletter and bookmark our AI Channel.

Thanks for reading,

Khari Johnson

Senior AI Staff Writer
