Microsoft, Google and Meta are betting on synthetic data to build AI models

Date: 05-10 (4 months ago)

Behind every fluent chatbot response lies an enormous amount of data: in some cases, trillions of words drawn from articles, books and online reviews to teach artificial intelligence systems to understand users' queries. The prevailing view in the industry is that ever more information will be needed to build the next generation of AI products.

However, this approach has a big problem: the supply of high-quality data available on the Internet is limited. To obtain data, AI companies typically either pay publishers millions of dollars to license content or scrape it from websites, exposing themselves to copyright disputes. A growing number of top AI companies are exploring another path, one that divides the industry: using synthetic data, which is essentially fake data.

Here's how it works: technology companies use their own AI systems to generate text and other media. That artificially generated data is then used to train future versions of the same systems, creating what Anthropic CEO Dario Amodei calls a potential "infinite data generation engine." In this way, AI companies can sidestep many legal, ethical and privacy problems.

The idea of synthetic data in computing is not new; the technique has been used for decades, from anonymizing personal information to simulating road conditions for self-driving vehicles. But the rise of generative AI has made it much easier to create higher-quality synthetic data at scale, and has lent the practice new urgency.

At Microsoft, the generative AI research team used synthetic data in a recent project. They wanted to build a smaller, less resource-intensive AI model that nonetheless had strong language and reasoning capabilities. To do so, they tried to mimic the way children learn language by reading stories.

Instead of feeding the model a large number of children's books, the team drew up a list of 3,000 words a four-year-old could understand. They then asked an AI model to create a children's story using one noun, one verb and one adjective from the list. The researchers repeated this prompt millions of times over a few days, generating millions of short stories that ultimately helped train another, more capable language model. Microsoft has open-sourced the resulting family of "small" language models, Phi-3.
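The prompting loop described above can be sketched roughly as follows. The tiny vocabulary, prompt wording and the `generate_corpus` helper are illustrative assumptions, not Microsoft's actual code; in a real pipeline, each prompt would be sent to a teacher model and its reply kept as synthetic training text.

```python
import random

# Stand-in vocabulary; Microsoft's team reportedly used a list of
# about 3,000 words a four-year-old could understand.
NOUNS = ["dog", "boat", "star"]
VERBS = ["jump", "sing", "hide"]
ADJECTIVES = ["happy", "tiny", "brave"]

def make_prompt(rng: random.Random) -> str:
    """Build one story-generation prompt from a random noun/verb/adjective."""
    noun, verb, adj = rng.choice(NOUNS), rng.choice(VERBS), rng.choice(ADJECTIVES)
    return (f"Write a short children's story that uses the noun '{noun}', "
            f"the verb '{verb}' and the adjective '{adj}'.")

def generate_corpus(n_prompts: int, seed: int = 0) -> list[str]:
    """Repeat the prompt template n_prompts times; each resulting story
    from the teacher model would become one synthetic training example."""
    rng = random.Random(seed)
    return [make_prompt(rng) for _ in range(n_prompts)]

prompts = generate_corpus(5)
```

Seeding the random generator makes the prompt stream reproducible, which matters when a run spans millions of prompts over several days.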

"All of a sudden, you have so much more control than you used to," said Sébastien Bubeck, vice president of generative AI at Microsoft. "You can decide at a much more granular level what you want your model to learn."

Synthetic data also lets you better guide an AI system through the learning process, Bubeck said, by adding explanations to the data at points where the machine might otherwise get confused.

However, some AI experts worry about the risks of the technique. A group of researchers from Oxford, Cambridge and several other leading universities published a paper last year explaining why building new AI models on synthetic data generated by ChatGPT can lead to what they call "model collapse."

In their experiments, AI models built on ChatGPT output began to show "irreversible defects" and appeared to lose memory of what they were originally trained on. For example, the researchers prompted a large language model with text about historic British architecture. After retraining the model on synthetic data several times over, it began to generate meaningless gibberish about long-eared rabbits.
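The feedback loop the researchers describe can be illustrated with a toy simulation: repeatedly fit a one-dimensional Gaussian "model" to samples drawn from the previous generation's fitted Gaussian. This is only a caricature of the paper's setup (the function name and parameters are made up for illustration), but it shows how estimation error compounds across generations and the fitted distribution tends to lose its spread.

```python
import random
import statistics

def retrain_on_own_samples(mu, sigma, generations, n=5, seed=0):
    """Toy model-collapse loop: each generation fits a Gaussian to a
    small sample drawn from the previous generation's fitted Gaussian.
    With a small sample size n, the fitted standard deviation tends to
    shrink over many generations, so later 'models' forget the tails
    of the original distribution."""
    rng = random.Random(seed)
    history = [(mu, sigma)]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)      # refit the mean
        sigma = statistics.stdev(samples)   # refit the spread
        history.append((mu, sigma))
    return history

history = retrain_on_own_samples(mu=0.0, sigma=1.0, generations=200)
```

Each refit replaces the true distribution with a noisy estimate of it, and since the fitted spread is compounded multiplicatively, the noise drives it steadily toward zero over many generations.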


The researchers are also concerned that synthetic data may amplify the biases and toxicity in a dataset. Some proponents of synthetic data counter that, with appropriate safeguards, models developed this way can be as accurate as, or even better than, those built on real data.

Dr. Zakhar Shumaylov of the University of Cambridge said in an email: "If handled properly, synthetic data can be very useful. However, there is no clear answer as to how to handle it properly. Some biases can be difficult for humans to detect." Shumaylov is a co-author of the model-collapse paper mentioned above.

There is also a more philosophical debate: if large language models are trapped in an endless cycle of training on their own output, will AI become less a machine that imitates human intelligence and more a machine that imitates the language of other machines?

Percy Liang, a professor of computer science at Stanford University, says companies still need real products of human intelligence, such as books, articles and code, in order to produce useful synthetic data. "Synthetic data is not real data, just as dreaming you climbed Mount Everest is not actually climbing it," Liang said in an email.

Pioneers in synthetic data and AI agree that humans cannot be cut out of the process: real people are still needed to create and refine artificial datasets.

"Synthesizing data is not just pressing a button and saying, 'Hey, generate some data for me,'" Bubeck said. "It's a very complicated process. Creating synthetic data at scale takes a great deal of human effort."
