Have you ever thought that your research paper may have been used to train AI? That's right: many academic publishers are packaging and selling their content to tech companies that develop AI models, a move that has sent ripples through the scientific community, not least because authors often know nothing about it. Experts say that if your work has not yet been used to train a large language model (LLM), it is likely to be "utilized" in the near future.
Recently, UK-based academic publisher Taylor & Francis struck a $10 million deal with Microsoft, allowing the tech giant to use its research data to boost the capabilities of AI systems. And back in June, U.S. publisher Wiley also struck a deal with a company, receiving $23 million in return for its content being used to train generative AI models.
If a paper is available online, whether it's open access or behind a paywall, it's likely to have been fed into some large language model. Lucy Lu Wang, an AI researcher at the University of Washington, said, "Once a paper has been used to train a model, there's no way to remove it after the model has been trained."
Large language models require large amounts of training data, often crawled from the Internet. By analyzing hundreds of millions of language fragments, these models learn to generate fluent text. Academic papers have become a valuable "treasure" for LLM developers because of their length and high information density, and such data helps models reason better about scientific questions.
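To make that training process concrete, the minimal sketch below shows how a passage of scientific prose can be turned into next-token prediction examples, the basic fuel of an LLM. It is an illustration only, not any publisher's or AI company's actual pipeline; the whitespace "tokenizer" and the toy abstract are stand-ins for the subword tokenizers and billions of documents real systems use.

```python
# Minimal sketch: turning a passage into next-token training examples.
# Assumption: whitespace splitting stands in for a real subword tokenizer.

abstract = (
    "Large language models learn statistical patterns from text. "
    "High-quality scientific prose is especially information-dense."
)

tokens = abstract.split()  # simplified tokenization

# Each training example pairs a context window with the token that follows it.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples[:3]:
    print(f"context={' '.join(context)!r} -> next token={target!r}")
```

Scaled up across millions of papers and web pages, predicting the next token from context is what lets a model absorb both the style and the substance of scientific writing.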
The trend of purchasing high-quality datasets has been on the rise, and many well-known media outlets and platforms have begun partnering with AI developers to license their content. Given that much of this material would likely be scraped without compensation or acknowledgment if no agreement were reached, such collaborations will only become more common in the future.
However, while some AI developers, such as the Large-scale Artificial Intelligence Open Network (LAION), choose to keep their datasets open, many companies developing generative AI keep their training data under wraps: "We don't know anything about their training data." According to experts, open-access repositories such as arXiv and databases such as PubMed are popular targets for AI companies to scrape.
Proving that a paper appears in the training set of a particular LLM is not simple. A researcher can prompt the model with an unusual sentence from a paper and check whether the output reproduces the original text, but a negative result does not prove the paper was never used, because developers can tune their models to avoid outputting training data verbatim.
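The sketch below illustrates that probing idea under stated assumptions: `query_model` is a hypothetical placeholder for whatever completion API or local model a researcher would actually call, and the sample sentence is invented. A high overlap score hints at memorization; a low score, as the article notes, proves nothing.

```python
# Hedged sketch of a memorization probe: give the model the start of a
# distinctive sentence from a paper and compare its continuation with the
# original wording. `query_model` is a hypothetical stand-in, not a real API.

from difflib import SequenceMatcher

def query_model(prompt: str) -> str:
    # Placeholder: replace with a call to an actual completion endpoint or model.
    return "model-generated continuation goes here"

def overlap_score(original: str, generated: str) -> float:
    """Rough string similarity between the paper's text and the model's output."""
    return SequenceMatcher(None, original.lower(), generated.lower()).ratio()

paper_sentence = (
    "We observe an anomalous quenching of fluorescence in the doped films "
    "only when the lattice is strained beyond two percent."
)
prompt = paper_sentence[:60]           # unusual opening fragment as the probe
continuation = query_model(prompt)     # what the model produces after it
score = overlap_score(paper_sentence[60:], continuation)

print(f"overlap with original text: {score:.2f}")
# High overlap suggests the passage was memorized; low overlap is inconclusive,
# since developers can tune models not to emit training text verbatim.
```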
Even if it can be proven that an LLM used a particular text, what happens next? Publishers argue that unauthorized use of copyrighted text constitutes copyright infringement, while others counter that LLMs do not copy the text but rather analyze the information it contains in order to generate new text.
A lawsuit over copyright is currently underway in the US in what could become a landmark case. The New York Times is suing Microsoft and OpenAI, the developer of ChatGPT, alleging that they used its news content to train models without a license.
Many academics have welcomed the inclusion of their work in LLM training data, especially if the models become more accurate as a result. However, not every researcher in every field is comfortable with this, and many feel that their work is under threat.
At this stage, individual authors have little say in a publisher's decision to sell, and there is no clear mechanism for assigning credit when published articles are used, or even for knowing whether they have been used at all. Some researchers have expressed frustration: "We would like to have the help of AI models, but we would also like a fair mechanism, and right now we haven't found such a solution."