Oracles have been a mainstay of blockchain since its early days, evolving as simple crypto needs grew into the complex demands of a flourishing Web3 ecosystem of chains, dApps, and platforms. This evolution hasn’t been without its challenges, with centralization, latency, and data retrieval all creating difficulties as innovators scramble to find solutions.
However, a new threat to the way oracles operate has made its way into the mainstream, inadvertently creating major problems that oracles will need to resolve quickly. The rise of artificial intelligence, and of Large Language Models (LLMs) in particular, has caused many of the information sources that oracles rely upon to dry up.
Let’s dig into why this is happening, what oracles must do to survive, and how collective efforts such as the Pyth Network have already begun building solutions that sidestep this shrinking pool of data.
Large Language Models and Copyright Laundering
Most people have by now heard of ChatGPT in all its wonder, able to write human-like papers, articles, and even detailed answers to the most bizarre questions. ChatGPT is built on a Large Language Model (LLM), and it is far from the only one in use: alongside its underlying GPT-4 are models such as LLaMA 2, Mistral 7B, and more. Each model is set up slightly differently but follows the same overall process.
Using a machine learning architecture called a transformer-based neural network, the model consumes vast amounts of data in the form of books, articles, documents, papers, emails, texts, and essentially anything else it can find. While both quality and quantity matter, once a model has consumed hundreds of gigabytes of plain text from thousands of sources, it learns to recognize patterns of language and reasoning, and after fine-tuning it can understand questions and provide in-depth answers.
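A production transformer is vastly more sophisticated, but the core training objective, predicting the next token from the tokens before it, can be illustrated with a toy bigram model in plain Python. This is a sketch of the objective only, not of the transformer architecture itself:

```python
import random
from collections import Counter, defaultdict

# Toy stand-in for LLM pre-training: learn next-token statistics from raw text.
# A real model uses a transformer and trillions of tokens; the objective is the same.
corpus = "the oracle reads the price and the oracle writes the price on chain"
tokens = corpus.split()

# Count how often each token follows each other token (a bigram table).
follows = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follows[prev][nxt] += 1

def generate(start: str, length: int = 8) -> str:
    """Emit likely continuations by sampling from the learned statistics."""
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break
        choices, weights = zip(*options.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

print(generate("the"))  # e.g. "the oracle reads the price and the oracle writes"
```

Scaled up from one sentence to hundreds of gigabytes of text, the same statistical idea is what lets an LLM produce fluent, human-sounding continuations.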
These types of models aren’t truly intelligent as we think of the human mind; they can’t actually create original thoughts. However, they can regurgitate information in a human-like way, and given enough material to work with, they do it well enough that originality doesn’t seem to matter.
Where this has begun to create problems for oracles is in how these models acquire their data. Depending on the model, an LLM is “fed” by scraping vast amounts of information from the sources mentioned above.
While public-use information might be fair to consume, these data sets also include news sites, academic repositories, and even personal data from emails, documents, and anywhere else these teams can find material to add to the learning pile. In a way, the data collection process for an LLM is much like aggressive commercial fishing: the nets collect everything in an area, the permitted catch along with great amounts of illegal bycatch.
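A crude sketch of that harvesting step is below, with hypothetical URLs standing in for the scraped sites; real pipelines are enormously larger but follow the same fetch, strip, and append shape:

```python
import re
from urllib.request import urlopen, Request

# Hypothetical pages; real crawls cover millions of sites indiscriminately.
PAGES = [
    "https://blog.example.com/post/1",
    "https://journal.example.org/paper/42",
]

corpus: list[str] = []
for url in PAGES:
    try:
        req = Request(url, headers={"User-Agent": "research-crawler/0.1"})
        with urlopen(req, timeout=5) as resp:
            html = resp.read().decode("utf-8", errors="ignore")
        # Strip markup crudely; everything left goes into the training pile,
        # licensed or not. This is the "bycatch" described above.
        text = re.sub(r"<[^>]+>", " ", html)
        corpus.append(text)
    except Exception:
        continue  # dead or blocked pages are simply skipped

print(f"collected {sum(len(t) for t in corpus)} characters of training text")
```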
As the information is processed and fed into the LLM, protected and private data is essentially blended together with everything else, obscuring its copyright status and making it nearly impossible to trace in the model’s output.
However, even if all the data taken were fair game, the LLMs themselves are causing the creators of that information to lose money. LLMs like ChatGPT receive requests from users, and instead of linking back to sources or sending users to the sites holding the original information, they answer the question or write an extended response using the information they have already collected.
In this model, users never visit the sites that create the knowledge, so those sites don’t get the visits, ad revenue, or any of the other revenue and recognition they deserve. This takes away the incentive to publish freely available information, whether the author is looking for recognition, an audience, or traffic-based revenue.
Oracles Need Open Access Information
So why are LLMs such a big problem for oracles? Simply put, many websites are now moving behind paywalls to keep LLM builders from scraping their original content as training data. Because regulation has a long way to go before it catches up to current events, this is one of the few options available.
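Paywalls are one mechanism; many publishers also declare crawler bans in robots.txt, with OpenAI’s GPTBot among the most commonly blocked user agents. Here is a minimal check in Python using the standard library’s robots.txt parser, against a hypothetical publisher URL:

```python
from urllib import robotparser

# Hypothetical publisher that bans AI crawlers in its robots.txt, e.g. with:
#   User-agent: GPTBot
#   Disallow: /
rp = robotparser.RobotFileParser()
rp.set_url("https://news.example.com/robots.txt")  # hypothetical URL
try:
    rp.read()
except OSError:
    pass  # unreachable host in this sketch; a real crawler would retry

for agent in ("GPTBot", "Mozilla/5.0"):
    ok = rp.can_fetch(agent, "https://news.example.com/markets/article")
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```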
Oracles, however, also rely on freely available information from across the web to collect the data they need to put on-chain. Pricing data, account status, and the real-world events that settle smart contracts all depend on freely available sources, and that pool of data is quickly shrinking.
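To make the dependency concrete, here is a minimal sketch of the aggregation step many price oracles perform before publishing a value on-chain. The endpoints and response fields are hypothetical stand-ins for the free web sources described above:

```python
import json
import statistics
from urllib.request import urlopen

# Hypothetical free price endpoints; real oracle nodes poll sources like these.
SOURCES = [
    "https://api.exchange-a.example/v1/price?pair=ETH-USD",
    "https://api.exchange-b.example/v1/price?pair=ETH-USD",
    "https://data.aggregator-c.example/spot/ETH-USD",
]

def fetch_price(url: str) -> float | None:
    """Pull one quote; a paywalled or blocked source simply yields nothing."""
    try:
        with urlopen(url, timeout=5) as resp:
            return float(json.load(resp)["price"])  # assumed response shape
    except Exception:
        return None  # source gone dark: paywall, crawler ban, or outage

quotes = [p for p in (fetch_price(u) for u in SOURCES) if p is not None]
if quotes:
    # Take the median so one bad source can't skew the on-chain value.
    print(f"ETH-USD median from {len(quotes)} sources: {statistics.median(quotes)}")
else:
    print("No free sources left to quote from")
```

Every source that retreats behind a paywall shrinks that list, leaving a thinner and easier-to-skew sample.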
Sadly, this is only the latest in a list of key issues oracles have struggled with, and paywalls will exacerbate them. Some oracles, particularly push oracles, suffer from significant latency, with updates lagging by minutes to an hour or more. When apps are trading volatile token prices, that much latency is a disaster waiting to happen for the end user.
With fewer sources of information available, the time it takes to find data and bring it on-chain via oracles only grows. Centralizing the oracles could potentially bring latency down, but it creates serious single-point-of-failure concerns.
Where Do We Go From Here?
Several solutions in the works address the concerns of rising paywalls, shrinking information, and crippling latency. On the latency front, a number of platforms have begun launching pull-based oracles, where dApps can request updated information on demand, greatly reducing delay.
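The difference is easiest to see side by side. The sketch below contrasts the two update models in simplified Python; it is an illustration of the pattern, not any specific oracle’s implementation:

```python
import time

def latest_offchain_price() -> float:
    """Stand-in for reading aggregated off-chain market data."""
    return 2000.0 + time.time() % 10  # dummy value for illustration

# Push model: the oracle writes on a fixed schedule (or deviation threshold).
# A dApp reading between updates sees a price up to `interval_s` seconds stale.
def push_oracle_loop(interval_s: int = 600) -> None:
    while True:
        price = latest_offchain_price()
        print(f"push: wrote {price:.2f} on-chain")  # stand-in for a chain tx
        time.sleep(interval_s)  # nothing updates until the next cycle

# Pull model: the dApp fetches a fresh signed update at transaction time,
# so the price it consumes is only as old as the round trip itself.
def pull_oracle_read() -> float:
    update = latest_offchain_price()  # fetched on demand, verified on-chain
    print(f"pull: consumed {update:.2f} at transaction time")
    return update

pull_oracle_read()
```

In the push model, staleness is bounded by the write interval; in the pull model, it is bounded by a single round trip.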
However, a pull-based oracle does nothing by itself to solve the shrinking-data problem. The best hope for now may lie in a project started by a group of trading firms and exchanges that saw the dangers facing oracle data collection, both the supply of data and the latency, and agreed to contribute their proprietary data to a shared network so that all members could benefit from the sum total of data.
This project became Pyth Network, which also happens to use a pull-based, on-demand model. The combination of these elements brings latency down to 400ms, fast enough for even the most volatile price movements.
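As a rough illustration of the pull pattern, a consumer first fetches a signed price update off-chain, then submits it alongside its own transaction for on-chain verification. The sketch below queries a Hermes-style HTTP endpoint; the URL format, feed ID, and response shape are assumptions to verify against Pyth’s current documentation, and the on-chain step is only stubbed out in a comment:

```python
import json
from urllib.request import urlopen

# Assumed Hermes-style endpoint and feed ID; check Pyth's docs for the
# current URL format and the ID of the feed you actually need.
HERMES = "https://hermes.pyth.network/v2/updates/price/latest"
ETH_USD_FEED = "ff61491a931112ddf1bd8147cd1b641375f79f5825126d665480874634fd0ace"

# Step 1 (off-chain): pull the latest signed price update on demand.
with urlopen(f"{HERMES}?ids[]={ETH_USD_FEED}", timeout=5) as resp:
    update = json.load(resp)

parsed = update["parsed"][0]["price"]  # assumed response shape
price = int(parsed["price"]) * 10 ** int(parsed["expo"])
print(f"ETH-USD: {price}")

# Step 2 (on-chain, stubbed): the dApp would submit the update's binary
# payload with its transaction so the contract can verify and consume a
# price that is only moments old.
```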
Closing Thoughts
We can hope to see more applications, platforms, and chains turn to these higher-performance oracles that have a firm roadmap for data collection and use an on-demand model to minimize latency. Without these improvements, the industry may start to see some current use cases become unviable, whether from a lack of data or from delays that erode their value.
Regardless, oracle providers should be working on the core issues that technologies like LLMs have created, and stay on the lookout for new complications, so that oracles can evolve one step ahead for the benefit of all of Web3.