Text2SQL

Written by: Dr. István Szakadát

Machine search in text corpora

When CD-ROMs appeared in the early nineties, they were advertised as being able to store the text of as many books on one disc as would fit on a traditional bookshelf in paper format. Moreover, with the help of computers, searching through them was very fast, which contrasted all the more starkly with human searching abilities. "Let's find how many times dogs are mentioned in all of Shakespeare's works!" This task seemed an impossible challenge for a human, while the computer answered in an instant and even showed the results immediately. The incredible performance of machines became even more evident when the web appeared, and the digital re-accumulation of human knowledge began. We started building a new world of human knowledge in which anyone could access anything, anytime, from anywhere (with some exaggeration). The quantity of documents accessible through the web increased at an astounding rate. In 2010, Eric Schmidt, then CEO of Google, stated that:

"Nowadays, we produce as much information every two days as we did from the beginning of time until 2003."

The digital universe continuously expanded, and access to the content of the gigantic text corpus was made possible by the emerging machine services, the search engines. In 2000, Tim Berners-Lee, the father of the World Wide Web project, stated that the first decade of the web had fulfilled its mission. We store the entire knowledge of humanity in a single interconnected system, and search engines can read the accumulated vast amount of text lightning fast, effectively helping people find the documents they are looking for. At this point, Berners-Lee announced the web program for the coming decades, setting new goals for the next developments.

"The first decade of the web was about teaching machines to read; now it's about teaching them to understand the texts."

Why was it necessary to set new goals? What was the reason for this? Although the performance of the machines was impressive, it was also clear that computer search had its limitations. When we tasked the computer with searching for the word 'gyula', it immediately returned the results (ever more of them over time), but it couldn't handle sentences like:

"Gyula Gyula was the gyula in the city named Gyula for five years."

The machine found the search term 'gyula' four times within the sentence, but it couldn't make the distinction that a human would immediately recognize – that the found term appears in four different senses in the quoted sentence: 'gyula' could be a surname, a first name, a title, and a city name (and we could find or construct even more interpretations/usages). At a certain level of language – the syntactic level – the search engine is efficient, but at the next – semantic – level, it is not. The machine – at this point, at this time – does not know that 'gyula' can mean several things within a sentence. The above example illustrates what Tim Berners-Lee might have meant when he said, "machines don't understand what they read."
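A few lines of Python can illustrate this syntactic level of search: a character-based match finds all four occurrences of the term, but nothing in the result distinguishes the four senses.

```python
import re

sentence = "Gyula Gyula was the gyula in the city named Gyula for five years."

# Character-based (syntactic) search: case-insensitive whole-word match.
matches = re.findall(r"\bgyula\b", sentence, flags=re.IGNORECASE)
print(len(matches))  # 4 occurrences -- but the machine cannot tell
                     # surname, first name, title, and city name apart.
```

Every hit is just a run of characters; the semantic question of *which* 'gyula' was found never even arises at this level.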

The promise of resolving this issue lay in the development of semantic search techniques. This is why the new web program was named the 'Semantic Web' initiative. However, this program did not achieve the rapid and spectacular successes seen in the previous decade of the web project.

There was, of course, another problem with search engines. When a person formulated a search query, the machines returned a list of results consisting of documents containing the search term. Over time, these result lists grew ever longer, and this became a permanent trend. Soon, search results lists containing billions of items appeared. But what was a person supposed to do with such quantities? In a certain sense, we returned to where we started: the task of searching fell back on humans. They had to manually explore the documents in the results list, clicking one after another to see in what context the search term appeared in the selected text. We asked for little but received too much. We could say that the machine had become overly verbose. What we needed was for the search to be more focused.

Reliability, Verbosity, Relevance Handling

There was, of course, a response from search engine developers to this verbosity. Whether from introspection or from experience, they could assume that when a list was too long, users would only look at the suggestions at its top. In this case, the most crucial aspect of the service became the quality of the search result list's relevance ranking (what the algorithm places at the top of the list, and why). There were many attempts at this in the early days. Then Google appeared and quickly dominated the search engine market with its service, because Google's search engine had new and much better relevance handling capabilities than its competitors (AltaVista, Lycos, HotBot, and others).

However, the success of Google's search engine was not due to semantic competence or any other linguistic ability, but to a novel way of gathering and algorithmically processing the evaluative information (value expressions) that people had scattered across web documents. We can now say that by collecting these data traces and creating relevance indicators from them, Google did nothing more than exploit one of the peculiar manifestations of collective human intelligence (and although it is not often talked about, the Google search engine conceals perhaps the most significant voluntary and unconscious crowdsourcing project). Of course, this does not diminish the merits of Google's engineers, as building the technical infrastructure and developing the algorithm required a great deal of engineering knowledge.

Another type of criticism could also be levelled against search engines. If we consider this human-machine relationship a communication act where the human asks and the machine answers, and we know that the search engine's response is to return a results list, it can be stated that the question and the answer differ in genre and linguistic quality. The person enters a search term, thus asking a question (an interrogative sentence), and the machine returns a list of titles (shorter or longer series of sentences) in response, with the "task assignment" that the person should follow the offered links and read the accessible texts one after another. This relationship significantly differs from the basic form of human communication, where we communicate with each other in sentences, ask questions (sentences), and expect answers (sentences) from the other person. 

This difference provided the basis for developing new types of search systems. It was logical that if we expect "one-sentence" answers to "one-sentence" questions, the new functionality system should be named 'Q&A' (question and answer) or 'question-answer system.' 

Question-Answer Systems

Attempts to develop question-answer systems have been ongoing for a long time, but no one has achieved a breakthrough success. The obvious reason for this is that successful operation requires modelling human semantic capabilities and implementing these models on computers.

This task is far from as easily achievable as executing the operation of syntactic-based (character-based similarity) search. It requires uncovering the internal structure of sentences, the relationships between words and phrases, their meanings, and their connections to language use contexts—in other words, the entire rule system of natural language. So far, this has not been accomplished. Many believe it may not even be possible, and some think it is a completely pointless endeavour. However, many have tried and continue to seek solutions.

Semantic Search

Google began developing the semantic capabilities of its search engine quite early (in the early 2000s). It has integrated many semantic modules of smaller and larger significance into its search engine, though these have always focused on specific areas of knowledge or particular data types. However, the major breakthrough is still awaited: it would come if Google suddenly became a semantic search engine, but the ongoing quantitative changes have visibly not yet reached the level necessary for a qualitative change.

Compared to the early days, Google's search results page has for quite some time provided different services for different questions, owing to the continuous integration of new functionalities that rely on semantic capabilities. From the beginning, the search engine has functioned as a calculator, a currency converter, and a postal code search tool, and over time it has come to handle an increasing number of knowledge areas a little differently from the general case.

More and more frequently, we receive search results pages where, instead of a list of related web pages (i.e., a set of document titles) at the top of the page, we see relevant – unique – information related to the question asked (setting aside page transformations applied for advertising and relevance management reasons, as there have been many changes due to those as well).

As a convincing example, we can refer to the solution where, when we enter the query 'Author of The Man with the Golden Touch' as a search term, we get the specific answer to the specific question at the top of the page (updated to the present, of course). This is followed by a lot of additional information, and only then comes the long list of pages that contain the searched expression.

Despite spectacular partial successes, recognizing and understanding the internal structure of sentences, the meanings of words and phrases, that is, developing the semantic capabilities of machines, remains a persistent challenge. Meanwhile, other technologies designed for different purposes have emerged and matured, which—partially or entirely—have transformed and continue to transform our views on the linguistic capabilities of machines.

Machine Conversation Based on Text Corpora

Developments in artificial intelligence (AI) based on deep learning algorithms have followed a very different logic from search engines, and by the 2020s these developments had achieved incredible and astonishingly rapid successes. The new technologies based on LLMs (Large Language Models) elevated – in a few but very important respects – the speech capabilities of machines to the quality of human discourse. With ChatGPT, one can engage in discourse exactly the same way – and in many respects with the same quality – as we converse with each other: we ask questions and receive well-formed sentences in response, both syntactically and semantically. Such a breakthrough, such explosive change, has rarely been experienced since the beginning of the digital world.

What can an AI-based speaking machine do? Speak – just as we humans can. What underlies this technology? A multitude of algorithms, numerous calculations, and a vast amount of text. We have not delved into the algorithms, the computational capacity requirements, or the technical infrastructure deep within the service so far, and we can disregard them now as well. These are obviously extremely important for ensuring reliable operation, but they are not really necessary for comparing the processes analyzed here, establishing the interpretive framework, and making final evaluations.

To simplify matters, let's say that this speaking machine also uses the same vast text corpus as the search engine. This "shared" text corpus is nothing but a massive collection of previously scattered, human-created texts that have been gathered and digitally processed. 

This corpus enables the search engine to find relevant text locations within the entire set when given a query and return them as results, allowing humans to see the context in which the search term appeared in a previously recorded document. If the search engine works well – in a technical, syntactic sense – the semantic correctness of the answers provided by the machine cannot be questioned. If we jump to a page offered by the search engine and find the information there unsuitable or insufficient for us, the machine cannot be held responsible because it did not produce it. Of course, one of the important services of the search engine is the relevance criteria by which it ranks the result documents, as this significantly influences what we read (and what we don't). However, even in this case, we cannot hold the machine accountable for the quality of the readable answers. The search engine can always defend itself in case of a "bad answer" by pointing to some part of the text corpus, saying that another person said (wrote) this, so they are responsible for the content of the answer.

The aspect of context management is important here. Human linguistic communication is extremely context-sensitive, meaning that our linguistic utterances can only be truly understood in a given context, and our similar or very similar utterances (words, expressions, sentences) can carry different meanings from context to context. This is what we mean by saying our language use (our language) is extremely flexible. This was one of the biggest obstacles to developing effective machine language capabilities for a long time. The speaking machine solved this problem by being able to identify and learn from the vast number of language use contexts found in its text corpus, based on the words, expressions, sentences, text environments, and statistical patterns among them. Since its corpus consists of texts previously recorded by humans, we can say that the machine learns in which contexts people use what words, what expressions, and what answers to questions with what probability. With sufficient data, computational capacity, money, and of course, sufficient human intelligence, this combination can eventually become operational.

Hallucination

Although the conversational machine can engage in dialogue similarly to how humans converse with each other, we cannot say that it has reached the level of intelligence. But why not? The conversational machine does not use its corpus as a reference but to participate in conceptually well-formed discourse. It knows that a dog barks, a cat has kittens, not chicks, an airplane flies but is not a bird, a penguin cannot fly yet is a bird, etc. Its sentences are almost always well-formed, and its knowledge about the world is convincing. It often performs well on the particular, individual, factual level of human knowledge, possessing an extensive factual knowledge base. However, on this level, it can easily make mistakes. It quickly became evident that the conversational machine often hallucinates, producing factually unfounded, erroneous responses. We will not delve into the reasons for this here, even though perhaps the most serious criticism against the conversational machine is directed at this weakness. From the perspective of our train of thought, however, this shortcoming is not particularly significant.

Translation

Although the phenomenon of hallucination may raise doubts – and we can also debate whether substantial improvement can be hoped for in this area, and if so, how and to what extent – it is hardly debatable that the conversational machine's speech capabilities are at a very high level. One consequence of this ability is that the system has excellent translation capabilities between different languages. It is not only adept at translating between two natural languages but is also capable of translating between natural and formal languages. Exploiting this latter capability holds enormous potential, particularly when the conversational machine is made to function as a translator between natural languages and a specific formal language, SQL.

SQL (Structured Query Language) is the query language for relational databases. This formal language was established early in the development of computer science to standardize the extraction of information stored in databases. With some simplification, we can say that all knowledge built into databases by humans over the past decades (nearly fifty years) is accessible through SQL commands. While this language is not complex, laypersons cannot use it, so they need the help of specialists to access the information stored in databases. This has imposed limitations on those for whom continuous access to and everyday use of this wealth of knowledge would otherwise be important.
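As an illustration of what such access looks like, here is a minimal sketch using Python's built-in sqlite3 module; the sales table and its contents are invented for the example.

```python
import sqlite3

# A hypothetical in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("bread", 12, 1.5), ("milk", 8, 0.9), ("bread", 5, 1.5)])

# The SQL command a layperson would need a specialist to write:
query = "SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY product"
for product, total in conn.execute(query):
    print(product, total)  # bread 17, then milk 8
```

Even this simple aggregation already requires knowing the table's structure and SQL's syntax – exactly the barrier the text describes.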

At this point, the conversational machine offers new opportunities. Before we delve into how and why this is the case, we need to understand what the concept of a database means and why it is so important from the perspective of knowledge representation and knowledge management.

Machine Search in Databases

At the dawn of computer science, experts began building databases, and this fact alone is remarkable. However, the importance of databases is further highlighted by the immense financial value they represent within the economy, as well as the vast amount of information stored in this form.

The quantitative dominance of databases can be illustrated using a conceptual pair introduced for seemingly different purposes. In the early 2000s, the concepts of the surface web and deep web emerged, which have since been used in various senses and for various purposes. The surface web is defined as the collection of information freely accessible through the web. Here, free access means that both people and machines can reach and read the information stored on the given page. This is contrasted with the notion of the deep web, which refers to those sites that are technically freely accessible via the network but are practically restricted in some way.

There are multiple answers to why the deep web pages cannot be used as freely and unrestrictedly as those on the surface web. On one hand, there are sites (quite a few) that place technical barriers to entry (password-protecting the given area). In these cases, technical and legal barriers prevent free use. But there is another access barrier, not explained by technical and legal obstacles, but by the fact that to use the data stored in freely accessible web databases, one needs to know the internal structure of the database and have the ability to use the SQL language. Neither machines nor humans can overcome this obstacle. When search engines reach such sites and want to harvest the content found there to incorporate it into their search services, they cannot query the database content because they do not know the schema information. If they did, they could harvest data just like they do with content found on the surface web.
Humans are even more "helpless": even with schema information, they could not extract data from the databases because they do not know how to use the SQL query language. Yet the stakes are high at this point. Expert estimates suggest that, overall, orders of magnitude more information is stored on these deep web pages than on the entire expanse of the surface web.

Databases are not only important because they provide access to an incredible amount of information, but also because their quality is in some ways better – more accurate and clearer – than that of simple text documents. To understand why, we need to grasp the essential qualities of a database and the difference between "simple" text and a database.

Database

When shopkeepers thousands of years ago started keeping records on paper (or clay tablets) to track how much of each product they sold daily, they were essentially entering words, phrases, and numbers into a table. The tabular arrangement of linguistic information back then was a form of knowledge representation similar to today's databases. In other words, we have long known that writing text in tables can have advantages over simply stringing sentences together. Information stored in a table can be read linearly, just like written text, but it doesn't require some of the grammatical rules necessary for well-formed sentences. However, a table is not only arranged in one direction (like text) but also organizes its elements into columns. This means the table can be read and evaluated both horizontally and vertically (as the shopkeeper does when they enter the number and price of products sold each day and then sums these values at the end of the day to find out the daily turnover or revenue). The essence of the tabular format is its arrangement in two dimensions. We can say that a table is structured text.
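The shopkeeper's two-dimensional reading can be sketched in a few lines of Python; the products and numbers are invented for the example.

```python
# A sketch of the shopkeeper's table: each row is one sales entry
# (the horizontal reading), each column one meaningful component
# (the vertical reading).
sales = [
    # product, quantity, unit price
    ("flour", 3, 2.0),
    ("salt",  5, 0.5),
    ("oil",   2, 3.0),
]

# Horizontal reading: each row is one statement about the world.
for product, qty, price in sales:
    print(f"Sold {qty} units of {product}")

# Vertical reading: summing across a column yields the daily revenue.
revenue = sum(qty * price for _, qty, price in sales)
print(revenue)  # 3*2.0 + 5*0.5 + 2*3.0 = 14.5
```

The same information written as free-flowing sentences would permit only the horizontal reading; the columnar arrangement is what makes the vertical operations possible.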

Short Linguistic Theory Digression

At this point, a more thorough discussion would necessarily address a type of human linguistic capability not yet covered, for which Ferdinand de Saussure's theory could provide insight. According to his theory, we can explain why and how we can create structured text. Here, it is sufficient to recall from Saussure's theory that since language use requires two types of abilities from humans, our communicative activity must be understood and interpreted in two dimensions. 

The first linguistic ability is when, during our utterances, we string our words together linearly to form meaningful sentences. This can be understood in the morphosyntactic dimension, and the question here is how we can form well-constructed sentences. This is the primary level of our linguistic ability, where immediate visible/audible results are produced. At this level (with this ability), our sentences are formed, and from these sentences, our audible speeches or readable written texts and documents are created.

However, Saussure recognized that we have another linguistic ability: when we compose our sentences, we also pay attention, in another dimension, to the rules by which we can insert current words into morphosyntactic forms, patterns (sentence schemas). This ability can be described as a kind of classificatory (semantic, ontological) knowledge that operates based on our knowledge of the world. When we form sentences, certain words' usage is permitted or forbidden in a specific position within a given sentence type (sentence schema). This knowledge – under normal circumstances – does not appear in either the acoustic or visual space, but we still use it. Saussure called this the associative dimension; his followers today often use the term paradigmatic dimension for the same concept. When we write something in a table, during the process of organizing into columns (the vertical dimension), we utilize this ability.

Using a database (table) means that we record (represent) our statements (knowledge) about the world in such a way that within the statements, we separate the meaningful components (words, phrases). By doing so, we are able to handle the components of our statements separately (reference, search, calculate, etc.). This way of shaping linguistic messages makes the content of our statements much clearer and more precise compared to statements expressed in free text. Because of the clarity resulting from the structure, we can do much more with the sentences expressed this way, perform various operations, and extract more from the same set of sentences. This qualitative advantage gives databases their strength and benefits.

Therefore, databases are richer in data compared to text, providing us with more, but at a cost. This added value arises because building databases requires a lot of work – consisting of organizational operations – and this initial investment is often quite high. Fortunately, many have taken on this work, resulting in many databases being built in the past and continuing to be built in the present.

And at this point, we can return to our original line of thought.

Text2SQL

We left off at the point where:

  • On one hand, we have a conversational machine (chatbot) that can communicate very well in natural language, translates fairly well between natural languages, as well as between natural and certain formal languages, but is not factually reliable and often hallucinates.
  • On the other hand, we have databases that organize knowledge in a structured way, ensuring that the accessible knowledge is reliable, accurate, and can be searched, reorganized, and calculated in many ways.

As previously indicated, these two capabilities can be connected by using the conversational machine solely to translate natural language questions posed by SQL novices into SQL commands, which can then be sent to the database. The tabular answers received from the database can be translated back into free text. In this case, there is no risk of hallucination because the answers are not expected from the conversational machine but from the databases, with the conversational machine only serving as an interpreter. Of course, this requires metadata describing the structure of the databases, schema information, and the conversational machine must be tuned for this specific translation task. This, however, seems feasible.
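This division of labour can be sketched minimally as follows. The conversational machine is replaced here by a hardcoded stub so the sketch stays runnable; in a real system an LLM, tuned on the database schema, would generate the SQL at that point. The table, data, and question are all invented.

```python
import sqlite3

def translate_to_sql(question: str) -> str:
    """Stand-in for the conversational machine. A real system would call
    an LLM here; this fixed mapping merely marks its place."""
    known = {
        "How many books did we sell?": "SELECT SUM(quantity) FROM sales",
    }
    return known[question]

# A hypothetical database the chatbot has access to.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("book", 4), ("book", 3)])

# The pipeline: natural-language question -> SQL -> database -> answer.
sql = translate_to_sql("How many books did we sell?")
(answer,) = conn.execute(sql).fetchone()
print(answer)  # 7 -- the answer comes from the database, not from the model
```

The crucial point is visible in the last lines: the model only produces the query, while the factual content of the answer is drawn from the database, which is why hallucination is kept out of the loop.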

For the database maintainer, an important consideration might be that the schema information does not need to be publicly released; it is sufficient to show or teach it to the chatbot. If the chatbot knows the structure of the database (tables, fields of the tables, types, relationships between tables, etc.), it will know how to formulate queries to extract the desired data from the database. Naturally, it needs access to do this, but that can be arranged. It is possible that the chatbot cannot yet formulate every query as a human expert would, but it can already generate simpler commands accurately and will surely be able to improve this capability significantly in the near future.
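How a maintainer might collect the schema information to show the chatbot can be sketched with SQLite's own metadata facilities; the books table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")

# Collect the schema description that would be shown to the chatbot
# (table names, fields, types) -- without publishing the data itself.
schema = {}
for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    schema[table] = [(name, ctype) for _, name, ctype, *_ in cols]

print(schema)  # {'books': [('id', 'INTEGER'), ('title', 'TEXT'), ('author', 'TEXT')]}
```

A description like this, handed to the chatbot as context, is enough for it to formulate syntactically valid queries against tables whose contents it has never seen.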