09/09/2020 | News release | Distributed by Public on 09/09/2020 02:38
José Manuel Gómez Pérez, Ronald Denaux and Andrés García-Silva, all from the Expert System Artificial Intelligence (AI) Research Lab in Madrid, have just published 'A Practical Guide to Hybrid Natural Language Processing.' The book is designed to be a reference on the possibilities offered by the intelligent use of current methods and tools in Natural Language Processing (NLP) and Natural Language Understanding (NLU), and on the short- and long-term challenges that users face as these technologies continue to evolve.
This is part 2 of our interview with one of the authors, José Manuel Gómez Pérez, Expert System R&D & International Projects Director.
What are some of the problems in detecting disinformation in text?
One challenge is the use of ordinary language or terminology as code, which enables members of a group to communicate covertly on platforms like Twitter. Terms like 'snowflake' or 'Karen' carry political connotations, while '420' is an insider term for cannabis culture.
Such neologisms are not defined in any dictionary, nor do they appear labeled in a dataset that we could use directly to train a model to identify them. In fact, there are practically no public datasets in this area that are large enough and of sufficient quality. We would therefore need to train the system on related tasks for which more labeled data is available, such as fake reviews of hotels, restaurants or celebrities, and then transfer the trained model to our domain. Combining the representations learned this way with neural representations of words and concepts extracted from a knowledge graph, using the techniques we describe in the book, can significantly improve results compared to approaches that do not apply this type of data augmentation.
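The enrichment step described above can be sketched in a few lines: a post's neural embedding is combined with embeddings of the knowledge-graph concepts it mentions, and a classifier is trained on the combined vector. Everything below is an illustrative stand-in, not the book's actual pipeline: the toy "encoder", the concept vectors and the concept names are all invented for the sketch.

```python
# Minimal sketch: concatenate a post's text embedding with the mean of the
# embeddings of the knowledge-graph concepts it mentions. All vectors and
# the toy "graph" here are illustrative placeholders.

TOY_KG_EMBEDDINGS = {
    "cannabis_culture": [0.9, 0.1, 0.0],
    "political_slang":  [0.1, 0.8, 0.2],
}

def text_embedding(post: str) -> list[float]:
    """Stand-in for a neural sentence encoder (e.g. a Transformer)."""
    # Toy encoding: crude character statistics, just to have numbers.
    n = max(len(post), 1)
    return [post.count(" ") / n, post.count("4") / n, len(post) / 100.0]

def enrich(post: str, concepts: list[str]) -> list[float]:
    """Concatenate the text embedding with the mean concept embedding."""
    vec = text_embedding(post)
    if concepts:
        dims = len(next(iter(TOY_KG_EMBEDDINGS.values())))
        mean = [sum(TOY_KG_EMBEDDINGS[c][i] for c in concepts) / len(concepts)
                for i in range(dims)]
    else:
        mean = [0.0, 0.0, 0.0]
    return vec + mean  # a downstream classifier trains on this combined vector

features = enrich("meet at 420", ["cannabis_culture"])
print(len(features))  # 6: 3 text dimensions + 3 concept dimensions
```

The point of the sketch is only the interface: the classifier sees both what the text says and what the graph knows about the concepts the text mentions.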
What about fake news?
Fake news and disinformation pose their own challenges. One aspect is identifying whether a post on social media, for example, was posted by a human or a bot (since bots are more likely to be used to spread fake news) based solely on the content of the post. Another is verifying the information that appears in the post itself, which is the goal of the thousands of fact checkers around the world engaged in this work today.
In both cases, the shortage of sufficiently large labeled datasets across languages and topics (currently, most deal with political issues and are in English) is a problem. In addition, in order to understand how disinformation spreads and explain what it consists of and why our system reaches a certain conclusion, we need to combine neural language models, which allow us to semantically compare a post with others that are already verified, with explicit representations of the sources of this information and of the indicators that led us to classify it as credible or not.
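The semantic comparison of a new post against already-verified claims can be illustrated with a nearest-neighbor lookup over embeddings. This is a hedged sketch: the hand-made vectors and claim identifiers below are invented, and a real system would obtain the embeddings from a neural sentence encoder.

```python
import math

# Sketch: compare a new post's embedding against embeddings of
# already-verified claims and return the most similar one.
# The vectors and claim labels are illustrative placeholders.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

VERIFIED = {
    "claim_a_debunked":  [1.0, 0.0, 0.1],
    "claim_b_confirmed": [0.0, 1.0, 0.1],
}

def closest_verified(post_vec):
    """Return (claim_id, similarity) of the most similar verified claim."""
    best_id = max(VERIFIED, key=lambda cid: cosine(post_vec, VERIFIED[cid]))
    return best_id, cosine(post_vec, VERIFIED[best_id])

claim, score = closest_verified([0.9, 0.1, 0.1])
print(claim)  # claim_a_debunked
```

If the closest verified claim is a known falsehood and the similarity is high, the match itself becomes an explainable indicator: the system can point at the verified claim it resembles, which is exactly the kind of justification the answer calls for.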
In the wake of COVID-19, being able to access information in scientific publications is especially valuable. What are the unique challenges here?
In this area, a classic problem in Artificial Intelligence is the development of systems capable of reading a scientific article or a book chapter the way a person would: assimilating the content not only of the text but also of the images, charts and diagrams it contains, so that the system can successfully answer questions about that content.
This presents several important challenges. First, scientific terminology is very broad, heterogeneous, and complex, with many terms that consist of multiple words. Furthermore, although the volume of scientific publications is virtually infinite, annotated datasets for tasks such as question answering related to the understanding of scientific texts are scarce.
We also need to be able to represent the information that appears in different data modalities so that we can reason with it in a homogeneous way. For example, we would need to be able to correlate a diagram representing the process of photosynthesis, the text that describes it, and its representation as a concept in a structured knowledge graph. It is necessary to look for alternative sources of supervision, such as the correspondence between a figure and the text of its caption.
This type of approach allows our models to learn characteristics associated with text and images that we can also enrich with representations of concepts extracted from scientific knowledge graphs. In the book, we demonstrate how to do this successfully in a way that improves the results compared to more conventional approaches.
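The "alternative supervision" idea, that figures already come paired with their captions, can be sketched as a data-preparation step: each figure with its own caption is a positive training example, and the same figure paired with another figure's caption is a negative. The filenames and captions below are invented for illustration.

```python
import random

# Sketch of weak supervision from figure-caption correspondence:
# matching (figure, caption) pairs become positives, mismatched
# pairs become negatives. All data here is illustrative.

figures = ["fig_photosynthesis.png", "fig_cell_membrane.png", "fig_krebs.png"]
captions = ["Diagram of photosynthesis.",
            "Structure of the cell membrane.",
            "The Krebs cycle."]

def build_pairs(figures, captions, seed=0):
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pairs = []
    for i, (fig, cap) in enumerate(zip(figures, captions)):
        pairs.append((fig, cap, 1))                      # matching pair -> positive
        j = rng.choice([k for k in range(len(captions)) if k != i])
        pairs.append((fig, captions[j], 0))              # mismatched pair -> negative
    return pairs

pairs = build_pairs(figures, captions)
print(len(pairs))  # 6: one positive and one negative per figure
```

A multimodal model trained to separate the positives from the negatives learns correlated representations of figures and text without anyone having to annotate a single example by hand, which is the appeal given how scarce annotated scientific datasets are.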
What are the next challenges related to the interpretation of natural language?
In the last two or three years, the emergence of neural language models, above all those based on the Transformer architecture, such as BERT, GPT or RoBERTa, has revolutionized the NLP / NLU landscape, not only in academia and research but also in business. The way problems related to text comprehension are solved has changed radically, thanks to the capabilities of these models and to the cost savings of adapting them to a specific domain: instead of training from scratch, one fine-tunes the pre-trained model on a much smaller set of domain data than would previously have been necessary. In a very short time, most NLP / NLU systems will be built on pre-trained Transformer-based language models.
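The cost saving described here, reusing a frozen pre-trained model and training only a small task-specific part, can be illustrated with a toy. The "encoder" below is a fixed function standing in for a pre-trained Transformer; only the two head parameters are updated, on a handful of "domain" examples. None of this is a real model, just the shape of the fine-tuning idea.

```python
# Toy illustration of fine-tuning: the pre-trained "encoder" is frozen;
# only the small task head (w, b) is trained by gradient descent on a
# few labeled domain examples. Everything here is a stand-in.

def frozen_encoder(x: float) -> float:
    """Pretend pre-trained feature extractor; its parameters never change."""
    return 2.0 * x + 1.0

# Tiny labeled "domain" dataset; targets happen to equal the encoder output,
# so the head should learn roughly w = 1, b = 0.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

w, b = 0.0, 0.0          # the only trainable parameters (the "head")
lr = 0.01
for _ in range(2000):    # plain stochastic gradient descent on squared error
    for x, y in data:
        h = frozen_encoder(x)        # frozen features
        pred = w * h + b
        err = pred - y
        w -= lr * err * h
        b -= lr * err

print(round(w, 2), round(b, 2))  # approximately: 1.0 0.0
```

The asymmetry is the whole point: the expensive part (the encoder) is computed once and reused, while the cheap part (the head) is all that a new domain has to pay for.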
However, we still have a lot to learn in order to understand exactly how these models represent the information contained in language at the lexical, syntactic and semantic levels. Without this knowledge, it is not possible to offer capabilities related to the explanation, justification or interpretation of a model's results, which are critical in domains such as medicine, banking or insurance. In fact, it is extremely difficult to interpret the knowledge captured by these models in logical terms, let alone to label it with specific concepts or relationships, which are precisely the main entities represented in a knowledge graph.
So, where should the research go?
In order to obtain significant advances in language comprehension by machines, it is necessary to enhance the incipient capacity of language models to reason about text. For example, GPT-2 (GPT-3 has just been released) is capable of inferring that the native language of a person from Italy is probably Italian, or that a person from Madrid is Spanish. However, the model cannot calculate the square root of 2, nor does it know about the lethal effect of ingesting hydrochloric acid. Limitations like these prompt us to investigate how to merge the knowledge captured in language models with the symbolic knowledge contained in structured graphs.
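The division of labor implied by these examples can be sketched as a routing scheme: questions a neural model handles poorly, such as exact arithmetic or hard factual lookups, go to symbolic components, and everything else falls back to the model. The "language model" below is a stub, and the tiny knowledge graph and its relation names are invented for the sketch.

```python
import math
import re

# Illustrative hybrid answerer: a symbolic math path, a symbolic
# knowledge-graph path, and a neural fallback. All components are toys.

TOY_KG = {
    ("hydrochloric acid", "ingestion_effect"): "potentially lethal",
    ("Italy", "language"): "Italian",
}

def toy_language_model(question: str) -> str:
    """Stub for a neural LM: fluent, but unreliable for exact answers."""
    return "I am not sure."

def hybrid_answer(question: str) -> str:
    m = re.match(r"square root of (\d+)", question)
    if m:                                   # symbolic math path
        return f"{math.sqrt(int(m.group(1))):.4f}"
    for (subject, relation), value in TOY_KG.items():
        if subject in question:             # symbolic KG lookup path
            return value
    return toy_language_model(question)     # neural fallback

print(hybrid_answer("square root of 2"))             # 1.4142
print(hybrid_answer("effect of hydrochloric acid"))  # potentially lethal
```

The keyword routing here is deliberately naive; the research question in the book is precisely how to make this merge principled rather than rule-based.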
In this sense, knowledge graphs such as ConceptNet or ATOMIC, which focus on representing something as human as common sense, are of great interest. Also promising are the progress in developing complex reasoners based on the composition of simpler models, and the gradual creation of benchmarks oriented to their evaluation, such as DROP.
In terms of R&D, what are the next challenges that your team will focus on?
As I mentioned earlier, one of the main research questions we ask ourselves is: given a language model and a knowledge graph, is it possible to identify a correspondence between the representations of one and the other? Our results indicate that it is, and we address this in the book. The next step will be to create those mappings so that they can be exploited systematically, generating truly intelligent hybrid models that, paraphrasing Hiroaki Kitano, creator of RoboCup, may at some point make Nobel Prize-worthy discoveries on their own.
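One minimal way to read the mapping question, sketched with invented toy vectors: given a representation from the language model's embedding space and a set of concept embeddings from a graph, link the representation to the concept whose vector is most similar. Real alignments are learned rather than assumed to share a space; this only illustrates the interface.

```python
import math

# Toy sketch of an LM-to-knowledge-graph mapping: each graph concept has an
# embedding, and a language-model vector is linked to its nearest concept
# by cosine similarity. All vectors and concept names are invented.

CONCEPT_VECS = {
    "kg:Dog": [0.9, 0.1],
    "kg:Car": [0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def map_to_concept(lm_vec):
    """Return the graph concept whose embedding is closest to the LM vector."""
    return max(CONCEPT_VECS, key=lambda c: cosine(lm_vec, CONCEPT_VECS[c]))

print(map_to_concept([0.8, 0.2]))  # kg:Dog
```

Once such a mapping exists, each neural prediction can be grounded in named graph entities, which is what makes the resulting hybrid models explainable as well as accurate.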