Understanding The Data Flood: Why Ontologies Are Critical
Enabling humans and AI to quickly derive truly useful meaning from data requires us to inject context and relational information by linking the data to knowledge models
“In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients.” –Herbert Simon
Amazing advances in technology and the move toward open standards have enabled us to make great progress in helping government agencies achieve their data integration goals. While data wrangling and cleaning are still necessary steps before integration and analysis, automation is aiding these processes. We can now bring in all sorts of data from many types of sensors, transported via diverse protocols and conforming to many data standards, into a single, integrated system for storage, analysis, and display.
However, this integration of data into a centralized location does not necessarily result in better understanding of it. Data dominance does not necessarily equal decision superiority. In fact, there is now a greater risk of encountering the situation that American economist and scientist Herbert Simon described in his insightful quote several decades ago.
We now have simple terms for this situation: “drowning in data” or even “data paralysis.” The danger is real, but it is not an inevitable destination if we follow a smart path to data integration.
The limitations of traditional data modeling
Once we start to drown in our data, the first response is figuring out better visualization of it all. "Give me dashboards so I can visualize my data," our customers may say.
Often, dashboard creation requires additional data wrangling and reformatting, which can be done fairly quickly. But even with a collection of dashboards, you may just be presented with a bunch of numbers transformed into lines, colors, and polygons. You can certainly derive knowledge from visual representations in some cases, but you are confronted with more data in more formats in all cases.
Once we determine that more graphs, bars, and colors are helpful in some cases but not sufficient to fully understand what we really need to know for decision-making, the next step is calling in a data scientist to wrangle with the data some more, do some analysis, and provide answers to specific questions that simple visualization did not deliver. But this requires you to call the data scientist over and over again whenever you have another question.
What often slows down the data scientist’s responsiveness is the need to relate the data to real-world concepts. Data scientists oftentimes are not domain experts, so they must work with subject matter experts to understand each data element and how the elements relate to each other before they can develop the right set of analytics. With traditional modeling techniques, this level of understanding cannot be obtained by simply looking at the model.
The next attempt to achieve more rapid sense-making of data is using artificial intelligence (AI) and machine learning (ML) tools. The hope is that the computer can help us make sense of everything and do it at the “speed of decision-making.” But in taking this step toward AI/ML sense-making and decision support, it is important to give the machine clues regarding the true meaning of all the data. From a traditional data model alone, the machine will understand the meaning of the data no better than the data scientist does. Most important, meaning from data should be derived from human understanding or from our shared mental world, with that knowledge passed on to the machine.
Giving meaning to incoming data
The best way to do this is through the deployment of ontologies, which are developed to represent the shared understanding of the domain of interest in a logical way. This formalized, machine-readable ontology is what we call a knowledge model. AI can be employed to try to make sense of data, but unless it is calibrated with this knowledge model, it will be learning on its own and just memorizing what you tell it to memorize.
For example, you can train a neural network-based algorithm to detect representations of cats in images. However, all you are really doing is telling the computer to learn a pattern of pixels, look for images in which a similar pattern appears, and then label them with what is, to the machine, an arbitrary string of three letters starting with “c.”
The computer does not understand what the three-letter string really signifies, so it can infer nothing further when it finds a cat in a picture. But once you introduce the label “cat” and attach an ontology to it (e.g., it is an animal, it has fur, it is feline), you can ask the machine questions such as, “Did you find any pictures of animals?” and it can show you all the cat pictures.
More important, you can give the machine standing instructions like, “Alert me when you find any threats in the pictures,” and it can tell you each time it finds a tiger in an image. It can do so because the ontology tells it what a tiger is, provides sufficient information on the tiger's dangerous features (e.g., it has large, sharp claws and teeth and is much more powerful and potentially harmful than a cat), and equates those features with potential threats. Now we are using the computer for decision support.
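To make the cat and tiger examples concrete, here is a minimal sketch in Python of how labels attached to a small ontology can answer the “animals” and “threats” questions. Everything in it is an illustrative assumption rather than a real knowledge model: the concept names, the property names, the rule that a concept counts as a threat when it has both large, sharp claws and great power, and the image file names are all hypothetical. A production system would more likely express the ontology in a standard such as OWL and use a reasoner instead of hand-written lookups.

```python
# Toy ontology: all concepts, properties, and file names are hypothetical.
ONTOLOGY = {
    "cat": {"is_a": ["feline"], "properties": {"has_fur"}},
    "tiger": {
        "is_a": ["feline"],
        "properties": {"has_fur", "large_sharp_claws", "powerful"},
    },
    "feline": {"is_a": ["animal"], "properties": set()},
    "animal": {"is_a": [], "properties": set()},
}

# Assumed rule: a concept is a potential threat if it has all of these.
THREAT_PROPERTIES = {"large_sharp_claws", "powerful"}


def ancestors(concept):
    """Return every concept reachable from `concept` through is_a links."""
    seen = set()
    stack = list(ONTOLOGY[concept]["is_a"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(ONTOLOGY[parent]["is_a"])
    return seen


def is_a(concept, category):
    """True if `concept` is `category` or a descendant of it."""
    return concept == category or category in ancestors(concept)


def is_threat(concept):
    """True if the concept's own and inherited properties cover the threat rule."""
    properties = set(ONTOLOGY[concept]["properties"])
    for ancestor in ancestors(concept):
        properties |= ONTOLOGY[ancestor]["properties"]
    return THREAT_PROPERTIES <= properties


def find_images(detections, predicate):
    """Keep the images whose detected label satisfies `predicate`."""
    return [image for image, label in detections if predicate(label)]


# Pretend output of an image classifier: (image, label) pairs.
detections = [
    ("img_001.jpg", "cat"),
    ("img_002.jpg", "tiger"),
    ("img_003.jpg", "cat"),
]

# "Did you find any pictures of animals?"
animal_images = find_images(detections, lambda label: is_a(label, "animal"))

# "Alert me when you find any threats in the pictures."
threat_images = find_images(detections, is_threat)
```

The classifier only ever emits the labels “cat” and “tiger”; the answers about animals and threats come entirely from the relationships and properties recorded in the ontology.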
At this point, we can keep the wealth of incoming information from creating a poverty of attention or an overload of fluffy kitten pictures. Instead, the wealth of information will lead directly to a rapidly created wealth of actionable knowledge, thereby allowing us to quickly achieve decision superiority.