The next time you travel by train, try to take a closer look at the wheels. There is something very special going on that is often overlooked. Whether your train is an old subway car, a commuter train, or a high-speed train, you are looking at evidence of amazing connections. Of course, the train is connected to the rail via the wheels (I am not a train expert, so please no emails about the use of terminology here), but if you reflect on it further, the rails are connected to switches that branch out to different tracks. Those tracks connect to cities. Some of those cities have airports and seaports. That one wheel is connected to an entire transportation network. Looking inward, the wheel also represents connections. There are the various mechanical assemblies. Those assemblies connect to cars, which are coupled together. In a sense, all of the pieces of the train, as well as the people and cargo on it, are connected. For the duration of the journey, everything connected in that train is going from point A to point B. Trains provide an excellent metaphor for thinking about relationships in data, which not only extend understanding but also constrain it in important respects. Connectedness, or how one thing relates to another, is a property of data that is rising in importance in ways that both amaze us and scare us in the modern age.
First Order Relationships: What I see, touch, perceive.
The most obvious way of looking at relationships among disparate pieces of data is through direct linkage. Sometimes, this linkage is within a data set, for example when a cell in a spreadsheet contains a calculation that relates to other cells in the spreadsheet. Other obvious data relationships include foreign keys, where a reference in one data set allows data to be brought in from another data set. An example of a foreign key might be an employee ID stored in a payroll file, which can be used to bring in data from the training databases about recent training that the employee has completed.
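As a minimal sketch of the foreign-key idea, the Python below joins two small in-memory data sets on an employee ID. The record layouts, field names, and sample values are all invented for illustration; real payroll and training systems will look different.

```python
# Hypothetical payroll records; employee_id acts as the foreign key.
payroll = [
    {"employee_id": 101, "name": "A. Rivera", "salary": 52000},
    {"employee_id": 102, "name": "B. Chen", "salary": 61000},
]

# Hypothetical training records that carry the same employee_id.
training = [
    {"employee_id": 101, "course": "Safety Basics"},
    {"employee_id": 101, "course": "Data Privacy"},
    {"employee_id": 103, "course": "Forklift Operation"},
]


def courses_by_employee(training_rows):
    """Index training rows by employee_id so lookups are direct."""
    index = {}
    for row in training_rows:
        index.setdefault(row["employee_id"], []).append(row["course"])
    return index


def join_payroll_with_training(payroll_rows, training_rows):
    """Attach each employee's courses to the payroll record (a left join)."""
    index = courses_by_employee(training_rows)
    return [
        {**row, "courses": index.get(row["employee_id"], [])}
        for row in payroll_rows
    ]


joined = join_payroll_with_training(payroll, training)
for record in joined:
    print(record["name"], record["courses"])
```

Note that the training row for employee 103 has no matching payroll record, so it silently drops out of the result. Relationships that reach outside a data set deserve that kind of scrutiny.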
In our daily life, we are surrounded by first-order relationships. I was once driving on a toll road and went through an automatic toll booth. The screen to the left of the booth, which normally flashes a green light, displayed a “please call” message. I used my hands-free dialing and called the company, only to get a message telling me the offices were closed for the weekend.
The following Monday, I tried to call again, ready to inform them of the tolls I passed in case there was an issue. I was asked for my tag number. I offered my car license plate number and the agent told me he needed my tag number. When I asked what that was, he told me that everyone has a tag number (not helpful). Getting nowhere, I asked if there was anything else that I could provide, such as my name (which I knew at the time, but was beginning to forget in a moment of frustration). Apparently, I could also provide my account number and PIN, which were on my bill, which I didn’t have. Eventually, I learned that the tag number was on the plastic device on my windshield. I retrieved the number, which was 20 or so numbers and letters, and after reading this number to the agent I was “verified”. Relationship established, I started to explain the problem when I was interrupted by the agent. It seems there was a “well-known system outage” and I shouldn’t be concerned. Sometimes relationships only serve to slow down the process.
We have all had versions of the story above, where we need to conduct a simple piece of business but we are stopped by the systems put in place to ensure authentication and validation. Authentication and validation are important to establish two critical points: 1) you are who you claim to be and 2) you are authorized to do whatever you purport to do. Authentication and validation come into play in data relationships when one system or process attempts to share data or transactions with another.
There are many other types of data relationships, including dyadic (two-sided), many-to-many, one-to-many, and many-to-one. These simple relationships can be stored in the data itself, constructed by means of calculations or algorithms, or supplied through user interaction. All of these are what are sometimes referred to as first-order relationships: those which are observable in the data or the process that contains the data. These relationships can be static, dynamic, or implicit (derived through calculation or process), and they may be simple or complex. Among the more problematic are circular relationships, where a piece of data relates back to itself either directly or through a chain of references.
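Circular relationships can be checked for mechanically. The sketch below assumes a simplified model in which each item refers to at most one other item; the function name and the spreadsheet-style cell names are hypothetical, chosen only to illustrate following a chain of references until it either ends or loops back on itself.

```python
def find_cycle(references, start):
    """Follow references from start; return the loop if the chain circles back.

    references maps each item to the single item it refers to. Items that
    refer to nothing are simply absent from the mapping.
    """
    path = []       # items visited so far, in order
    position = {}   # item -> index in path
    current = start
    while current is not None:
        if current in position:
            # We have come back to an item already on the path.
            return path[position[current]:]
        position[current] = len(path)
        path.append(current)
        current = references.get(current)
    return None  # the chain ended without looping


# Hypothetical references: A1 depends on B1, B1 on C1, and C1 back on A1.
cell_refs = {"A1": "B1", "B1": "C1", "C1": "A1", "D1": "E1"}

print(find_cycle(cell_refs, "A1"))  # ['A1', 'B1', 'C1']
print(find_cycle(cell_refs, "D1"))  # None
```

A direct self-reference, such as a cell that refers to itself, is reported as a loop of length one.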
It is important to recognize the types of first-order relationships that exist in data, with special attention to those which are implicit and those which are external to a given data set (e.g. foreign keys).
Second Order Relationships: What I infer.
Very often, analysis stops with the first-order relationship. We count things, we do math, we draw visualizations, or we control processes with these relationships. But things get even more interesting when we look at second-order relationships, which are the relationships we infer from first-order relationships.
The first time many of us learn about second-order relationships is when we study science. For example, in physics, we can observe distance vs. time and understand speed. If we know how long it took me to get from one of those toll booths to the next one, and the distance between them, we can calculate my average speed. Let’s say the speed limit was 65 miles per hour and I covered 65 miles in one hour. One could argue that I never exceeded the speed limit. That is a first-order observation relating distance and time. The second-order observation would be to calculate the change in distance over time at each point along the way, which we call speed (the change in speed over time, in turn, is acceleration). It may be that I traveled the first half of that distance well over the speed limit and then I slowed down and traveled under the speed limit.
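The toll-booth example can be made concrete with a few lines of Python. The readings below are hypothetical and use one intermediate checkpoint at the half-hour mark; the point is only that an average over the whole trip can sit exactly at the limit while an individual segment is well above it.

```python
# Hypothetical readings: (hours since the first toll booth, miles traveled).
readings = [(0.0, 0.0), (0.5, 40.0), (1.0, 65.0)]

SPEED_LIMIT_MPH = 65.0


def average_speed(points):
    """Total distance divided by total time, using only the end points."""
    (t0, d0), (t1, d1) = points[0], points[-1]
    return (d1 - d0) / (t1 - t0)


def segment_speeds(points):
    """Speed over each consecutive pair of readings."""
    return [
        (d1 - d0) / (t1 - t0)
        for (t0, d0), (t1, d1) in zip(points, points[1:])
    ]


overall = average_speed(readings)
segments = segment_speeds(readings)

print(overall)                          # 65.0
print(segments)                         # [80.0, 50.0]
print(max(segments) > SPEED_LIMIT_MPH)  # True
```

With only the two toll-booth readings, the driver appears to have stayed at the limit; the extra checkpoint reveals a stretch at 80 miles per hour.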
In data relationships, these second-order relationships can be fascinating. Imagine you had a compilation of all of the advertisers listed in a brochure for a sales convention. You could look at those companies, their relative sizes, markets served, years in business, majority ownership, press releases, etc. All of these relationships would be different types of first-order observation.
Looking deeper, you might consider some second-order inference. For example, the companies are all advertising in the same space, so at some level they are all competing. They may have different size ads, allowing you to construct a histogram of who paid more or less. They are all advertising in the same event, which makes them a common customer of whomever they paid for the advertising. All of these are second-order observations. Following any of those threads leads to a host of additional relationships and inferences which can underlie powerful analytical decisions.
A word of caution is in order here concerning error and bias. When you make a first-order observation, there will always be a certain amount of error in your observation, and hence in your conclusion. Put simply, a certain percentage of your observations may be wrong, outdated, or incomplete. That error carries over into the decision you are making. When you make a second-order observation, there will generally be error in that observation as well for the same kinds of reasons (often along with other types of error related to calculation, sample bias, and other factors). The error from second-order observation is generally not additive to the first-order error, but multiplicative. In other words, if your first process was 90% accurate and your second process was 80% accurate, and the two sets of errors are independent, your end-state observation is now only 72% accurate (0.90 × 0.80 = 0.72)! This dirty little secret of second-order inference is responsible for many failed business endeavors.
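The arithmetic behind that figure is simple enough to sketch. The snippet below assumes each step's errors are independent of the others, which is a simplification; the function name and the third step are hypothetical, added only to show how quickly accuracy erodes as inferences are stacked.

```python
import math


def combined_accuracy(step_accuracies):
    """Multiply per-step accuracies, assuming the steps err independently."""
    return math.prod(step_accuracies)


# The example from the text: a 90% step followed by an 80% step.
two_steps = combined_accuracy([0.90, 0.80])

# Hypothetical: one more 80% step stacked on top.
three_steps = combined_accuracy([0.90, 0.80, 0.80])

print(f"{two_steps:.0%}")    # 72%
print(f"{three_steps:.1%}")  # 57.6%
```

If errors in the steps are correlated, the combined figure can be better or worse than the simple product, so this is best read as a rough guide rather than a guarantee.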
It is extremely important to remember that anything you do in terms of second-order inference, while extremely enlightening in entirely new ways, can be masking increasing dependence on process integrity and data accuracy.
Higher-order Observation: What it portends.
In some ways, this is a never-ending story. For example, you can certainly have third-order observation. Using our car analogy again, we could look at distance versus time (first order), speed vs. time (second order), or, if we wanted to go further, we might examine acceleration vs. time. That would be a third-order observation.
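One way to picture the progression from first to third order is to take differences repeatedly. The sketch below assumes evenly spaced, hypothetical distance readings every quarter of an hour; applying the same differencing step once gives speed, and applying it again gives acceleration.

```python
def differences(values, dt):
    """Rate of change between consecutive, evenly spaced samples."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


DT_HOURS = 0.25  # hypothetical sampling interval

# First order: cumulative miles traveled at each sample (hypothetical).
distance_miles = [0.0, 10.0, 30.0, 50.0, 65.0]

# Second order: speed, derived from distance.
speed_mph = differences(distance_miles, DT_HOURS)

# Third order: acceleration, derived from speed.
accel_mph_per_hour = differences(speed_mph, DT_HOURS)

print(speed_mph)           # [40.0, 80.0, 80.0, 60.0]
print(accel_mph_per_hour)  # [160.0, 0.0, -80.0]
```

Each pass yields one fewer data point than the series it was derived from, and any error in the original readings is carried into every series built on top of them.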
Imagine that in my scenario I left the toll booth in a hurry: I stomped on the gas, intending to accelerate to 100 miles per hour (don’t do this!), but before even reaching the speed limit of 65 miles per hour, I saw a speed trap. One reaction might be to quickly “slow down,” which is a form of negative acceleration. Some days later, in court, I could try to argue that I was never actually speeding, so I shouldn’t get a ticket. Having studied both physics and third-order observation, the judge would quickly dismiss my argument under the umbrella of reckless driving, having all of the data to prove that my acceleration vs. time, a third-order observation, was well outside the bounds of acceptable driving. (Please, no comments on the law here; I’m not a lawyer either!)
Higher-order observations in data are often about trends, such as where a market is going, whitespace analysis, macroeconomics (e.g. GDP) or geopolitical concerns (e.g. cybercrime). These sorts of observations are extremely important, and are generally based not only on the data and the discoverable relationships, but also on guiding principles or other systemic knowledge (such as correcting figures for seasonal variation). Of course, these higher-order observations are the kinds of relationships that often drive the biggest decisions in terms of organizational or societal impact, so we are well-advised not only to recognize when we are relying on them, but also to take such observations very carefully and with as much empirical rigor as possible.
As we move to higher-order observation, we move to an increasingly systemic understanding of relationships. This transition brings with it both enormous potential and an enormous demand for appropriate analytical rigor.