The needs which businesses have for versioned history are many and varied. However, those needs and the data patterns which satisfy them seem to fall quite nicely along a linear progression from simple to complex. We will call the patterns that satisfy each set of business needs "version patterns" because, in the terms organized by the taxonomy in Part 1 of this series, we are concerned with queryable history of the states of persistent objects, kept as versions of those objects.
However, our concern will not be exclusively with past and present versions. The more general topic is maintaining and querying tables which contain noncurrent as well as current versions of their persistent objects. And there are other kinds of noncurrent versions than historical ones. We will also consider versions for error correction, versions for future states, versions which overlap in time, coordinated (pending/sandbox) updates and even versions for hypothetical states. 数据挖掘研究院
Version Patterns: Manner of Presentation
In this series of articles, we will discuss a dozen version patterns. Each pattern will be introduced by describing the business requirements it supports. 数据挖掘研究院
Next, we will show how these version patterns work by presenting several scenarios and timelines for each pattern. Each scenario will show the results of doing an insert, update or delete against a table constructed according to the pattern. Each timeline will show what would be returned by a query against the past, present or future state of the object whose versions are being used as the example. 数据挖掘研究院
Initially, these transactions will be simple ones that physically affect just one row in one table. But gradually, more semantically expressive scenarios will be considered, ones in which what is seen as one transaction to the issuing code or person will result in several rows being affected in possibly several tables. 数据挖掘研究院
Following the scenarios for each history pattern, we will present and comment on a list of semantic constraints that must be followed for the pattern to work. Most of these constraints can be enforced only by triggers, stored procedures or application code. It is an indication of how "special" and complex temporal state management is that the normal DBMS entity and referential integrity mechanisms can do very little to enforce the semantics of these patterns.
Finally, each pattern will conclude with a list of objections to it that we have encountered. Some of these objections are valid in the sense that if the different business requirements for the pattern were weighted a little differently, a different pattern or at least a different physical implementation than that suggested by the pattern would be appropriate. Other objections are not valid, either because the pattern was not fully understood by those criticizing it, or because the flaws in the alternative pattern that the critics preferred were not fully understood. 数据挖掘研究院
In this series of articles, it will not be possible to discuss all the objections to these patterns that we are aware of. But we can, and will, at least note the existence of the objections we are aware of.
Following is an annotated list of the version patterns which we will discuss in these articles. 数据挖掘研究院
Figure 1: Annotated List of Version Patterns 数据挖掘研究院
Figure 2 shows the template we will use to present scenarios.
Figure 2: The Scenario Template 数据挖掘研究院
The numbered boxes in the template will contain the following information:
- Name of the scenario.
- Table used for this scenario.
- Definition of the table.
- Description of the scenario.
- Now (whatever date that may be).
- Column headings for the table.
- One or more rows of the table.
Basically, boxes 1 through 5 contain metadata that is necessary to understand the scenario. Boxes 7 are the column headings for the table under discussion. There are one or more rows indicated as boxes 8, and each one represents a row in a relational table. 数据挖掘研究院
Timelines
Except for the first two patterns, we will also use a timeline diagram to explain what is going on. Figure 3 illustrates this diagram. It will be used beginning with Version Pattern 3, which is the first pattern that goes beyond a simple update in place approach to history.
Figure 3: The Timeline Diagram
Health Care Insurance: A Business Example for Version Patterns
Health care insurance is not the only industry which has a need for real-time history. But the insurance industry in general is certainly one in which real-time history is very important. Two paradigmatic insurance situations make the point: processing a claim and answering a customer telephone inquiry. 数据挖掘研究院
As for the first paradigmatic situation, a claim is filed for an event which happened at some point in the past. The claimant has a policy with the insurance company, to which she sends the claim. But the policy which is relevant to the claim is not her policy as it is when the claim is received. Rather, it is her policy as it was when the claimable event occurred.
In property insurance, the claimable event is damage to property. In health care insurance, the claimable event is a service event, an event in which a doctor, hospital or other health care professional or organization provided medical services to the claimant.
Processing a claim involves comparing the claim to a past state of the policy against which the claim is being made. That past state may happen to be identical to the current state of the policy. But the point is that it equally well may not. And, in point of fact, it often is different. So processing a claim requires real-time access to historical data.
There is a customer-facing aspect to real-time claims data, also. And because it is customer-facing, it represents not just a need which a processing system has, but a product differentiator for the insurance company. This is our second paradigmatic situation.
These events occur when a customer calls the insurance company and asks for an explanation of benefits. Usually, what they are concerned with are benefits not provided, i.e., claims not paid or claims that were reimbursed for less than the policyholder thought they should be. 数据挖掘研究院
The customer service representative (CSR) needs online, real-time history to answer the customer's questions during that phone call, there and then. He needs complete information on the caller's policy, as well as the date the service was provided for which the claim the caller is concerned with was made. The need for real-time access to policy history exists because it is important to answer the caller's questions while the caller is on the phone. Companies which cannot do that are at a serious competitive disadvantage.
Other industries will have their own paradigmatic situations illustrating their own needs for real-time access to history. But claims processing is the operational heart of insurance, and customer satisfaction is the operational soul of any business. 数据挖掘研究院
So: historical data is already being managed in insurance company databases. Our intent, in this series of articles, is to present what we believe is a better way of satisfying this need than the various ad hoc solutions we have seen over the past few decades. And better has the bottom-line meaning it always has had: better in terms of revenue enhancement, cost containment and customer service. 数据挖掘实验室
Two additional points will serve to complete our introduction to this topic. The first point illustrates the increasing importance of versioned history by considering the breakdown of a traditional alignment of current versus historical with operational versus analytic. The second point is that dates and timestamps (of varying granularity) are just different ways of representing the tick of a logical clock that provides reference times for all versioned history. 数据挖掘研究院
Operational Data and Analytical Data
Another way of realizing the significance of this topic is to understand it in terms of the breakdown of a traditional alignment. Traditionally, the distinction between operational data and analytical data has lined up with the distinction between current data and historical data. In talking about the increasing need for real-time access to historical data, we are emphasizing that this alignment is breaking down. 数据挖掘研究院
On the one hand, data marts have traditionally been considered part of the analytical environment, not part of the operational environment. Data marts are and always have been a means of discovering trending over time patterns. But near real-time data is increasingly being required in data marts. The reason is that businesses are finding that they can make better analytical decisions if the relevant data reaches right up to the present moment. 数据挖掘研究院
This is a particularly acute need for large-scale inventory management systems of consumer goods, especially during the holiday season. The analytics needed from such systems do not directly concern on-hand, on-order, sales and returns volumes. Rather, the single critical concern is with velocities. The critical question isn't, "How much did we sell last week?" or even "How much do we have on-hand and on-order right now?" The critical question is, "Is our on-hand and our pipeline adequate to keep product in stock, given the rate at which it is being sold?" Since sales volumes for much consumer merchandise peaks during the holiday season, the velocity that decision-makers are concerned with, at that time of year, isn't one based on rates of change reaching back six months. It's one based on rates of change reaching back six days! So this is a situation in which current data is needed for analytical decision-making.
If you are thinking, at this point, that we have just described a need for operational data, not for analytical data, then you are illustrating the confusion caused when an alignment (operational = current, analytical = historical) breaks down. What we have just described is a need for near real-time data. If we put the currency of data aside, however, what distinguishes operational from analytical data? 数据挖掘研究院
One answer might be that analytical data is used by upper-level management, while operational data is used by lower-level management. While this happens to be true, it isn't very informative. What we need to know is what upper-level management is doing with its data that is different from what lower-level management is doing with its data.
The answer is that analytical data attempts to predict the future, while operational data attempts to optimize the present. Analytical data uses the past to establish curves, one of whose axes is time. The more accurate those curves are, the better an indicator they will be of future trends. For the best basis on which to guess what the future will be like is a sufficiently long and accurate timeline of what has happened in the past. Just project the curve into the future.
On the one hand, as our inventory management example illustrated, real-time data is needed for analytical decision-making. Conversely, operational data, whether in OLTP systems or in operational data stores (ODSs), increasingly includes relevant historical data, and operational decision-making increasingly requires access to that data that is just as real-time as is access to current data.
Again, in the insurance industry, the past state of insurance policies is the relevant example. The claim is current data. The relevant state of the policy is historical data. So this is a situation in which historical data is needed for operational decision-making. The old "operational = current, analytical = historical" alignment is breaking down. Historical data is no longer "second-class" data. 数据挖掘实验室
Every Tick of the Clock
Throughout these articles, we will be talking about dates, not timestamps. This is for the sake of simplicity; nothing in these patterns would be disturbed if we were to talk about timestamps instead. However, this simplification means that the granularity at which time will appear in these examples is the granularity of a single day. For example, we will talk about a version ending one day prior to the start of the next version.
More generally, a version ends one tick of the clock before the start of the next version (assuming, as we do for now, that there can be no gaps between versions of the same thing being versioned). That is one tick of whatever "clock" is being used. With dates, every clock tick is one day. However, our discussions apply equally to situations in which timestamps are used instead of dates. With timestamps, the clock ticks once for every change in timestamp value, at the granularity at which the timestamps are recorded. Thus, for example, timestamps can record a clock that ticks in hours, minutes, seconds, milliseconds, etc. 数据挖掘研究院
Using a clock for which each tick is one day isn't at all unrealistic. It supports a business requirement, for example, to load a data warehouse or ODS once a day, with whatever the source data is like at that point in time. For many warehousing situations, a clock that ticks once a day is fine-grained enough to meet all business requirements for the warehouse. 数据挖掘研究院
Note: We will not include century in the dates used in these examples. Instead, we will specify them as "mm/dd/yy." This is to keep the columns as narrow as possible, to fit entire rows across a page.
However, there is one "special date" value that will require a century. It's the value "12/31/9999." Obviously, "12/31/99" cannot be reserved for a special meaning, since it is a date that is likely to appear as real business data. But "12/31/9999" is available to carry a special meaning, on the assumption that no business data will occur that needs to specify that exact day nearly eight-thousand years from now. That special meaning has some of the semantics of "end date unknown," and also some of the semantics of "end of time." But we think that the most accurately descriptive name for this value is "until next changed or deleted." These semantics will be examined more thoroughly in a later installment in this series. The differences are subtle; and sometimes, in the real world of business IT, they are even relevant. 数据挖掘研究院
With this special meaning, "12/31/9999" is not a date; that's why we refer to it as a date value, i.e., a value that appears in a date data type but which does not carry the semantics of a date. (If this appears awkward to you, it is. We need the SQL standards committees, and the vendors, to provide something better for us. But until they do, this is what we have to work with.) 数据挖掘实验室

