In Our Experts' Opinions: The Altep Blog

With more than 20 years' success in complex eDiscovery management, data forensics, compliance, and investigations, and a team of more than 200 experts throughout the US and in Europe, Altep offers a uniquely valuable perspective. Each month, our blog features a different expert, and offers analysis and commentary on a broad spectrum of topics from data management to cyber security. We hope you find our posts informative; if you'd like to submit a guest post, please feel free to contact us!

What is Continuous Active Learning (CAL), Really? – Part One

Ever since the March 2, 2015 Rio Tinto opinion and order, there has been a lot of buzz in eDiscovery around the phrase “Continuous Active Learning” (CAL). Judge Peck briefly mentioned CAL while summarizing the available case law around seed-set sharing and transparency. For the sake of clarity, the term seed-set in this post refers to the initial group of training documents used to kick off a Technology Assisted Review (TAR) project. We refer to the review sets that follow as training sets. The point of Judge Peck’s mention of CAL, as I understood it, was to alert readers to the possibility that seed-set selection and disclosure disputes may become much less necessary as TAR tools and protocols continue to evolve.

Judge Peck pointed to recent research and a law review article by Maura Grossman and Gordon Cormack to support that notion. Those works made two important points about seed-set documents. First, they asserted that the selection and coding of seed-set documents is less likely to define the ultimate success of TAR projects employing a true CAL protocol. The general theory there is that the influence of misclassified seed documents is fleeting, since the classifier used to identify successive training set documents is recreated after each round, rather than simply revised or refitted. Second, they argued that seed-set transparency is not the guaranteed path to TAR project completeness, since neither the producing nor receiving party has a true understanding of the breadth of the concepts / information types in a collection.

The fact that Judge Peck cited the work of Grossman and Cormack as the basis for his statement is important, because the definition of CAL asserted in those publications is different from what the makers of many TAR tools would offer – even those that claim to be CAL-capable. The definition in those publications, in layman’s terms, comes down to one key differentiator – sampling methodology. Grossman and Cormack advocate for what they call “Relevance Feedback” as the preferred sampling method for selecting the training set documents that continue to teach the system.

In short, Relevance Feedback is a sampling approach that, when selecting training sets, focuses on the documents the system is most confident are relevant. This approach reportedly speeds up the training process, reduces training set volumes, and minimizes the impact of earlier misclassified seed documents. Grossman and Cormack go even further in their published works to say that only systems using Relevance Feedback are true CAL systems. This sampling approach, however, is not widely used in eDiscovery TAR tools, nor do experts agree that it is the best approach, or that Relevance Feedback sampling is a minimum qualification of CAL-capable systems.

A fair number of the advanced TAR tools currently on the market rely on a different sampling methodology – specifically, uncertainty sampling. Each tool can and will likely have its own spin on exactly how training sets are selected, but the main premise is that, when selecting documents, the system focuses most on those whose relevance it is least certain about. One reason this sampling method is appealing is that it is thought to be more successful at uncovering pockets of conceptual data which might otherwise have been missed if the focus were only on those items the system is confident are relevant. Notably, the makers of some TAR tools that utilize uncertainty sampling also classify their wares as CAL-capable, while Grossman and Cormack would classify them as Simple Active Learning (SAL), or something less.
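To make the contrast between the two sampling approaches concrete, here is a minimal Python sketch. It assumes each unreviewed document already has a relevance score between 0 and 1 from some classifier; the function names, document IDs, and scores are hypothetical and are not taken from any particular TAR tool or from Grossman and Cormack's work.

```python
def relevance_feedback_sample(scores, k):
    """Pick the k documents the model scores as most likely relevant."""
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ranked[:k]


def uncertainty_sample(scores, k):
    """Pick the k documents whose scores sit closest to 0.5 (least certain)."""
    ranked = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return ranked[:k]


# Hypothetical relevance scores for five unreviewed documents.
scores = {
    "doc_a": 0.97,
    "doc_b": 0.55,
    "doc_c": 0.08,
    "doc_d": 0.48,
    "doc_e": 0.88,
}

print(relevance_feedback_sample(scores, 2))  # ['doc_a', 'doc_e']
print(uncertainty_sample(scores, 2))         # ['doc_d', 'doc_b']
```

With the same scores, the two strategies send entirely different documents to the reviewers: one batch is drawn from the top of the ranking, the other from the middle. Real tools will likely layer their own refinements on top of either idea.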

So, who is right? What makes one system CAL-capable and another not? One helpful distinction to make as we try to answer such questions is the difference between Active Learning (AL) and Passive Learning (PL). Passive Learning TAR tools are those where the selection of training documents is driven by the user. The machine will pull a random or stratified sample of training documents for review, but the selection of those documents is not guided by anything more than the fact that they have not yet been reviewed, and possibly that they are conceptual cluster center documents. Active Learning TAR, whether focusing on certainty or uncertainty, involves machine-selected training documents. The documents that are chosen are intended to confirm the system’s understanding of relevance, or to further expand that understanding. Based on these facts, the difference between PL and AL is pretty clear.
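The PL/AL distinction can be sketched in a few lines of Python. This is an illustration only, under simplifying assumptions: the function names and document IDs are hypothetical, the passive selector uses a plain random sample (a stratified sample would work the same way for this purpose), and the active selector uses "highest score first" as just one of several possible selection rules.

```python
import random


def passive_training_set(unreviewed_ids, k, seed=0):
    """Passive: a random draw that ignores what the model has learned so far."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(sorted(unreviewed_ids), k)


def active_training_set(scores, reviewed_ids, k):
    """Active: the model's current scores decide which documents come next."""
    candidates = [doc_id for doc_id in scores if doc_id not in reviewed_ids]
    candidates.sort(key=lambda doc_id: scores[doc_id], reverse=True)
    return candidates[:k]


# Hypothetical model scores; doc_2 has already been reviewed.
scores = {"doc_1": 0.20, "doc_2": 0.90, "doc_3": 0.60, "doc_4": 0.75}
reviewed = {"doc_2"}
unreviewed = set(scores) - reviewed

print(passive_training_set(unreviewed, 2))           # some random pair
print(active_training_set(scores, reviewed, 2))      # ['doc_4', 'doc_3']
```

The only input the passive selector needs is the list of unreviewed documents, while the active selector cannot run without the model's scores. That dependency is the essence of the difference.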

Things get quite a bit less clear when we attempt to differentiate between SAL and CAL, and it looks like there is plenty of room for debate about the minimum qualification for CAL technology. Again, who is right here, and what does that mean for the rest of us looking to use the best available tools? In next week’s post, I’ll talk more about the purported differences between CAL and SAL, and take a closer look at some of the available training document sampling methodologies to try to make more sense of the minimum requirements a tool should meet in order to be properly classified as “CAL.”
