Deploying data mining in cross-border investigative journalism

Over the past few years we have seen the huge potential of data and document mining in investigative journalism. Tech-savvy networks of journalists such as the Organized Crime and Corruption Reporting Project (OCCRP) and the International Consortium of Investigative Journalists (ICIJ) have teamed up for astounding cross-border investigations, such as OCCRP’s work on money laundering or ICIJ’s offshore leak projects. OCCRP has even incubated its own tools, such as VIS, Investigative Dashboard and Overview.

But we need to do better. There is enormous duplication and missed opportunity in investigative journalism software. Many small grants for technology development have led to many new tools, but very few have become widely used. For example, there are now over 70 tools just for social network analysis. There are other tools for other types of analysis, document handling, data cleaning, and on and on. Most of these are open source, and in various states of completeness, usability, and adoption. Developer teams lack critical capacities such as usability testing, agile processes, and business development for sustainability. Many of these tools are beautiful solutions in search of a problem.

The fragmentation of software development for investigative journalism has consequences. Most newsrooms still lack capacity for very basic knowledge management tasks, such as digitally filing new documents where they can be searched and found later. Tools do not work or do not interoperate. Ultimately the reporting work is slower, or more expensive, or doesn't get done. Meanwhile, the commercial software world has so far ignored investigative journalism because it is a small, specialized user base. Tools like Nuix and Palantir are expensive, not networked, and not extensible for the inevitable story-specific needs.

But investigative journalists have learned how to work in cross-border networks, and investigative journalism developers can too. The experience gained from collaborative data-driven journalism has led OCCRP and other interested organizations to focus on the following issues:

Usability. We can no longer afford to build software that no one wants. Most investigative reporters still don’t have support for basic tasks such as filing new information in a shared digital repository, reviewing documents and making notes, or searching for a list of company names. To get more journalism done faster we need to understand and support these core workflows, with frequent user visits and testing. Advanced features can only succeed on top of such an infrastructural base.

Delivery. We need to think of ourselves first as systems integrators, not developers, and focus on packaging existing platforms together in highly usable ways for non-technical end users. In this way we gain the experience that tells us what new code needs to be written. Experience has already taught us that we need to support both a centralized website (because it vastly lowers the barriers to use) and independently deployable servers (which many users need for security reasons).

Networked investigation. Reporters need to know if other organizations have information on specific people and companies, which requires a federated search mechanism. If there’s a hit, then the reporter can negotiate to see the original material. This two-step process has come to be known as the "who’s got dirt?" model and has achieved broad consensus in the cross-border investigation community.

Sustainability. Who's going to pay for all of this after donors move on? We believe in covering at least the marginal costs from the outset, e.g. software-as-a-service pricing. This will not immediately fund ongoing development costs, but the opportunity to learn what people will pay for is essential to developing new markets. This has been ignored for far too long.

Interoperability and extensibility. The Influence Mappers project is defining consensus standards for social-network-type structured data and we should support them. Overview has demonstrated the enormous project-specific value in an extensible analysis API. And the software itself should be open source to enable collaboration and prevent monopolies and vendor lock-in.
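To make the two-step "who's got dirt?" idea from the Networked investigation point concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any existing system: the `Newsroom` class, the `has_dirt` and `federated_search` names, the contact addresses, and the simple name normalization are all hypothetical. The point it shows is that step one returns only which organizations have a hit, never the underlying material; step two, the negotiation, happens between people.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


@dataclass
class Newsroom:
    """Hypothetical participating organization holding a private entity index."""
    name: str
    contact: str
    entities: Set[str] = field(default_factory=set, repr=False)

    def index(self, entity: str) -> None:
        """Record that this newsroom holds material on the given person or company."""
        self.entities.add(normalize(entity))

    def has_dirt(self, entity: str) -> bool:
        # Step one: answer only yes or no. The documents themselves never leave.
        return normalize(entity) in self.entities


def federated_search(entity: str, newsrooms: List[Newsroom]) -> List[Tuple[str, str]]:
    """Ask every newsroom about an entity and return (name, contact) for each hit."""
    return [(n.name, n.contact) for n in newsrooms if n.has_dirt(entity)]


# Illustrative data only; the organizations and addresses are made up.
nodes = [
    Newsroom("Org A", "desk@a.example"),
    Newsroom("Org B", "desk@b.example"),
]
nodes[0].index("Acme Holdings Ltd")

hits = federated_search("  acme holdings LTD ", nodes)
print(hits)
```

A real deployment would need much more than this sketch covers, such as fuzzy matching of names across languages and protections against a partner enumerating another newsroom's index, but the hit-then-negotiate shape stays the same.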

What we are advocating is a federated information architecture for investigative journalism, something that people have been talking about for a long time. Two things have vastly improved our prospects for success. First, a critical mass of developers and users are now talking. Second, successes with existing systems have helped to define and scope the project. We have produced useful components and demonstrated interoperability strategies.

The Influence Mappers mailing list has managed to consolidate everyone with an interest in journalistic analysis of social networks and is working to define interchange standards. OpenCorporates continues to grow as a master repository of company registration. Investigative Dashboard has established itself as a valuable research service for the European journalism community and is attacking the problem of data warehousing. DocumentCloud has succeeded wonderfully as a document repository and publication platform. Overview has demonstrated how to do extensible analytics on large document sets, with its visualization plugin API. And the international journalism community as a whole has gained experience in cross-border collaborations, producing wide consensus on the value of the "who's got dirt?" model of federated search.

Much work remains to be done in terms of the usability of existing software, collaboration between development teams, sustainability planning, and so on. But the common goals, listed above, are an important start. We are not aiming for the moon, but for a fairly well-defined and previously tested set of critical features.

The next step for us is a small meeting: the very first conference on Knowledge Management in Investigative Journalism. This event will bring together key developers and journalists to refine the problem definition and plan a way forward. OCCRP and the Influence Mappers project have already pledged support. Stay tuned...

But we are already talking. A draft of this post was circulated among OCCRP, the Global Investigative Journalism Network, the ICIJ, Overview, DocumentCloud, Global Witness, and OpenCorporates, who all agreed that the problems we’ve identified are real and need to be solved. We have agreed to engage in a discussion of needs and solutions. Let us know if you are interested too.

- Jonathan Stray – Overview

- Drew Sullivan – OCCRP