How much content can you actually re-use when you move to single sourcing?

One of the challenges when considering moving to a single sourcing authoring environment, such as DITA, is determining the Return on Investment. This often boils down to a key question: how much content can you actually re-use?

Organisations typically attempt to answer this question in a number of ways:

  • Conducting a semi-manual information audit of the existing content to identify the number of times the same chunks of information are repeated. Unfortunately, this can be a large and lengthy exercise.
  • If the content is translated, getting reports from Translation Memory tools indicating where content might be repeated. Unfortunately, if you’re not translating your content, you won’t have this information.
  • Using benchmark industry measures. Unfortunately, these can vary enormously (from 25% to 75% content re-use), and your situation may be totally different.

In an ideal world, you’d be able to use an application that could look at all your content and give you a report telling you where content is repeated. It could do the “heavy lifting” in the information audit automatically for you. This programmatic analysis of reuse within existing content, at an affordable cost, is now starting to become possible.

A great deal of research has been carried out, in the fields of computational linguistics and probability, into using n-grams to find common content. An n-gram is a contiguous sequence of letters or words in a piece of text.
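As a rough illustration, word n-grams (sometimes called “shingles”) can be intersected to find verbatim overlap between two pieces of text. This is a minimal sketch with invented sample sentences, not a production tool:

```python
# Minimal sketch: extract word 5-grams ("shingles") from two short documents
# and count how many they share. The sample texts and the n-gram size are
# illustrative assumptions.
def word_ngrams(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

doc_a = "to save the file click the save button in the toolbar"
doc_b = "first click the save button in the toolbar to confirm"

shared = word_ngrams(doc_a) & word_ngrams(doc_b)
print(len(shared))  # → 3 five-word sequences common to both documents
```

Scaling this naive set intersection to a whole documentation library is exactly where the in-memory cost mentioned below starts to bite.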

Unfortunately, this approach has been impracticable for real-world, large volumes of content: it has required too much in-memory computation. It’s also been complicated by other factors; according to Paul D. Clough:

In most cases, text is reused under certain constraints that cause the text to be rewritten (e.g. time, space, change of tense etc.). Accurately measuring text reuse therefore involves identifying not only verbatim text, but also text that has undergone a number of, potentially complex, rewriting transformations.
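One common way to tolerate such rewrites is to compare sets of overlapping character n-grams with a similarity measure such as the Jaccard index, so that lightly edited text still scores high against its source. A minimal sketch, where the sample sentences, the n-gram size, and the idea of applying Jaccard similarity here are illustrative assumptions rather than a specific tool’s method:

```python
# Minimal sketch: Jaccard similarity over character 8-grams. Small rewrites
# (a changed tense, a reordered phrase) leave most character n-grams intact,
# so near-duplicates still score well above unrelated text.
def char_ngrams(text, n=8):
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    sa, sb = char_ngrams(a), char_ngrams(b)
    return len(sa & sb) / len(sa | sb)

original = "Press the power button to switch the device on."
rewritten = "Pressing the power button switches the device on."
unrelated = "Completely unrelated sentence about something else."

print(jaccard(original, rewritten))  # well above the score for unrelated text
print(jaccard(original, unrelated))
```

In practice, big-data techniques such as MinHash are used to approximate this comparison without holding every n-gram set in memory at once.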

The good news is that, with recent developments in the field of big data, it’s becoming easier to tackle the kind of large-scale, unstructured data analysis this requires.

Proprietary solution providers are also responding. For example, DCL has announced Harmonizer On Demand, an online portal to its Harmonizer proprietary “content redundancy solution”.

Although it’s hard to determine how quickly they will appear, it seems likely we will see considerable developments in the technology we can use for identifying and quantifying common re-usable content. As a result, it will be a lot easier for organisations to measure the benefits they will get by moving to a single sourcing authoring environment.

Thank you Mari-Louise. The problem with those metrics is they rely on the content being in a repository. You have to have committed to a pilot.

Larry Kunz

Fortuitously, the percentage of reused content will go up as it becomes easier to find and identify reusable content. In other words, the effort taken to measure the benefit of reuse will itself increase the benefit. Nice.

John Tait

Oxygen has a wonderful DITA Map Metrics Report transformation, which produces a report like this:

Content reuse
Total reused words (words in conref content) nnn. Content reuse percentage (words) is nn.nn%.

Total reused elements (elements in conref content) nn. Elements reuse percentage is nn.nn%.

Total content reference elements nn.

Reused elements: [list of elements and count]

You can also select “Search references” on an element that’s been referred to by a conref, to see all the places where it has been used. (I’ve managed to crash Oxygen a couple of times with it, though.)

You _don’t_ need a repository at all. DITA is pretty much a plain text database and CMS _all by itself_.
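For instance, because conref targets are just attributes in plain XML files, a few lines of standard-library code can count how often each one is pulled in, with no repository involved. A minimal sketch using an invented inline topic (conref itself is a standard DITA attribute):

```python
# Minimal sketch: count conref references in a DITA topic using only the
# Python standard library. The inline topic is an illustrative example.
import xml.etree.ElementTree as ET
from collections import Counter

topic_xml = """<topic id="install">
  <body>
    <p conref="warnings.dita#warnings/power"/>
    <p>Local content, authored in place.</p>
    <p conref="warnings.dita#warnings/power"/>
  </body>
</topic>"""

targets = Counter()
for elem in ET.fromstring(topic_xml).iter():
    conref = elem.get("conref")
    if conref:
        targets[conref] += 1

for target, count in targets.most_common():
    print(count, target)  # each conref target and how often it is reused
```

Pointing the same loop at a folder of .dita files would give a whole-library reuse count, again without any CMS.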

Content reuse isn’t necessarily good if all you have is a big tangle. It does offer a much-needed way out of the commonly used cut-and-paste approach.
