How much content can you actually re-use when you move to single sourcing?

One of the challenges when considering moving to a single sourcing authoring environment, such as DITA, is determining the Return on Investment. This often boils down to a key question: how much content can you actually re-use?

Organisations typically attempt to answer this question in a number of ways:

  • Conducting a semi-manual information audit of the existing content to identify the number of times the same chunks of information are repeated. Unfortunately, this can be a large and lengthy exercise.
  • If the content is translated, getting reports from Translation Memory tools indicating where content might be repeated. Unfortunately, if you’re not translating your content, you won’t have this information.
  • Using benchmark industry measures. Unfortunately, these can vary enormously (from 25% to 75% content re-use), and your situation may be totally different.

In an ideal world, you’d be able to use an application that could look at all your content and give you a report telling you where content is repeated. It could do the “heavy lifting” in the information audit automatically for you. This programmatic analysis of reuse within existing content, at an affordable cost, is now starting to become possible.

A great deal of research has been carried out, in the fields of computational linguistics and probability, into using n-grams to find common content. An n-gram is a contiguous sequence of letters or words in a piece of text.
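As a rough illustration, word n-grams (sometimes called “shingles”) can be intersected to find verbatim overlap between two pieces of text. This is a minimal sketch with invented sample sentences, not a production tool:

```python
# Minimal sketch: extract word 5-grams ("shingles") from two short documents
# and count how many they share. The sample texts and the n-gram size are
# illustrative assumptions.
def word_ngrams(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

doc_a = "to save the file click the save button in the toolbar"
doc_b = "first click the save button in the toolbar to confirm"

shared = word_ngrams(doc_a) & word_ngrams(doc_b)
print(len(shared))  # → 3 five-word sequences common to both documents
```

Scaling this naive set intersection to a whole documentation library is exactly where the in-memory cost mentioned below starts to bite.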

Unfortunately, this approach has been impracticable for real-world, large volumes of content: it has required too much in-memory computation. It’s also been complicated by other factors; according to Paul D. Clough:

In most cases, text is reused under certain constraints that cause the text to be rewritten (e.g. time, space, change of tense etc.). Accurately measuring text reuse therefore involves identifying not only verbatim text, but also text that has undergone a number of, potentially complex, rewriting transformations.
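One common way to tolerate such rewrites is to compare sets of overlapping character n-grams with a similarity measure such as the Jaccard index, so that lightly edited text still scores high against its source. A minimal sketch, where the sample sentences, the n-gram size, and the idea of applying Jaccard similarity here are illustrative assumptions rather than a specific tool’s method:

```python
# Minimal sketch: Jaccard similarity over character 8-grams. Small rewrites
# (a changed tense, a reordered phrase) leave most character n-grams intact,
# so near-duplicates still score well above unrelated text.
def char_ngrams(text, n=8):
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    sa, sb = char_ngrams(a), char_ngrams(b)
    return len(sa & sb) / len(sa | sb)

original = "Press the power button to switch the device on."
rewritten = "Pressing the power button switches the device on."
unrelated = "Completely unrelated sentence about something else."

print(jaccard(original, rewritten))  # well above the score for unrelated text
print(jaccard(original, unrelated))
```

In practice, big-data techniques such as MinHash are used to approximate this comparison without holding every n-gram set in memory at once.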

The good news is that, with recent developments in the field of big data, it’s becoming easier to tackle the kind of large-scale, unstructured data analysis this requires.

Proprietary solution providers are also responding. For example, DCL has announced Harmonizer On Demand, an online portal to its Harmonizer proprietary “content redundancy solution”.

Although it’s hard to determine how quickly they will appear, it seems likely we will see considerable developments in the technology we can use for identifying and quantifying common re-usable content. As a result, it will be a lot easier for organisations to measure the benefits they will get by moving to a single sourcing authoring environment.

Thank you Mari-Louise. The problem with those metrics is they rely on the content being in a repository. You have to have committed to a pilot.

Larry Kunz

Fortuitously, the percentage of reused content will go up as it becomes easier to find and identify reusable content. In other words, the effort taken to measure the benefit of reuse will itself increase the benefit. Nice.

John Tait

Oxygen has a wonderful DITA Map Metrics Report transformation, which produces a report like this:

Content reuse
Total reused words (words in conref content) nnn. Content reuse percentage (words) is nn.nn%.

Total reused elements (elements in conref content) nn. Elements reuse percentage is nn.nn%.

Total content reference elements nn.

Reused elements: [list of elements and count]

You can also select “Search references” on an element that’s been referred to by a conref, to see all the places where it has been used. (I’ve managed to crash Oxygen a couple of times with it, though.)

You _don’t_ need a repository at all. DITA is pretty much a plain text database and CMS _all by itself_.
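For instance, because conref targets are just attributes in plain XML files, a few lines of standard-library code can count how often each one is pulled in, with no repository involved. A minimal sketch using an invented inline topic (conref itself is a standard DITA attribute):

```python
# Minimal sketch: count conref references in a DITA topic using only the
# Python standard library. The inline topic is an illustrative example.
import xml.etree.ElementTree as ET
from collections import Counter

topic_xml = """<topic id="install">
  <body>
    <p conref="warnings.dita#warnings/power"/>
    <p>Local content, authored in place.</p>
    <p conref="warnings.dita#warnings/power"/>
  </body>
</topic>"""

targets = Counter()
for elem in ET.fromstring(topic_xml).iter():
    conref = elem.get("conref")
    if conref:
        targets[conref] += 1

for target, count in targets.most_common():
    print(count, target)  # each conref target and how often it is reused
```

Pointing the same loop at a folder of .dita files would give a whole-library reuse count, again without any CMS.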

Content reuse isn’t necessarily good if all you have is a big tangle. It does offer a much-needed way out of the commonly used cut-and-paste approach.
