One of the challenges when considering moving to a single sourcing authoring environment, such as DITA, is determining the Return on Investment. This often boils down to a key question: how much content can you actually re-use?
Organisations typically attempt to answer this question in a number of ways:
- Conducting a semi-manual information audit of the existing content to identify the number of times the same chunks of information are repeated. Unfortunately, this can be a large and lengthy exercise.
- If the content is translated, getting reports from Translation Memory tools indicating where content might be repeated. Unfortunately, if you’re not translating your content, you won’t have this information.
- Using benchmark industry measures. Unfortunately, these can vary enormously (from 25% to 75% content re-use), and your situation may be totally different.
In an ideal world, you’d be able to use an application that could look at all your content and give you a report telling you where content is repeated. It could do the “heavy lifting” in the information audit automatically for you. This kind of programmatic analysis of reuse within existing content, at an affordable cost, is now starting to become possible.
A great deal of research has been carried out, in the fields of computational linguistics and probability, into using n-grams to find common content. An n-gram is a contiguous sequence of letters or words in a piece of text.
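To make the idea concrete, here is a minimal sketch (in Python, purely illustrative) of extracting word n-grams and finding the sequences that two pieces of content share:

```python
from collections import Counter

def word_ngrams(text, n=5):
    """Return a Counter of every contiguous n-word sequence in the text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def shared_ngrams(doc_a, doc_b, n=5):
    """Return the n-grams that occur in both documents."""
    return word_ngrams(doc_a, n) & word_ngrams(doc_b, n)

# Hypothetical example: two topics that repeat the same safety warning.
a = "Before you start ensure the power supply is disconnected from the mains"
b = "Warning ensure the power supply is disconnected from the mains before servicing"
print(shared_ngrams(a, b))
```

The more (and longer) n-grams two documents share, the stronger the evidence that content has been copied between them.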
Unfortunately, this approach has been impractical for the large volumes of content found in the real world: it has required too much in-memory computation. The problem is also complicated by other factors; according to Paul D. Clough:
In most cases, text is reused under certain constraints that cause the text to be rewritten (e.g. time, space, change of tense etc.). Accurately measuring text reuse therefore involves identifying not only verbatim text, but also text that has undergone a number of, potentially complex, rewriting transformations.
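This isn’t Clough’s method, but a simple illustration (using Python’s standard-library difflib, chosen here purely for the example) of how a similarity score can flag reused-but-rewritten sentences that exact matching would miss:

```python
from difflib import SequenceMatcher

def similarity(sentence_a, sentence_b):
    """Return a similarity ratio between 0 and 1 for two sentences."""
    return SequenceMatcher(None, sentence_a.lower(), sentence_b.lower()).ratio()

original  = "Disconnect the power supply before opening the casing."
rewritten = "Disconnect the power supply before you open the casing."
unrelated = "Click Save to store your changes."

print(similarity(original, rewritten))  # high, despite the change of wording
print(similarity(original, unrelated))  # low
```

A threshold on a score like this is one crude way to catch reuse that has undergone simple rewriting, though it says nothing about the more complex transformations Clough describes.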
The good news is, with recent developments in the field of big data, it’s becoming easier to tackle the kind of large-scale analysis of unstructured data that this requires. For example:
- Richard Marsden has developed an approach for analysing n-grams that offloads much of the in-memory requirements (a simplified sketch of the general idea follows this list).
- Researchers at the University of Sheffield have reported a new approach that identifies both verbatim and reused-but-rewritten common texts.
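As an illustration of the “offload the memory” idea mentioned above (a hand-rolled sketch, not Marsden’s actual implementation or the Sheffield researchers’ algorithm), n-grams can be partitioned out to temporary files by hash, so that only one partition’s counts are ever held in memory at once:

```python
import os
import tempfile
from collections import Counter

def count_repeated_ngrams(documents, n=5, buckets=16):
    """Find word n-grams that occur more than once across a set of documents,
    spilling intermediate data to disk rather than holding it all in memory."""
    tmp_dir = tempfile.mkdtemp()
    paths = [os.path.join(tmp_dir, f"bucket_{i}.txt") for i in range(buckets)]
    files = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        # "Map" phase: stream every n-gram out to a bucket file chosen by hash.
        for doc in documents:
            words = doc.lower().split()
            for i in range(len(words) - n + 1):
                ngram = " ".join(words[i:i + n])
                files[hash(ngram) % buckets].write(ngram + "\n")
    finally:
        for f in files:
            f.close()

    # "Reduce" phase: count one bucket at a time, keeping only repeated n-grams.
    repeated = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            counts = Counter(line.rstrip("\n") for line in f)
        repeated.update({ng: c for ng, c in counts.items() if c > 1})
    return repeated
```

Real tools push the same pattern much further (MapReduce, distributed storage, and so on), but the principle of keeping the working set out of memory is the same.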
Proprietary solution providers are also responding. For example, DCL has announced Harmonizer On Demand, an online portal to its proprietary Harmonizer “content redundancy solution”.
Although it’s hard to determine how quickly these developments will appear, it seems likely that the technology we can use for identifying and quantifying common, re-usable content will improve considerably. As a result, it will be much easier for organisations to measure the benefits they will get by moving to a single sourcing authoring environment.