The legacy system could produce an HTML rendering of the Word documents it stored. By converting this HTML to XHTML (using the html2xhtml command-line tool) and then transforming the XHTML into an S1000D data module using XSLT, it was possible to obtain a rough-and-ready but valid S1000D document that could be manually enhanced at a later date.
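The XSLT step can be illustrated with a minimal sketch. It assumes the XHTML has already been produced and uses the XSLT 1.0 processor built into the JDK. The stylesheet and the class name `XhtmlToDataModuleSketch` are hypothetical and heavily simplified, not the stylesheet used in the project: only the page title and paragraphs are mapped, and the `identAndStatusSection` is omitted, so the result is a skeleton rather than a schema-valid data module.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XhtmlToDataModuleSketch {

    // Hypothetical stylesheet (an assumption for illustration): maps the XHTML
    // <title> and every <p> inside <body> to a single levelledPara.
    static final String XSLT = String.join("\n",
        "<xsl:stylesheet version='1.0'",
        "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'",
        "    xmlns:h='http://www.w3.org/1999/xhtml'",
        "    exclude-result-prefixes='h'>",
        "  <xsl:output method='xml' indent='no'/>",
        "  <xsl:template match='/'>",
        "    <dmodule><content><description>",
        "      <levelledPara>",
        "        <title><xsl:value-of select='normalize-space(//h:title)'/></title>",
        "        <xsl:apply-templates select='//h:body//h:p'/>",
        "      </levelledPara>",
        "    </description></content></dmodule>",
        "  </xsl:template>",
        "  <xsl:template match='h:p'>",
        "    <para><xsl:value-of select='normalize-space(.)'/></para>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Applies the stylesheet to an XHTML string and returns the result as text.
    static String transform(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample input standing in for the output of the HTML-to-XHTML step.
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
                + "<head><title>Pump maintenance</title></head>"
                + "<body><p>Check the seal.</p><p>Refit the cover.</p></body></html>";
        System.out.println(transform(xhtml));
    }
}
```

A production stylesheet would also have to generate the identification and status metadata and handle headings, lists, tables and images, which is where most of the later manual enhancement would be concentrated.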
For PDF documents, it was decided to create the data module from text extracted from the PDF with the iText library. Although none of the document’s internal structure could be preserved in this way, the PDF itself could also be stored in the CSDB as a binary resource. As an interim solution, users would see the PDF document when accessing the data module until the data module had been enhanced to resemble the PDF, at which point the PDF could be removed.
While this approach worked well for the vast majority of documents, a small number of very large documents were later found to be too big to manage efficiently in the CSDB. Working with them in an XML editor was slow, or even impossible on certain low-specification workstations, and rendering them for viewing could also be problematic.
Table 3. A word of advice
With the benefit of hindsight, a review of very large documents should have been carried out before they were converted to S1000D. Breaking these documents up in their original format would probably have been easier than breaking them up after they had been converted to S1000D and loaded into the CSDB.
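One way such a review could be started is sketched below: the export folder of source documents is scanned and every file above a size threshold is listed for manual inspection. File size is only an assumed proxy for "too big to manage", and the class name `LargeDocumentAudit` and the 5 MB threshold are hypothetical; a suitable limit would have to be found by experiment.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LargeDocumentAudit {

    // Returns the regular files under root that are larger than thresholdBytes,
    // largest first, so they can be reviewed before conversion.
    static List<Path> findLargeDocuments(Path root, long thresholdBytes) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> size(p) > thresholdBytes)
                    .sorted((a, b) -> Long.compare(size(b), size(a)))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Files.size with the checked exception wrapped, for use inside lambdas.
    static long size(Path p) {
        try {
            return Files.size(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long threshold = 5L * 1024 * 1024; // assumed 5 MB cut-off, not from the source
        for (Path p : findLargeDocuments(root, threshold)) {
            System.out.println(size(p) + "\t" + p);
        }
    }
}
```

The resulting list is only a starting point: whether and where a flagged document should be split remains an editorial decision.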