Extract Data from Fields in an MS Word Document

This recipe will show how to use Content Script APIs to programmatically read structured data stored in XML binding fields within an MS Word document.

MS Word documents can include user-editable fields bound to an underlying data structure (XML binding). This feature is useful for supporting a bidirectional flow of information between an information system and a document: data can be generated on the system and pushed to the document, and data collected in the document by an editor can later be extracted by the system once the file is uploaded.

More about XML binding in MS Word documents can be found here.

This bidirectional flow makes it possible, for example, to keep the system metadata related to the document and the document content synchronized in a fully automated fashion.

In order to set up a process leveraging this technology, you will need:

  • A .docx MS Word document that includes a custom XML binding with a known structure.
  • Content Script code capable of reading the data from the binding and using it within your applications.

For example, the custom XML data bound to the document could have the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<controlledDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
					xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
					xmlns="http://custom.answermodules.com/controlledDocument">
	<docID>00000</docID>
	<title>Sample title ACME Corporation</title>
	<author>Sample author</author>
	<owner>1000</owner>
	<status>Open</status>
	<department>Finance</department>
</controlledDocument>

Once associated with an MS Word document, the above XML data structure can be inspected through the “Developer” tab of the MS Word editor.

Dynamic fields can be added to the document and bound to the XML data, so that portions of the document are automatically populated with the structured data.

Data inserted within fields by a user editing the document is automatically stored within the XML data structure.
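Because the bound values live in ordinary namespaced XML, they can be read with standard XML tooling. The following is a minimal sketch in plain Groovy (not the Content Script API) that parses a copy of the sample structure above with the JDK DOM parser and collects the field values into a map. The readField closure and the values map are illustrative names, not part of any product API.

```groovy
import javax.xml.parsers.DocumentBuilderFactory

// Copy of the sample custom XML shown earlier (unused xsi/xsd declarations omitted)
def sampleXml = '''<?xml version="1.0" encoding="UTF-8"?>
<controlledDocument xmlns="http://custom.answermodules.com/controlledDocument">
    <docID>00000</docID>
    <title>Sample title ACME Corporation</title>
    <author>Sample author</author>
    <owner>1000</owner>
    <status>Open</status>
    <department>Finance</department>
</controlledDocument>'''

// The elements belong to a default namespace, so parse in namespace-aware mode
def factory = DocumentBuilderFactory.newInstance()
factory.namespaceAware = true
def dom = factory.newDocumentBuilder()
                 .parse(new ByteArrayInputStream(sampleXml.getBytes('UTF-8')))

def ns = 'http://custom.answermodules.com/controlledDocument'

// Illustrative helper: text of the first matching element, or null if it is absent
def readField = { String name ->
    def nodes = dom.getElementsByTagNameNS(ns, name)
    nodes.getLength() > 0 ? nodes.item(0).textContent.trim() : null
}

def values = ['docID', 'title', 'author', 'owner', 'status', 'department']
        .collectEntries { [(it): readField(it)] }

// The owner is stored as text; convert it only when a value is present
def ownerId = values.owner ? (values.owner as Long) : null

println values
println "Owner ID: ${ownerId}"
```

Returning null for a missing element keeps the lookup safe when a document was produced from an older template that lacks one of the fields.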

As mentioned above, Content Script includes APIs that can be used to access the XML data stored within the MS Word document. The following script reads the bound values and uses them to update the document's metadata.

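// newNodeID is expected to be supplied by the execution context (for example, the ID of an uploaded document)
def newNode = asCSNode(newNodeID)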

try{

    // Process only Document nodes (subtype 144) whose MIME type is .docx
    if(newNode.subtype == 144 && newNode.mimeType == "application/vnd.openxmlformats-officedocument.wordprocessingml.document"){ 
        
        // Load contents from a Docx file for processing
        def doc = docx.loadWordDoc(newNode)

        // Fetch and parse the custom XML data binding part identified by the given name.
        // In this example the part to read is 'item1.xml' (Template offer.docx/customXml/item1.xml).
        // The part name may differ depending on the changes made to the document.
        def xml = doc.getCustomXmlBinding("item1.xml")

        // Read value of xml fields and update metadata
        if(newNode."Controlled Document"){
            
            newNode."Controlled Document"."Doc ID"     = xml.docID[0].text().replaceAll("[\n\r]", "")
            newNode."Controlled Document"."Title"      = xml.title[0].text()
            newNode."Controlled Document"."Author"     = xml.author[0].text()
            newNode."Controlled Document"."Owner"      = (xml.owner[0].text() ? (xml.owner[0].text() as Long) : null)
            newNode."Controlled Document"."Status"     = xml.status[0].text()
            newNode."Controlled Document"."Department" = xml.department[0].text()
            
            newNode.update()
        }
    }

}catch(e){
    log.error("Unable to read data for the document ${newNodeID}",e)
}
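Since the name of the custom XML part may differ from one document to another, it can help to check which parts a given file actually contains. A .docx file is a ZIP package, so the sketch below lists the customXml/itemN.xml entries of a local copy using only the JDK. It is plain Groovy rather than Content Script, listCustomXmlParts is a hypothetical helper name, and the small package built at the end is a stand-in for a real document so that the example can run on its own.

```groovy
import java.util.zip.ZipEntry
import java.util.zip.ZipFile
import java.util.zip.ZipOutputStream

// Hypothetical helper: names of the custom XML data parts inside a .docx package.
// Companion entries such as customXml/itemProps1.xml are deliberately excluded.
List<String> listCustomXmlParts(File docxFile) {
    def zip = new ZipFile(docxFile)
    try {
        return zip.entries().toList()
                  .collect { it.name }
                  .findAll { it ==~ /customXml\/item\d+\.xml/ }
                  .sort()
    } finally {
        zip.close()
    }
}

// Build a tiny stand-in package for demonstration purposes only
def demo = File.createTempFile('sample', '.docx')
demo.deleteOnExit()
new ZipOutputStream(new FileOutputStream(demo)).withStream { out ->
    ['word/document.xml', 'customXml/item1.xml', 'customXml/itemProps1.xml'].each { name ->
        out.putNextEntry(new ZipEntry(name))
        out.write('<x/>'.getBytes('UTF-8'))
        out.closeEntry()
    }
}

def parts = listCustomXmlParts(demo)
println parts
```

The file name portion of a listed entry (for example item1.xml) is the value to pass to getCustomXmlBinding in the script above.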