Extract Data from Fields in an MS Word document
This recipe will show how to use Content Script APIs to programmatically read structured data stored in XML binding fields within an MS Word document.
MS Word documents can include user-editable fields bound to an underlying data structure (XML binding). This feature is useful for supporting a bidirectional flow of information between an information system and a document: data can be generated on the system and pushed to the document, and data entered in the document by an editor can later be extracted by the system once the file is uploaded.
More about XML binding in MS Word documents can be found here.
This makes it possible, for example, to keep the system metadata related to the document and the document content synchronized in a fully automated fashion.
In order to set up a process leveraging this technology, you will need:
- A .docx MS Word document that includes a custom XML binding with a known structure.
- Content Script code capable of reading the data from the binding and using it within your applications.
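Sample data for the custom XML part (element names as read by the script in this recipe):
- docID: 00000
- title: Sample title ACME Corporation
- author: Sample author
- owner: 1000
- status: Open
- department: Finance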
Once associated with an MS Word document, the above XML data structure can be inspected through the "Developer" tab of the MS Word editor.
Dynamic fields can be added to the document and bound to the XML data, so that portions of the document are populated automatically with the structured data.
Data entered into the fields by a user editing the document is automatically stored in the XML data structure.
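To see where this data lives in the file itself, it helps to know that a .docx file is a ZIP package and that custom XML parts are stored as customXml/itemN.xml entries inside it. The following is a minimal sketch in plain Groovy (no Content Script APIs involved). It builds a tiny in-memory package and lists the custom XML parts it contains. The root element name controlledDocument and the package contents are assumptions made for illustration, not taken from a real template.

```groovy
import java.util.zip.ZipEntry
import java.util.zip.ZipInputStream
import java.util.zip.ZipOutputStream

// Hypothetical custom XML part; the root element name is an assumption
def sampleXml = '''<controlledDocument>
  <docID>00000</docID>
  <status>Open</status>
</controlledDocument>'''

// Build a minimal ZIP in memory that mimics the layout of a .docx package
def entries = [
    "word/document.xml"  : "<document/>",
    "customXml/item1.xml": sampleXml
]
def buffer = new ByteArrayOutputStream()
def zipOut = new ZipOutputStream(buffer)
entries.each { name, content ->
    zipOut.putNextEntry(new ZipEntry(name))
    zipOut.write(content.getBytes("UTF-8"))
    zipOut.closeEntry()
}
zipOut.close()

// Walk the package and collect every custom XML part it contains
def customParts = [:]
def zipIn = new ZipInputStream(new ByteArrayInputStream(buffer.toByteArray()))
def entry
while ((entry = zipIn.nextEntry) != null) {
    if (entry.name ==~ /customXml\/item\d+\.xml/) {
        def partBytes = new ByteArrayOutputStream()
        byte[] chunk = new byte[4096]
        int count
        while ((count = zipIn.read(chunk)) != -1) {
            partBytes.write(chunk, 0, count)
        }
        customParts[entry.name] = partBytes.toString("UTF-8")
    }
}
zipIn.close()

println customParts.keySet()
```

Running the same loop against a real template is a quick way to check which itemN.xml part holds the binding before referencing it by name in a script.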
As mentioned above, Content Script includes APIs that can be used to access the XML data stored within the MS Word document.
def newNode = asCSNode(newNodeID)
try {
    // docx document
    if (newNode.subtype == 144 && newNode.mimeType == "application/vnd.openxmlformats-officedocument.wordprocessingml.document") {
        // Load contents from a Docx file for processing
        def doc = docx.loadWordDoc(newNode)
        // Fetches and parses a custom-xml databinding file identified by the given name
        // In this example the xml file to be read in the word document is 'item1.xml' (Template offer.docx/customXml/item1.xml)
        // The xml file may differ depending on the changes made to the document
        def xml = doc.getCustomXmlBinding("item1.xml")
        // Read value of xml fields and update metadata
        if (newNode."Controlled Document") {
            newNode."Controlled Document"."Doc ID" = xml.docID[0].text().replaceAll("[\n\r]", "")
            newNode."Controlled Document"."Title" = xml.title[0].text()
            newNode."Controlled Document"."Author" = xml.author[0].text()
            newNode."Controlled Document"."Owner" = (xml.owner[0].text() ? (xml.owner[0].text() as Long) : null)
            newNode."Controlled Document"."Status" = xml.status[0].text()
            newNode."Controlled Document"."Department" = xml.department[0].text()
            newNode.update()
        }
    }
} catch (e) {
    log.error("Unable to read data for the document ${newNodeID}", e)
}
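The field-reading logic in the script above can be tried outside Content Server. The following is a minimal sketch in plain Groovy, assuming the parsed binding supports the same GPath navigation as an XmlSlurper result (the actual return type of getCustomXmlBinding is not stated here). The helpers fieldText and fieldLong and the root element controlledDocument are hypothetical names introduced for illustration.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; on Groovy 2.x the class is groovy.util.XmlSlurper

// Hypothetical helper: text of the first matching child element with
// line breaks removed, or null when the element is missing or blank
def fieldText(xml, String name) {
    def matches = xml."$name"
    if (matches.size() == 0) return null
    def value = matches[0].text().replaceAll("[\n\r]", "").trim()
    return value ? value : null
}

// Hypothetical helper: numeric fields (such as the owner ID) become a Long,
// anything blank or non-numeric becomes null instead of raising an error
def fieldLong(xml, String name) {
    def value = fieldText(xml, name)
    return value?.isLong() ? (value as Long) : null
}

// Sample data; the root element name is an assumption
def xml = new XmlSlurper().parseText('''<controlledDocument>
  <docID>
00000
</docID>
  <title>Sample title ACME Corporation</title>
  <owner>1000</owner>
  <status></status>
</controlledDocument>''')

println fieldText(xml, "docID")
println fieldLong(xml, "owner")
println fieldText(xml, "status")
println fieldText(xml, "department")
```

Compared with indexing `[0]` directly, checking the match count first gives a null for an absent element, which may be preferable when the template structure changes between document versions.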