The ways to retrieve text from the document are:
· Use Document.Save with SaveFormat.Text to save as plain text into a file or stream.
· Use Node.ToString with SaveFormat.Text. Internally, this saves the node as text into a memory stream and returns the resulting string.
· Use Node.GetText to retrieve text with all Microsoft Word control characters including field codes.
· Implement a custom DocumentVisitor to perform customized extraction.
A Word document can contain control characters that designate special elements such as a field, the end of a cell, the end of a section and so on. The full list of possible Word control characters is defined in the ControlChar class. The Node.GetText method returns text with all of the control characters present in the node.
Example
Shows the difference between calling the GetText and ToString methods on a node.
[Java]
Document doc = new Document();
// Enter a dummy field into the document.
DocumentBuilder builder = new DocumentBuilder(doc);
builder.insertField("MERGEFIELD Field");
// GetText will retrieve all field codes and special characters
System.out.println("GetText() Result: " + doc.getText());
// ToString will export the node to the specified format. When converted to text it will not retrieve field codes
// or special characters, but will still contain some natural formatting characters such as paragraph markers.
// This is the same as "viewing" the document as if it were opened in a text editor.
System.out.println("ToString() Result: " + doc.toString(SaveFormat.TEXT));
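To make the difference between the two results more concrete, here is a minimal sketch in plain Java, without the library, of how field codes could be filtered out of a raw string of the kind GetText returns. The class name WordControlCharDemo, the method stripFieldCodes and the sample string are hypothetical, and the character values (0x13, 0x14, 0x15 for field start, separator and end) are assumed to match the constants defined in ControlChar. This is not how the library implements text export.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class WordControlCharDemo {
    // Assumed values of the Word field control characters (see ControlChar).
    static final char FIELD_START = 0x13;
    static final char FIELD_SEPARATOR = 0x14;
    static final char FIELD_END = 0x15;
    static final char PARAGRAPH = '\r';

    // Drops field control characters and field codes, keeps field results
    // and all other text (including paragraph marks).
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while we are in its code part,
        // false once its separator has been seen (result part).
        Deque<Boolean> inCode = new ArrayDeque<>();
        for (char c : raw.toCharArray()) {
            if (c == FIELD_START) {
                inCode.push(true);
            } else if (c == FIELD_SEPARATOR) {
                if (!inCode.isEmpty()) {
                    inCode.pop();
                    inCode.push(false);
                }
            } else if (c == FIELD_END) {
                if (!inCode.isEmpty()) {
                    inCode.pop();
                }
            } else if (!inCode.contains(true)) {
                // Not inside the code part of any enclosing field.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-written sample, not actual library output.
        String raw = "" + FIELD_START + " MERGEFIELD Field " + FIELD_SEPARATOR
                + "Value" + FIELD_END + PARAGRAPH;
        String clean = stripFieldCodes(raw);
        // Make the paragraph mark visible when printing.
        System.out.println(clean.replace("\r", "\\r"));
    }
}
```

A stack is used so that nested fields are handled: text is kept only when no enclosing field is still in its code part.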
Saving a document in the TXT format, as in the next example, does the following:
· Filters out field characters and field codes, shape, footnote, endnote and comment references.
· Replaces end of paragraph ControlChar.Cr characters with ControlChar.CrLf combinations.
· Uses UTF-8 encoding.
Example
Shows how to save a document in TXT format.
[Java]
Document doc = new Document(getMyDir() + "Document.doc");
doc.save(getMyDir() + "Document.ConvertToTxt Out.txt");
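The line-ending and encoding behavior listed above can be illustrated with a short plain-Java sketch that does not use the library. The class PlainTextLineEndings and its methods crToCrLf and toUtf8 are hypothetical names chosen for this illustration. They only mimic the described result (Cr replaced by CrLf, output encoded as UTF-8) and say nothing about how the save operation is implemented internally.

```java
import java.nio.charset.StandardCharsets;

class PlainTextLineEndings {
    // Replaces each end-of-paragraph CR with a CR LF pair.
    // Existing CR LF pairs are normalized first so they are not doubled.
    static String crToCrLf(String text) {
        return text.replace("\r\n", "\r").replace("\r", "\r\n");
    }

    // Encodes the text as UTF-8 bytes, ready to be written to a file or stream.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hand-written sample with Word-style CR paragraph marks.
        String wordText = "First paragraph\rSecond paragraph\r";
        String plain = crToCrLf(wordText);
        byte[] bytes = toUtf8(plain);
        System.out.println("Characters: " + plain.length() + ", bytes: " + bytes.length);
    }
}
```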