You can download the complete source code of the ExtractContent sample here.
A common requirement when working with documents is to extract specific content from a range within the document. This content can consist of complex features such as paragraphs, tables, images etc. Regardless of what content needs to be extracted, the way to extract it is always determined by the nodes chosen to mark the start and end of the range. These could be entire bodies of text or simple runs of text. There are many possible situations and therefore many different node types to consider when extracting content. For instance, you may want to extract content between:
· Two specific paragraphs in the document.
· Specific runs of text.
· Different types of fields, for example merge fields.
· The start and end of a bookmark or comment range.
· Different bodies of text contained in separate sections.
In some situations you may even want to combine different marker types, for example, extracting content between a paragraph and a field, or between a run and a bookmark.
Often the goal of extracting this content is to duplicate or save it separately into a new document. For example, you may wish to extract content and:
· Copy it to a separate document.
· Render a specific portion of a document to PDF or an image.
· Duplicate the content in the document many times.
· Work with this content separate from the rest of the document.
This is easy to achieve using Aspose.Words and the code below. This article provides the full implementation along with samples of common scenarios that use it. These samples are just a few demonstrations of the many ways this method can be used. Some day this functionality will be a part of the public API and the extra code here will not be required. Feel free to post your requests regarding this functionality on the Aspose.Words forum here.
The code in this article addresses all of the possible situations above with one generalized and reusable method.
The general outline of this technique involves:
1. Gathering the nodes which dictate the area of content that will be extracted from your document. Retrieving these nodes is handled by the user in their code, based on what they want to be extracted.
2. Passing these nodes to the ExtractContent method which is provided below. You must also pass a boolean parameter which states if these nodes that act as markers should be included in the extraction or not.
3. The method will return a list of cloned (copied) nodes of the content specified to be extracted. You can now use this list in any way applicable, for example, creating a new document containing only the selected content.
We will work with the document below throughout this article. As you can see, it contains a variety of content. Also note that the document contains a second section beginning in the middle of the first page. A bookmark and a comment are also present in the document but are not visible in the screenshot below.
To extract the content from your document you need to call the ExtractContent method below and pass the appropriate parameters.
The underlying basis of this method involves finding block level nodes (paragraphs and tables) and cloning them to create identical copies. If the marker nodes passed are block level then the method is able to simply copy the content on that level and add it to the list.
However, if the marker nodes are inline (a child of a paragraph) then the situation becomes more complex, as it is necessary to split the paragraph at the inline node, be it a run, bookmark, field etc.
Content in the cloned parent nodes not present between the markers is removed. This process is used to ensure that the inline nodes will still retain the formatting of the parent paragraph.
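To make the clone-and-trim idea concrete, here is a minimal sketch that uses a simplified stand-in for a paragraph rather than Aspose.Words types. The SketchParagraph class, its style label and its extractRuns method are hypothetical names introduced only for illustration; they are not part of the Aspose.Words API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a paragraph: a formatting label plus an ordered list of run texts. */
final class SketchParagraph {
    final String style;
    final List<String> runs;

    SketchParagraph(String style, List<String> runs) {
        this.style = style;
        this.runs = new ArrayList<>(runs); // defensive copy so clones never share run lists
    }

    /** Copies the paragraph together with its runs, similar in spirit to a deep clone. */
    SketchParagraph copy() {
        return new SketchParagraph(style, runs);
    }

    /** Clones this paragraph, then removes every run that lies outside the two marker runs. */
    SketchParagraph extractRuns(int startRun, int endRun, boolean isInclusive) {
        SketchParagraph clone = copy();
        int from = isInclusive ? startRun : startRun + 1; // first run to keep
        int to = isInclusive ? endRun : endRun - 1;       // last run to keep
        // Remove the runs after the end marker first so the leading indices stay valid.
        for (int i = clone.runs.size() - 1; i > to; i--) {
            clone.runs.remove(i);
        }
        // Then remove the runs before the start marker.
        for (int i = Math.min(from, clone.runs.size()) - 1; i >= 0; i--) {
            clone.runs.remove(i);
        }
        return clone;
    }
}
```

Because the trimming happens on a copy of the whole paragraph, the surviving runs keep the paragraph level formatting (the style label here) and the original paragraph is left untouched.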
The method also runs checks on the nodes passed as parameters and throws an exception if either node is invalid.
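One of these checks is that the end marker does not come before the start marker. The sketch below illustrates the idea on a simplified tree; SketchNode, blockAncestor and isInOrder are hypothetical names used for illustration only, not Aspose.Words types. It also shows why an inline marker has to be mapped to its block level ancestor before its position in the body can be compared.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, minimal tree node with a parent pointer and ordered children. */
final class SketchNode {
    final String name;
    SketchNode parent;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name) {
        this.name = name;
    }

    /** Appends a child, sets its parent and returns the child for convenient chaining. */
    SketchNode add(SketchNode child) {
        child.parent = this;
        children.add(child);
        return child;
    }

    /** Climbs from a descendant of body to the direct child of body that contains it. */
    static SketchNode blockAncestor(SketchNode node, SketchNode body) {
        SketchNode current = node;
        while (current != null && current.parent != body) {
            current = current.parent;
        }
        return current; // null when the node is not located below body
    }

    /** Returns true when the block holding start is at or before the block holding end. */
    static boolean isInOrder(SketchNode start, SketchNode end, SketchNode body) {
        int startIndex = body.children.indexOf(blockAncestor(start, body));
        int endIndex = body.children.indexOf(blockAncestor(end, body));
        return startIndex >= 0 && endIndex >= 0 && startIndex <= endIndex;
    }
}
```

Looking up an inline node directly among the children of the body finds nothing, which is why the comparison is made on the block level ancestors instead.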
The parameters to be passed to this method are:
1. StartNode and EndNode:
The first two parameters are the nodes which define where the extraction of the content is to begin and to end respectively. These nodes can be either block level (Paragraph, Table) or inline level (e.g. Run, FieldStart, BookmarkStart etc.).
a. To pass a field you should pass the corresponding FieldStart object.
b. To pass bookmarks, the BookmarkStart and BookmarkEnd nodes should be passed.
c. To pass comments, the CommentRangeStart and CommentRangeEnd nodes should be used.
2. IsInclusive:
Defines if the markers are included in the extraction or not. If this option is set to false and the same node or consecutive nodes are passed, then an empty list will be returned.
a. If a FieldStart node is passed then this option defines if the whole field is to be included or excluded.
b. If a BookmarkStart or BookmarkEnd node is passed, this option defines if the bookmark is included or just the content between the bookmark range.
c. If a CommentRangeStart or CommentRangeEnd node is passed, this option defines if the comment itself is to be included or just the content in the comment range.
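For block level markers, the effect of IsInclusive can be illustrated with a small model in which each section is simply a list of block names. This is only a sketch under that assumption: BlockRangeSketch and its extract method are hypothetical, and the caller is assumed to pass an end position that is at or after the start position.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model: a document is a list of sections, each a list of block names. */
final class BlockRangeSketch {
    /**
     * Collects the blocks from (startSection, startBlock) to (endSection, endBlock),
     * moving on to the next section when the current one runs out of blocks.
     * Assumes the end position is at or after the start position.
     */
    static List<String> extract(List<List<String>> sections,
                                int startSection, int startBlock,
                                int endSection, int endBlock,
                                boolean isInclusive) {
        List<String> result = new ArrayList<>();
        int section = startSection;
        int block = startBlock;
        while (true) {
            boolean isStart = section == startSection && block == startBlock;
            boolean isEnd = section == endSection && block == endBlock;
            // Marker blocks are kept only when the extraction is inclusive.
            if (isInclusive || !(isStart || isEnd)) {
                result.add(sections.get(section).get(block));
            }
            if (isEnd) {
                break;
            }
            block++;
            // No more blocks here: the rest of the content is in a later section.
            while (block >= sections.get(section).size()) {
                section++;
                block = 0;
            }
        }
        return result;
    }
}
```

With the option set to false, passing the same block twice or two consecutive blocks yields an empty list, which matches the behavior described for the IsInclusive parameter.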
The implementation of the ExtractContent method is found below. This method will be referred to in the scenarios in this article.
Example
This is a method which extracts blocks of content from a document between specified nodes.
[Java]
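// Imports required by the code in this article (place them at the top of your class file).
import com.aspose.words.*;
import java.util.ArrayList;
import java.util.Collections;
/**
* Extracts a range of nodes from a document found between specified markers and returns a copy of those nodes. Content can be extracted
* between inline nodes, block level nodes, and also special nodes such as Comments or Bookmarks. Any combination of different marker types can be used.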
*
* @param startNode The node which defines where to start the extraction from the document. This node can be block or inline level of a body.
* @param endNode The node which defines where to stop the extraction from the document. This node can be block or inline level of body.
* @param isInclusive Should the marker nodes be included.
*/
public static ArrayList extractContent(Node startNode, Node endNode, boolean isInclusive) throws Exception
{
// First check that the nodes passed to this method are valid for use.
verifyParameterNodes(startNode, endNode);
// Create a list to store the extracted nodes.
ArrayList nodes = new ArrayList();
// Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
Node originalStartNode = startNode;
Node originalEndNode = endNode;
// Extract content based on block level nodes (paragraphs and tables). Traverse through parent nodes to find them.
// We will split the content of the first and last nodes depending on whether the marker nodes are inline.
while (startNode.getParentNode().getNodeType() != NodeType.BODY)
startNode = startNode.getParentNode();
while (endNode.getParentNode().getNodeType() != NodeType.BODY)
endNode = endNode.getParentNode();
boolean isExtracting = true;
boolean isStartingNode = true;
boolean isEndingNode;
// The current node we are extracting from the document.
Node currNode = startNode;
// Begin extracting content. Process all block level nodes and specifically split the first and last nodes when needed so paragraph formatting is retained.
// This method is a little more complex than a regular extractor as we need to factor in extracting using inline nodes, fields, bookmarks etc. to make it really useful.
while (isExtracting)
{
// Clone the current node and its children to obtain a copy.
CompositeNode cloneNode = (CompositeNode)currNode.deepClone(true);
isEndingNode = currNode.equals(endNode);
if(isStartingNode || isEndingNode)
{
// We need to process each marker separately so pass it off to a separate method instead.
if (isStartingNode)
{
processMarker(cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
isStartingNode = false;
}
// The conditional needs to be separate as the block level start and end markers may be the same node.
if (isEndingNode)
{
processMarker(cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
isExtracting = false;
}
}
else
// Node is not a start or end marker, simply add the copy to the list.
nodes.add(cloneNode);
// Move to the next node and extract it. If next node is null that means the rest of the content is found in a different section.
if (currNode.getNextSibling() == null && isExtracting)
{
// Move to the next section.
Section nextSection = (Section)currNode.getAncestor(NodeType.SECTION).getNextSibling();
currNode = nextSection.getBody().getFirstChild();
}
else
{
// Move to the next node in the body.
currNode = currNode.getNextSibling();
}
}
// Return the nodes between the node markers.
return nodes;
}
We will also define a custom method to easily generate a document from extracted nodes. This method is used in many of the scenarios below and simply creates a new document and imports the extracted content into it.
Example
This method takes a list of nodes and inserts them into a new document.
[Java]
public static Document generateDocument(Document srcDoc, ArrayList nodes) throws Exception
{
// Create a blank document.
Document dstDoc = new Document();
// Remove the first paragraph from the empty document.
dstDoc.getFirstSection().getBody().removeAllChildren();
// Import each node from the list into the new document. Keep the original formatting of the node.
NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
for (Node node : (Iterable<Node>) nodes)
{
Node importNode = importer.importNode(node, true);
dstDoc.getFirstSection().getBody().appendChild(importNode);
}
// Return the generated document.
return dstDoc;
}
The helper methods below are called internally by the main extraction method. They are required but, as they are not called directly by the user, it is not necessary to discuss them further.
Example
The helper methods used by the ExtractContent method.
[Java]
/**
* Checks the input parameters are correct and can be used. Throws an exception if there is any problem.
*/
private static void verifyParameterNodes(Node startNode, Node endNode) throws Exception
{
// The order in which these checks are done is important.
if (startNode == null)
throw new IllegalArgumentException("Start node cannot be null");
if (endNode == null)
throw new IllegalArgumentException("End node cannot be null");
if (!startNode.getDocument().equals(endNode.getDocument()))
throw new IllegalArgumentException("Start node and end node must belong to the same document");
if (startNode.getAncestor(NodeType.BODY) == null || endNode.getAncestor(NodeType.BODY) == null)
throw new IllegalArgumentException("Start node and end node must be a child or descendant of a body");
// Check the end node is after the start node in the DOM tree
// First check if they are in different sections, then if they're not check their position in the body of the same section they are in.
Section startSection = (Section)startNode.getAncestor(NodeType.SECTION);
Section endSection = (Section)endNode.getAncestor(NodeType.SECTION);
int startIndex = startSection.getParentNode().indexOf(startSection);
int endIndex = endSection.getParentNode().indexOf(endSection);
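if (startIndex == endIndex)
{
// Compare the block level ancestors of the markers, because Body.indexOf returns -1 for a node that is not a direct child of the body.
Node startBlock = startNode;
while (startBlock.getParentNode().getNodeType() != NodeType.BODY)
startBlock = startBlock.getParentNode();
Node endBlock = endNode;
while (endBlock.getParentNode().getNodeType() != NodeType.BODY)
endBlock = endBlock.getParentNode();
if (startSection.getBody().indexOf(startBlock) > endSection.getBody().indexOf(endBlock))
throw new IllegalArgumentException("The end node must be after the start node in the body");
}
else if (startIndex > endIndex)
throw new IllegalArgumentException("The section of the end node must be after the section of the start node");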
}
/**
* Checks if a node passed is an inline node.
*/
private static boolean isInline(Node node) throws Exception
{
// Test if the node is a descendant of a Paragraph or Table node and is not itself a paragraph or a table.
// A paragraph inside a Comment, which is itself a descendant of a paragraph, is possible.
return ((node.getAncestor(NodeType.PARAGRAPH) != null || node.getAncestor(NodeType.TABLE) != null) && !(node.getNodeType() == NodeType.PARAGRAPH || node.getNodeType() == NodeType.TABLE));
}
/**
* Removes the content before or after the marker in the cloned node depending on the type of marker.
*/
private static void processMarker(CompositeNode cloneNode, ArrayList nodes, Node node, boolean isInclusive, boolean isStartMarker, boolean isEndMarker) throws Exception
{
// If we are dealing with a block level node just see if it should be included and add it to the list.
if(!isInline(node))
{
// Don't add the node twice if the markers are the same node
if(!(isStartMarker && isEndMarker))
{
if (isInclusive)
nodes.add(cloneNode);
}
return;
}
// If a marker is a FieldStart node check if it's to be included or not.
// We assume for simplicity that the FieldStart and FieldEnd appear in the same paragraph.
if (node.getNodeType() == NodeType.FIELD_START)
{
// If the marker is a start node and is not to be included then skip to the end of the field.
// If the marker is an end node and it is to be included then move to the end of the field so the field will not be removed.
if ((isStartMarker && !isInclusive) || (!isStartMarker && isInclusive))
{
while (node.getNextSibling() != null && node.getNodeType() != NodeType.FIELD_END)
node = node.getNextSibling();
}
}
// If either marker is part of a comment then to include the comment itself we need to move the pointer forward to the Comment
// node found after the CommentRangeEnd node.
if (node.getNodeType() == NodeType.COMMENT_RANGE_END)
{
while (node.getNextSibling() != null && node.getNodeType() != NodeType.COMMENT)
node = node.getNextSibling();
}
// Find the corresponding node in our cloned node by index and return it.
// If the start and end node are the same some child nodes might already have been removed. Subtract the
// difference to get the right index.
int indexDiff = node.getParentNode().getChildNodes().getCount() - cloneNode.getChildNodes().getCount();
// Child node count identical.
if (indexDiff == 0)
node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node));
else
node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node) - indexDiff);
// Remove the nodes up to/from the marker.
boolean isSkip;
boolean isProcessing = true;
boolean isRemoving = isStartMarker;
Node nextNode = cloneNode.getFirstChild();
while (isProcessing && nextNode != null)
{
Node currentNode = nextNode;
isSkip = false;
if (currentNode.equals(node))
{
if (isStartMarker)
{
isProcessing = false;
if (isInclusive)
isRemoving = false;
}
else
{
isRemoving = true;
if (isInclusive)
isSkip = true;
}
}
nextNode = nextNode.getNextSibling();
if (isRemoving && !isSkip)
currentNode.remove();
}
// After processing, the composite node may become empty. If it has, don't include it.
if (!(isStartMarker && isEndMarker))
{
if (cloneNode.hasChildNodes())
nodes.add(cloneNode);
}
}
This demonstrates how to use the method above to extract content between specific paragraphs. In this case, we want to extract the body of the letter found in the first half of the document.
We can tell that this is between the 7th and 11th paragraph.
The code below accomplishes this task. The appropriate paragraphs are retrieved using the CompositeNode.GetChild method on the document and passing the specified indices. We then pass these nodes to the ExtractContent method and state that they are to be included in the extraction. This method will return the copied content between these nodes, which is then inserted into a new document.
Example
Shows how to extract the content between specific paragraphs using the ExtractContent method above.
[Java]
// Load in the document
Document doc = new Document(gDataDir + "TestFile.doc");
// Gather the nodes. The GetChild method uses 0-based index
Paragraph startPara = (Paragraph)doc.getFirstSection().getChild(NodeType.PARAGRAPH, 6, true);
Paragraph endPara = (Paragraph)doc.getFirstSection().getChild(NodeType.PARAGRAPH, 10, true);
// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara, endPara, true);
// Insert the content into a new separate document and save it to disk.
Document dstDoc = generateDocument(doc, extractedNodes);
dstDoc.save(gDataDir + "TestFile.Paragraphs Out.doc");
The output document which contains the two paragraphs that were extracted.
We can extract content between any combination of block level or inline nodes. In the scenario below we will extract the content between the first paragraph and the table in the second section inclusively. We get the marker nodes by calling the CompositeNode.GetChild method on the second section of the document to retrieve the appropriate Paragraph and Table nodes.
For a slight variation let’s instead duplicate the content and insert it below the original.
Example
Shows how to extract the content between a paragraph and table using the ExtractContent method.
[Java]
// Load in the document
Document doc = new Document(gDataDir + "TestFile.doc");
Paragraph startPara = (Paragraph)doc.getLastSection().getChild(NodeType.PARAGRAPH, 2, true);
Table endTable = (Table)doc.getLastSection().getChild(NodeType.TABLE, 0, true);
// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara, endTable, true);
// Reverse the list to make inserting the content back into the document easier.
Collections.reverse(extractedNodes);
while (extractedNodes.size() > 0)
{
// Insert the first node of the reversed list (the last remaining extracted node) directly after the table.
endTable.getParentNode().insertAfter((Node)extractedNodes.get(0), endTable);
// Remove this node from the list after insertion.
extractedNodes.remove(0);
}
// Save the generated document to disk.
doc.save(gDataDir + "TestFile.DuplicatedContent Out.doc");
The content between the paragraph and table has been duplicated below the original.
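The reason for reversing the list before inserting can be seen in a small model that uses plain strings in place of nodes. This is only an illustrative sketch; InsertAfterSketch and duplicateAfter are hypothetical names and do not exist in Aspose.Words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of a body as a list of block names. */
final class InsertAfterSketch {
    /** Returns a copy of body with the extracted blocks inserted, in order, after the reference block. */
    static List<String> duplicateAfter(List<String> body, String reference, List<String> extracted) {
        List<String> result = new ArrayList<>(body);
        int refIndex = result.indexOf(reference);
        if (refIndex < 0) {
            throw new IllegalArgumentException("Reference block is not part of the body");
        }
        List<String> pending = new ArrayList<>(extracted);
        // Every block is inserted directly after the same reference, so each insertion
        // pushes the earlier ones further down. Reversing first restores the original order.
        Collections.reverse(pending);
        for (String block : pending) {
            result.add(refIndex + 1, block);
        }
        return result;
    }
}
```

Inserting without reversing would leave the duplicated blocks in the opposite order, because the last block inserted always ends up closest to the reference.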
You may need to extract the content between paragraphs of the same or different style, such as between paragraphs marked with heading styles.
The code below shows how to achieve this. It is a simple example which will extract the content between the first instance of the “Heading 1” and “Heading 3” styles without extracting the headings as well. To do this we set the last parameter to false, which specifies that the marker nodes should not be included.
In a proper implementation this should be run in a loop to extract content between all paragraphs of these styles from the document. This scenario uses the ParagraphsByStyleName method from the ExtractContentBasedOnStyle sample found here. The extracted content is copied into a new document.
Example
Shows how to extract content between paragraphs with specific styles using the ExtractContent method.
[Java]
// Load in the document
Document doc = new Document(gDataDir + "TestFile.doc");
// Gather a list of the paragraphs using the respective heading styles.
ArrayList parasStyleHeading1 = paragraphsByStyleName(doc, "Heading 1");
ArrayList parasStyleHeading3 = paragraphsByStyleName(doc, "Heading 3");
// Use the first instance of the paragraphs with those styles.
Node startPara1 = (Node)parasStyleHeading1.get(0);
Node endPara1 = (Node)parasStyleHeading3.get(0);
// Extract the content between these nodes in the document. Don't include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara1, endPara1, false);
// Insert the content into a new separate document and save it to disk.
Document dstDoc = generateDocument(doc, extractedNodes);
dstDoc.save(gDataDir + "TestFile.Styles Out.doc");
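The example above handles only the first pair of headings. One possible way to structure the loop for all pairs is sketched below on a simplified model, where the document is reduced to a list of paragraph style names. HeadingRangeSketch and findRanges are hypothetical names, and the pairing rule (each start heading is matched with the next end heading that follows it) is an assumption that you may need to adapt to your documents.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model: the document is a list of paragraph style names in document order. */
final class HeadingRangeSketch {
    /** Pairs each paragraph using startStyle with the next paragraph using endStyle that follows it. */
    static List<int[]> findRanges(List<String> styles, String startStyle, String endStyle) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // index of the currently open start heading, or -1 if none is open
        for (int i = 0; i < styles.size(); i++) {
            String style = styles.get(i);
            if (start < 0 && style.equals(startStyle)) {
                start = i;
            } else if (start >= 0 && style.equals(endStyle)) {
                ranges.add(new int[] {start, i});
                start = -1;
            }
        }
        // A start heading with no matching end heading is left unpaired.
        return ranges;
    }
}
```

Each resulting pair of indices identifies a start and an end paragraph that could then be passed to ExtractContent with the last parameter set to false.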
You can extract content between inline nodes such as a Run as well. Runs from different paragraphs can be passed as markers.
The code below shows how to extract specific text from within a single Paragraph node.
Example
Shows how to extract content between specific runs of the same paragraph using the ExtractContent method.
[Java]
// Load in the document
Document doc = new Document(gDataDir + "TestFile.doc");
// Retrieve a paragraph from the first section.
Paragraph para = (Paragraph)doc.getChild(NodeType.PARAGRAPH, 7, true);
// Use some runs for extraction.
Run startRun = para.getRuns().get(1);
Run endRun = para.getRuns().get(4);
// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = extractContent(startRun, endRun, true);
// Get the node from the list. There should only be one paragraph returned in the list.
Node node = (Node)extractedNodes.get(0);
// Print the text of this node to the console.
System.out.println(node.toString(SaveFormat.TEXT));
The extracted text displayed on the console.
To use a field as a marker, the FieldStart node should be passed. The last parameter to the ExtractContent method will define if the entire field is to be included or not.
Let’s extract the content between the “FullName” merge field and a paragraph in the document. We use the DocumentBuilder.MoveToMergeField(String, Boolean, Boolean) method of the DocumentBuilder class. This moves the builder's cursor to the merge field with the name passed to it; we then retrieve the FieldStart node from the builder's current node.
In our case let’s set the last parameter passed to the ExtractContent method to false to exclude the field from the extraction. We will render the extracted content to PDF.
Example
Shows how to extract content between a specific field and paragraph in the document using the ExtractContent method.
[Java]
// Load in the document
Document doc = new Document(gDataDir + "TestFile.doc");
// Use a document builder to retrieve the field start of a merge field.
DocumentBuilder builder = new DocumentBuilder(doc);
// Pass false as the first boolean parameter so the DocumentBuilder moves to the FieldStart of the field rather than to the position after the field.
// We could also get the FieldStart of a field using the GetChild method as in the other examples.
builder.moveToMergeField("Fullname", false, false);
// The builder cursor should be positioned at the start of the field.
FieldStart startField = (FieldStart)builder.getCurrentNode();
Paragraph endPara = (Paragraph)doc.getFirstSection().getChild(NodeType.PARAGRAPH, 5, true);
// Extract the content between these nodes in the document. Don't include these markers in the extraction.
ArrayList extractedNodes = extractContent(startField, endPara, false);
// Insert the content into a new separate document and save it to disk.
Document dstDoc = generateDocument(doc, extractedNodes);
dstDoc.save(gDataDir + "TestFile.Fields Out.pdf");
The extracted content between the field and paragraph, without the field and paragraph marker nodes, rendered to PDF.
In a document, the content that is defined within a bookmark is encapsulated by the BookmarkStart and BookmarkEnd nodes. Content found between these two nodes makes up the bookmark. You can pass either of these nodes as a marker, even ones from different bookmarks, as long as the starting marker appears before the ending marker in the document.
In our sample document we have one bookmark, named “Bookmark1”. The content of this bookmark is highlighted in our document:
We will extract this content into a new document using the code below. The IsInclusive parameter option shows how to retain or discard the bookmark.
Example
Shows how to extract the content referenced by a bookmark using the ExtractContent method.
[Java]
// Load in the document
Document doc = new Document(gDataDir + "TestFile.doc");
// Retrieve the bookmark from the document.
Bookmark bookmark = doc.getRange().getBookmarks().get("Bookmark1");
// We use the BookmarkStart and BookmarkEnd nodes as markers.
BookmarkStart bookmarkStart = bookmark.getBookmarkStart();
BookmarkEnd bookmarkEnd = bookmark.getBookmarkEnd();
// Firstly extract the content between these nodes including the bookmark.
ArrayList extractedNodesInclusive = extractContent(bookmarkStart, bookmarkEnd, true);
Document dstDoc = generateDocument(doc, extractedNodesInclusive);
dstDoc.save(gDataDir + "TestFile.BookmarkInclusive Out.doc");
// Secondly extract the content between these nodes this time without including the bookmark.
ArrayList extractedNodesExclusive = extractContent(bookmarkStart, bookmarkEnd, false);
dstDoc = generateDocument(doc, extractedNodesExclusive);
dstDoc.save(gDataDir + "TestFile.BookmarkExclusive Out.doc");
The extracted output with the IsInclusive parameter set to true. The copy will retain the bookmark as well.
The extracted output with the IsInclusive parameter set to false. The copy contains the content but without the bookmark.
A comment is made up of the CommentRangeStart, CommentRangeEnd and Comment nodes. All of these nodes are inline. The first two nodes encapsulate the content in the document which is referenced by the comment, as seen in the screenshot below.
The Comment node itself is an InlineStory that can contain paragraphs and runs. It represents the message of the comment as seen as a comment bubble in the review pane. As this node is inline and a descendant of a body you can also extract the content from inside this message as well.
In our document we have one comment. Let’s display it by showing markup in the Review tab:
The comment encapsulates the heading, first paragraph and the table in the second section.
Let’s extract this comment into a new document. The IsInclusive option dictates if the comment itself is kept or discarded. The code to do this is below.
Example
Shows how to extract content referenced by a comment using the ExtractContent method.
[Java]
// Load in the document
Document doc = new Document(gDataDir + "TestFile.doc");
// This is a quick way of getting both comment nodes.
// Your code should have a proper method of retrieving each corresponding start and end node.
CommentRangeStart commentStart = (CommentRangeStart)doc.getChild(NodeType.COMMENT_RANGE_START, 0, true);
CommentRangeEnd commentEnd = (CommentRangeEnd)doc.getChild(NodeType.COMMENT_RANGE_END, 0, true);
// Firstly extract the content between these nodes including the comment as well.
ArrayList extractedNodesInclusive = extractContent(commentStart, commentEnd, true);
Document dstDoc = generateDocument(doc, extractedNodesInclusive);
dstDoc.save(gDataDir + "TestFile.CommentInclusive Out.doc");
// Secondly extract the content between these nodes without the comment.
ArrayList extractedNodesExclusive = extractContent(commentStart, commentEnd, false);
dstDoc = generateDocument(doc, extractedNodesExclusive);
dstDoc.save(gDataDir + "TestFile.CommentExclusive Out.doc");
Firstly the extracted output with the IsInclusive parameter set to true. The copy will contain the comment as well.
Secondly the extracted output with the IsInclusive parameter set to false. The copy contains the content but without the comment.
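The comment in the example above notes that real code should retrieve each corresponding start and end node properly, instead of taking the first node of each type. One possible approach is sketched below on a simplified model. It assumes that every range marker carries the identifier of the comment it belongs to; the Marker class and the pair method are hypothetical and only illustrate the matching logic.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of comment range markers listed in document order. */
final class CommentRangePairingSketch {
    /** A range marker: either the start or the end of the range for the comment with the given id. */
    static final class Marker {
        final boolean isStart;
        final int id;

        Marker(boolean isStart, int id) {
            this.isStart = isStart;
            this.id = id;
        }
    }

    /** Maps each comment id to the positions {startIndex, endIndex} of its two range markers. */
    static Map<Integer, int[]> pair(List<Marker> markers) {
        Map<Integer, Integer> openStarts = new HashMap<>();
        Map<Integer, int[]> pairs = new LinkedHashMap<>();
        for (int i = 0; i < markers.size(); i++) {
            Marker marker = markers.get(i);
            if (marker.isStart) {
                openStarts.put(marker.id, i);
            } else {
                // Match the end marker with the start marker that has the same id, if any.
                Integer startIndex = openStarts.remove(marker.id);
                if (startIndex != null) {
                    pairs.put(marker.id, new int[] {startIndex, i});
                }
            }
        }
        return pairs;
    }
}
```

Matching by identifier keeps the pairs correct when comment ranges are nested or overlap, a case in which the first start marker and the first end marker in the document do not belong to the same comment.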