Dave's Notebook

iTextSharp – HTML to PDF – Finishing Up

In the last post I mentioned there were a few topics we need to close up today.  The two topics we’ve left undone are popping the attribute information off the stack when we hit a closing element and dealing with the gap that normally appears between paragraph elements.

The first thing you’ll want to do when you hit a closing element is to retrieve its name again, just like we did for the opening element.  Once you have that, you can pop the attribute information off the stack(s).

You’ll also want to undo any indentation that you applied during the opening element.
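As a rough illustration of the push-on-open, pop-on-close pattern described above, here is a minimal sketch that uses only the .NET XML classes. The class name `ClosingTagSketch` and the single `tagStack` are hypothetical stand-ins for my attribute stacks, not the code from my project.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ClosingTagSketch
{
    // Records a "push" for every opening element that has content
    // and a "pop" for every matching closing element.
    public static List<string> Trace(string xhtml)
    {
        Stack<string> tagStack = new Stack<string>(); // hypothetical attribute stack
        List<string> trace = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Empty elements such as <br /> never get an EndElement,
                        // so nothing is pushed for them.
                        if (!reader.IsEmptyElement)
                        {
                            tagStack.Push(reader.Name);
                            trace.Add("push " + reader.Name);
                        }
                        break;
                    case XmlNodeType.EndElement:
                        // Retrieve the name again, then undo the push.
                        string tagName = reader.Name;
                        if (tagStack.Count > 0 && tagStack.Peek() == tagName)
                        {
                            tagStack.Pop();
                            trace.Add("pop " + tagName);
                        }
                        break;
                }
            }
        }
        return trace;
    }
}
```

Calling `ClosingTagSketch.Trace("<p><b>x</b><br /></p>")` yields push p, push b, pop b, pop p.  Any indentation you added on the way in would be subtracted at the same point where the pop happens.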

To handle the paragraph break, I defined a _crlfAtEnd attribute in my resource file.  If it was defined as true, I added an extra line feed at the end to account for the gap.

    isBlock = Resources.html2pdf
        .ResourceManager
        .GetString(tagName + "_isBlock");
    if (isBlock != null &&
        isBlock.ToLower() == "true")
    {
        // isBlock is reused here to hold the _crlfAtEnd setting
        isBlock = Resources.html2pdf
            .ResourceManager
            .GetString(tagName + "_crlfAtEnd");
        if (isBlock != null &&
            isBlock.ToLower() == "true")
        {
            et = stack.Peek();
            Font f = getCurrentFont();
            if (et is Phrase)
            {
                // the extra line feed accounts for the paragraph gap
                ((Phrase)(et)).Add(
                    new Chunk("\n", f));
                stack.Pop();
            }
        }
        // start a fresh paragraph for whatever follows the closed block
        p = new Paragraph();
        ((Paragraph)p).Add("");
        ((Paragraph)p).SetLeading(m_leading, 1);
        list.Add(p);
        stack.Push(p);
    }

One problem I’ve had with this in the past is that this cr/lf gets added at the end even if the block is the last block.  I really need to find some way to detect that this is the last block, whether it is nested or in the outermost block.  But I’ll leave that enhancement for you.

VB.NET Processing Before WinForm Display

I woke up this morning to an interesting question.

“Using VB.net 2008, I want my project to be a Windows Forms Application, but upon startup, I want to check a few files to see if they exist and if they don’t I do not want the startup form to load. I just want the program to quit. If you have to start this type of application with a form, how do you keep the form from displaying?”

If you program in CSharp, you probably already know the answer to this question, or at least you should.  If you don’t, you will when we finish here.  So since I consider this a VB.NET-specific question, I’m going to answer it using VB.NET syntax.

When you create a Windows Forms application in CSharp, Visual Studio writes out the following code in Program.cs (in VS 2008; earlier versions put this in the main form).

    [STAThread]
    static void Main()
    {
        Application.EnableVisualStyles();
        Application.
            SetCompatibleTextRenderingDefault(false);
        Application.Run(new Form1());
    }

In VB.NET there is no code that looks like this, because VB.NET writes the code for us behind the scenes.

So to do what you want to do, we need to take over control of the Windows Form application.

Since I’m assuming that you already have the Windows Form application created, the next thing you’ll want to do is to create a module.  You can name it whatever you want, but I’m going to name mine “Main” for purposes of this article.

In your module, create a Sub called “main” that has the code CSharp would have given us.

    Public Sub main()
        Application.EnableVisualStyles()
        Application.SetCompatibleTextRenderingDefault(False)
        Application.Run(New Form1())
    End Sub

Now go to your project properties and go to the Application tab.

Find the check box that says, “Enable Application Framework” and un-check it.

Then change the startup object to “Sub Main”.

At this point, your application should run as it always has.  To put the checks in that you requested, write that code prior to all the Application… statements that we put in sub main and put an if/then/end if statement around the Application… statements.

    Public Sub main()
        Dim ChecksWereOk As Boolean = False
        ' your checks here
        If ChecksWereOk Then
            Application.EnableVisualStyles()
            Application. _
                SetCompatibleTextRenderingDefault(False)
            Application.Run(New Form1())
        End If
    End Sub

And that should do the trick for you.

iTextSharp – HTML to PDF – Writing the PDF

Last week we parsed the HTML and created code that keeps track of the various attributes we are going to need when we create the PDF.  Today we will finish the code and create the Elements that we can include in our PDF document.

One consideration we will need to keep in mind as we write out the PDF is that we have pushed various font characteristics that may overlap onto our stack.

To get the current font, we will need to combine the attributes.  For example, HTML that looks like this:

    <p><u><i><b>this should be bold</b></i></u></p>

should render as bold, italic, underlined text.  But we only pushed one element at a time, so if all we look at is the last element we pushed onto the stack, all we will get is a bold font.

To help with this, I created a helper method that does all the work of determining the correct current font and returning that to the caller.

The first part of the method does the bulk of the work.

    string[] fontArray = fontCharacteristicStack.ToArray();
    int fontIndex = 0;
    fontNormalBoldItalic nbi = 0;
    for (; fontIndex < fontArray.Length; fontIndex++)
    {
        switch (fontArray[fontIndex].ToLower())
        {
            case "bold":
                nbi |= fontNormalBoldItalic.Bold;
                break;
            case "italic":
                nbi |= fontNormalBoldItalic.Italic;
                break;
            case "bolditalic":
            case "italicbold":
                nbi |= fontNormalBoldItalic.BoldItalic;
                break;
            case "underline":
                nbi |= fontNormalBoldItalic.Underline;
                break;
            case "boldunderline":
            case "underlinebold":
                nbi |= fontNormalBoldItalic.UnderlineBold;
                break;
            case "italicunderline":
            case "underlineitalic":
                nbi |= fontNormalBoldItalic.UnderlineItalic;
                break;
            case "underlinebolditalic":
            case "underlineitalicbold":
            case "boldunderlineitalic":
            case "bolditalicunderline":
            case "italicunderlinebold":
            case "italicboldunderline":
                nbi |= fontNormalBoldItalic.UnderlineBoldItalic;
                break;
        }
    }

The fontNormalBoldItalic thing is an enumeration that I used to allow me to merge the font characteristics.
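I haven’t shown the enumeration itself, so here is a minimal sketch of how a flags enumeration like that could be laid out.  The names `FontStyleFlags` and `FontMergeSketch` and the bit values are assumptions for illustration, and the merge uses substring checks rather than the switch statement above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for fontNormalBoldItalic: one bit per characteristic,
// with the combined names defined as unions of those bits.
[Flags]
public enum FontStyleFlags
{
    Normal = 0,
    Bold = 1,
    Italic = 2,
    Underline = 4,
    BoldItalic = Bold | Italic,
    UnderlineBold = Underline | Bold,
    UnderlineItalic = Underline | Italic,
    UnderlineBoldItalic = Underline | Bold | Italic
}

public static class FontMergeSketch
{
    // ORs together every characteristic found on the stack so that
    // nested <u><i><b> tags all contribute to the final style.
    public static FontStyleFlags Merge(IEnumerable<string> characteristics)
    {
        FontStyleFlags merged = FontStyleFlags.Normal;
        foreach (string characteristic in characteristics)
        {
            string lower = characteristic.ToLowerInvariant();
            if (lower.Contains("bold")) merged |= FontStyleFlags.Bold;
            if (lower.Contains("italic")) merged |= FontStyleFlags.Italic;
            if (lower.Contains("underline")) merged |= FontStyleFlags.Underline;
        }
        return merged;
    }
}
```

With this layout, merging "bold", "italic", and "underline" one at a time gives `FontStyleFlags.UnderlineBoldItalic`, which is the behavior the `|=` operator relies on.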

The second half gets the remainder of the font information which can be determined from the current element and applies the characteristics we determined above into the font.

    Font font = FontFactory.getFont(currentFontName);
    string s = FontFactory.TIMES;
    switch (nbi)
    {
        case fontNormalBoldItalic.Bold:
            font.setStyle(Font.BOLD);
            break;
        case fontNormalBoldItalic.Italic:
            font.setStyle(Font.ITALIC);
            break;
        case fontNormalBoldItalic.BoldItalic:
            font.setStyle(Font.BOLDITALIC);
            break;
        case fontNormalBoldItalic.Underline:
            font.setStyle(Font.UNDERLINE);
            break;
        case fontNormalBoldItalic.UnderlineBold:
            font.setStyle(Font.UNDERLINE | Font.BOLD);
            break;
        case fontNormalBoldItalic.UnderlineItalic:
            font.setStyle(Font.UNDERLINE | Font.ITALIC);
            break;
        case fontNormalBoldItalic.UnderlineBoldItalic:
            font.setStyle(Font.UNDERLINE | Font.BOLDITALIC);
            break;
    }
    font.setSize(currentFontSize);
    if (currentFontColor.StartsWith("#"))
        font.setColor(
            System.Convert.ToInt32(currentFontColor.Substring(1, 2), 16),
            System.Convert.ToInt32(currentFontColor.Substring(3, 2), 16),
            System.Convert.ToInt32(currentFontColor.Substring(5, 2), 16));
    else
        font.setColor(System.Drawing.Color.FromName(currentFontColor));
    return font;
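The #RRGGBB handling at the end of that method can be pulled out on its own.  This is a minimal sketch of the same substring-and-base-16 conversion; the helper name `ParseHexColor` and the length check are my additions for illustration.

```csharp
using System;

public static class ColorParseSketch
{
    // Splits "#RRGGBB" into red, green, and blue components (0-255 each).
    public static int[] ParseHexColor(string color)
    {
        if (color == null || color.Length != 7 || color[0] != '#')
            throw new FormatException("Expected a color in the form #RRGGBB.");

        int red = Convert.ToInt32(color.Substring(1, 2), 16);
        int green = Convert.ToInt32(color.Substring(3, 2), 16);
        int blue = Convert.ToInt32(color.Substring(5, 2), 16);
        return new int[] { red, green, blue };
    }
}
```

For example, `ParseHexColor("#FF8000")` returns 255, 128, 0.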

This is all called from our case statement when the element is text.

    case XmlNodeType.Text:
        et = stack.Peek();
        Font font = getCurrentFont();
        if (et is Phrase)
            ((Phrase)(et)).Add(
                new Chunk(reader.Value.
                    Replace(" &amp; ", " & ").
                    Replace("&nbsp;", " "), font));
        break;

You’ll notice that I’ve also added code at this point that translates the ampersand and the non-breaking space so they render correctly in the PDF document.

Next time we address this topic we will try to close this all up with popping the attributes off the stack and dealing with the gaps between block elements that should (or should not) appear.

iTextSharp – HTML to PDF – Parsing HTML

Now that we have the HTML cleaned up, the next thing we will want to do is to parse the HTML.

In my actual code for this, I parse the HTML and create the PDF at the same time, but for the purposes of these posts, I’m going to deal primarily with parsing the HTML here and then deal with the PDF creation code later.

The key to parsing the HTML is that it is in XHTML form.  This allows us to use the XML APIs that are built into .NET.  For the purposes of parsing the HTML so that we can convert it to PDF code, we need to use the XmlTextReader.

Every time you Read() an XmlTextReader object, you will be on a beginning tag, an ending tag, text, or white space.  So the core of our loop looks something like this:

    XmlTextReader reader =
        new XmlTextReader(xhtmlString,
            XmlNodeType.Element, null);
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                // appropriate code
                break;
            case XmlNodeType.EndElement:
                // appropriate code
                break;
            case XmlNodeType.Text:
                // appropriate code
                break;
            case XmlNodeType.Whitespace:
                // appropriate code
                break;
        }
    }

where xhtmlString is the cleaned up HTML code from last week.
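To see what the reader actually hands back, here is a small runnable sketch of that loop that just records each node it visits.  The class and method names (`ReaderLoopSketch`, `Describe`) are made up for this example.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class ReaderLoopSketch
{
    // Returns one entry per node the reader stops on.
    public static List<string> Describe(string xhtmlString)
    {
        List<string> seen = new List<string>();
        // Parse the string as an element fragment, as in the loop above.
        XmlTextReader reader =
            new XmlTextReader(xhtmlString, XmlNodeType.Element, null);
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    seen.Add("Element:" + reader.Name);
                    break;
                case XmlNodeType.EndElement:
                    seen.Add("EndElement:" + reader.Name);
                    break;
                case XmlNodeType.Text:
                    seen.Add("Text:" + reader.Value);
                    break;
                case XmlNodeType.Whitespace:
                    seen.Add("Whitespace");
                    break;
            }
        }
        reader.Close();
        return seen;
    }
}
```

Feeding it `<p>Hello <b>world</b><br /></p>` produces seven entries.  The `br` shows up once as an Element and never as an EndElement, which matters when we start pushing and popping.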

The core part of the translation is dependent on the fact that we have matching open and closing tags and that each time we hit an open tag, we can determine what the characteristics of that tag are: bold, underline, font, font size, etc.

So each time we hit the open tag, we will look up the characteristics.  For simplicity, I put this information in a resource file so that I could just look it up using code that looks something like this:

    fontName = Resources.html2pdf.ResourceManager
        .GetString(tagName + "_fontName");

rather than having another long case statement in my code.
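If you don’t want to set up a resource file just to try this out, a dictionary can stand in for it.  This is only a sketch: the keys and values below are hypothetical examples, not the contents of my resource file.

```csharp
using System.Collections.Generic;

public static class TagSettingsSketch
{
    // Hypothetical entries; the real values live in the resource file.
    private static readonly Dictionary<string, string> Settings =
        new Dictionary<string, string>
        {
            { "h1_fontSize", "18" },
            { "h1_fontCharacteristics", "bold" },
            { "p_isBlock", "true" }
        };

    // Builds the key as tagName + "_" + suffix and returns null when there is
    // no entry, the same way a missing resource string comes back as null.
    public static string Lookup(string tagName, string suffix)
    {
        string value;
        return Settings.TryGetValue(tagName + "_" + suffix, out value)
            ? value
            : null;
    }
}
```

`Lookup("h1", "fontSize")` returns "18", while `Lookup("span", "fontSize")` returns null, so the null checks in the parsing code work unchanged.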

Once we have the information we want from the resource file, we place the current characteristics on a stack.  I created a different stack for each characteristic, but in hindsight, it might have been better to create a structure with the information and use one stack of that structure type.

Here’s the code that does that:

    if (!reader.IsEmptyElement)
    {
        fontName = Resources.html2pdf.
            ResourceManager.
            GetString(tagName + "_fontName");
        if (fontName != null)
            currentFontName = fontName;
        fontSize = Resources.html2pdf.
            ResourceManager.
            GetString(tagName + "_fontSize");
        if (fontSize != null)
            currentFontSize = System.
                Convert.ToSingle(fontSize);
        fontColor = Resources.html2pdf.
            ResourceManager.
            GetString(tagName + "_fontColor");
        if (fontColor != null)
            currentFontColor = fontColor;
        fontCharacteristics = Resources.html2pdf.
            ResourceManager.
            GetString(tagName + "_fontCharacteristics");
        if (fontCharacteristics != null)
            currentFontCharacteristics =
                fontCharacteristics;
    }

Note that we only push the attributes of the element onto the stack if the element is not an empty element.  This is because the closing node type will never be triggered on an element that has no content inside of it (self-closed BR and IMG tags, for example).
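The single-stack idea I mentioned in hindsight could look something like the following.  Treat it as a sketch under my own assumptions: `TagStyle` and `StyleStack` are hypothetical names, and a null argument means "inherit from the enclosing tag".

```csharp
using System.Collections.Generic;

// Hypothetical structure holding everything we track per open tag.
public struct TagStyle
{
    public string FontName;
    public float FontSize;
    public string FontColor;
    public string FontCharacteristics;
}

public class StyleStack
{
    private readonly Stack<TagStyle> stack = new Stack<TagStyle>();

    public StyleStack(TagStyle defaults)
    {
        stack.Push(defaults);
    }

    // The style in effect for the innermost open tag.
    public TagStyle Current
    {
        get { return stack.Peek(); }
    }

    // Opening tag: copy the current style, override only what the tag
    // specifies, and push the result.
    public void Open(string fontName, float? fontSize,
                     string fontColor, string fontCharacteristics)
    {
        TagStyle style = stack.Peek(); // structs copy by value
        if (fontName != null) style.FontName = fontName;
        if (fontSize.HasValue) style.FontSize = fontSize.Value;
        if (fontColor != null) style.FontColor = fontColor;
        if (fontCharacteristics != null) style.FontCharacteristics = fontCharacteristics;
        stack.Push(style);
    }

    // Closing tag: drop back to the enclosing style, keeping the defaults.
    public void Close()
    {
        if (stack.Count > 1) stack.Pop();
    }
}
```

One push and one pop per tag keeps all four characteristics in step, which is harder to get wrong than four separate stacks.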

The final thing you’ll need to keep track of is whether the element is a block element (P, DIV, etc.), an inline tag (SPAN, A, etc.), or a list (OL, UL, LI), and how much indentation is needed (primarily for lists).

Frankly, the code for this was not fun to write.  Keep in mind too that there is nothing in here to handle special font characteristic attributes.  So your DIV tags can’t specify what font they should use or even how wide the font should be.  Not because it can’t be done, but because I have not had the need.

Here’s that code:

    strIndent = Resources.html2pdf.ResourceManager.GetString(tagName + "_indent");
    isBlock = Resources.html2pdf.ResourceManager.GetString(tagName + "_isBlock");
    string isList = Resources.html2pdf.ResourceManager.GetString(tagName + "_isList");
    if (isBlock != null && isBlock.ToLower() == "true")
    {
        string strIsList =
            Resources.html2pdf.ResourceManager
                .GetString(tagName + "_isULList");

        if (strIsList != null &&
            strIsList.ToLower() == "true")
        {
            // unordered list
            p = new List(false,
                System.Convert.ToSingle(strIndent));
        }
        else
        {
            // key now carries the same underscore as the other lookups
            strIsList = Resources
                .html2pdf.ResourceManager
                .GetString(tagName + "_isOLList");
            if (strIsList != null && strIsList.ToLower() == "true")
            {
                // ordered (numbered) list
                p = new List(true,
                    System.Convert.ToSingle(strIndent));
            }
            else
            {
                if (isList != null && isList.ToLower() == "true")
                {
                    p = new iTextSharp.text.ListItem();
                }
                else
                {
                    p = new Paragraph();
                    ((Paragraph)p)
                        .SetLeading(m_leading, 1);
                    if (stack.Count != 0)
                    {
                        IElement e = stack.Pop();
                    }
                }
            }
        }
        // list items go inside the most recent list; everything else
        // is added to the output directly
        if (isList != null && isList.ToLower() == "true")
            ((iTextSharp.text.List)
                (list[list.Count - 1])).Add(p);
        else
            list.Add(p);
        stack.Push(p);
    }

You’ll notice that there is a bit of code in here that deals with a p variable.  This code is needed so that if we are dealing with a block tag, we have a paragraph or list item to put the other content inside of the block when we hit it.  If we are dealing with an inline tag, we deal with that when we add the text.

Next week, we will show how to handle text and closing tags.

iTextSharp – HTML to PDF – Cleaning HTML

The last prerequisite step prior to actually converting our HTML into PDF code is to clean up the HTML. The method I use takes advantage of the XML parser in .NET, but in order to use that we have to have XHTML-compliant markup. For this exercise, what I am most concerned about is that the HTML tags all have matching closing tags, that the tags are nested in a hierarchical structure, and that the tags are all lower case.

Some of this we will have to rely on the user to provide, like properly nesting the tags.  But some of this we can attempt to clean up in our code.  If you know you will have complete control over your HTML, you might be able to skip this step.  But I think the code is simple enough that you’ll want to add it anyhow.

In my code, I have a function that accepts the HTML string and returns a collection of IElements that my main code will insert into the PDF.  The first thing I do in that function is make sure the code starts and ends with an open and close paragraph tag.  This is to ensure that I have at least one element that I can work with when I do the translation.

    if (!xhtmlString.ToLower().StartsWith("<p>"))
        xhtmlString = "<p>" + xhtmlString + "</p>";

The next thing I do is make sure that all of the white space that isn’t the space character is removed from the code.

    xhtmlString = xhtmlString
        .Replace("\r", string.Empty)
        .Replace("\n", string.Empty)
        .Replace("\t", string.Empty);

Then we want to change our BR tags to auto close.  Since I don’t deal with IMG tags in this code I don’t bother auto closing those tags.  If you decide to embellish this code to use the IMG tag, you’ll want to add code to fix that as well.

    xhtmlString = xhtmlString
        .Replace("<BR>", "<br />")
        .Replace("<br>", "<br />");

Since my code currently ignores any attributes in the SPAN tag, I then remove the span tag’s attributes.

    System.Text.RegularExpressions.Regex re = null;
    System.Text.RegularExpressions.Match match = null;

    re = new System.Text.RegularExpressions.Regex("<span.*?>");
    match = re.Match(xhtmlString);
    while (match.Success)
    {
        foreach (System.Text.RegularExpressions.Capture c in match.Captures)
            // keep the bare tag so the closing </span> still has a match
            xhtmlString = xhtmlString.Replace(c.Value, "<span>");
        match = match.NextMatch();
    }
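The same span cleanup can be written as a single Regex.Replace call.  This is a minimal sketch rather than my production code; the method name is made up, and it assumes no attribute value contains a `>` character.

```csharp
using System.Text.RegularExpressions;

public static class SpanCleanSketch
{
    // Replaces any opening span tag that carries attributes with a bare <span>.
    public static string StripSpanAttributes(string html)
    {
        // "<span", at least one white space character, then everything up to ">"
        return Regex.Replace(html, @"<span\s[^>]*>", "<span>",
            RegexOptions.IgnoreCase);
    }
}
```

So `<span style="color:red">x</span>` becomes `<span>x</span>`, and a span that has no attributes is left alone.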

Then I force all my tags to lower case:

    // greedy \w+ so the whole tag name is matched, not just its first character
    re = new System.Text.RegularExpressions.Regex("<\\w+");
    match = re.Match(xhtmlString);

    while (match.Success)
    {
        foreach (System.Text.RegularExpressions.Capture c in match.Captures)
            xhtmlString = xhtmlString.Replace(c.Value, c.Value.ToLower());
        match = match.NextMatch();
    }

    re = new System.Text.RegularExpressions.Regex("</\\w+?>");
    match = re.Match(xhtmlString);
    while (match.Success)
    {
        foreach (System.Text.RegularExpressions.Capture c in match.Captures)
            xhtmlString = xhtmlString.Replace(c.Value, c.Value.ToLower());
        match = match.NextMatch();
    }
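If you prefer to do both passes at once, a match evaluator handles opening and closing tags in one call.  Again, this is a sketch with a hypothetical method name, and it only lower-cases tag names, not attribute names.

```csharp
using System.Text.RegularExpressions;

public static class TagCaseSketch
{
    // Lower-cases the name of every opening and closing tag.
    public static string LowerCaseTagNames(string html)
    {
        // "<" or "</" followed by the tag name
        return Regex.Replace(html, @"</?\w+",
            m => m.Value.ToLowerInvariant());
    }
}
```

`<DIV><B>Hi</B></DIV>` comes back as `<div><b>Hi</b></div>`, with the text between the tags untouched.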

Because the PDF code will treat each white space character as a character and HTML treats a string of white space characters as one space, I strip out any extra white space characters.  

    while (xhtmlString.Contains("> "))
        xhtmlString = xhtmlString.Replace("> ", ">");
    // two spaces collapse to one; repeat until no double spaces remain
    while (xhtmlString.Contains("  "))
        xhtmlString = xhtmlString.Replace("  ", " ");
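The two loops can also be expressed as two regular expression replacements.  This sketch (the method name is hypothetical) does the same thing in the same order: spaces after a `>` go first, then runs of spaces.

```csharp
using System.Text.RegularExpressions;

public static class SpaceCleanSketch
{
    public static string CollapseSpaces(string html)
    {
        // Drop any spaces that directly follow a tag's closing ">"
        string result = Regex.Replace(html, "> +", ">");
        // Collapse every remaining run of two or more spaces to one
        return Regex.Replace(result, " {2,}", " ");
    }
}
```

Be aware that, like the loops above, this also removes the single space after an inline closing tag, so `<b>a</b> b` becomes `<b>a</b>b`.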

And then I escape any special characters as their XML entity equivalent.  Right now, I only have to deal with the ampersand character.

    xhtmlString = xhtmlString.Replace(" & ", " &amp; ");
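That replacement only catches an ampersand with a space on each side.  If your input has ampersands in other positions, a negative lookahead is one way to escape the bare ones without double-escaping existing entities.  This is a sketch I have not used in the project; the method name is hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class AmpersandSketch
{
    // Escapes "&" unless it already starts a named or numeric entity.
    public static string EscapeBareAmpersands(string html)
    {
        return Regex.Replace(html,
            @"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)",
            "&amp;");
    }
}
```

With this, `Tom & Jerry &amp; R&D &#169;` becomes `Tom &amp; Jerry &amp; R&amp;D &#169;`.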

Lastly, in order to ensure that my html string gets parsed correctly, I attempt to quote all my attributes:

    int length = 0;
    // repeat until a pass adds no more quotes (the length stops changing)
    while (length != xhtmlString.Length)
    {
        length = xhtmlString.Length;
        xhtmlString = System.Text
            .RegularExpressions.Regex.Replace(xhtmlString,
                "(<.+?\\s+\\w+=)([^\"']\\S*?)([\\s>])", "$1\"$2\"$3");
    }

With the exception of some features I’ve already noted that you might want to add, we’ve done all we can to clean up the code.  Any other problems are user input errors that will need to be corrected manually. Next, we can parse this HTML and convert it into PDF IElements.

This process of cleaning up the HTML would all be a lot easier if HTML Tidy were converted to a managed code library.  (Yes, I know you can run it from .NET, but so far it is an external EXE, not managed code.)

Manually Adding Event Handlers in VB.NET

Typically when we write our code, the event handlers get wired up for us using the handles clause.  So we never have to worry about wiring up our event handlers manually.

But what about the case where we want to dynamically add a control to our Windows Form or our ASP.NET page?  For example, add a button.  How would you respond to the button click event?

In CSharp, there is no handles clause, so figuring out how to manually wire up the event handler is simply a matter of inspecting the generated .NET code and doing a copy/paste/modify operation in the editor.

In VB.NET, the syntax for adding event handlers manually is not that difficult:

    AddHandler m_button.Click, AddressOf buttonClickMethod

If you’ve written any threading code, you’ll notice that this looks similar to the code you might have written for that.

The AddHandler statement takes two parameters.  The first is the event we are going to handle–in this case, the click event from the object that m_button is pointing to.

The second parameter is a pointer to a function that will handle the event.  What is unique about this is that it can be a method that is part of the current class, which is what the code above is referencing, or it can be a method in another object, or even a method that is shared in another class.

To reference a method in another object:

    AddHandler m_button.Click, _
        AddressOf SomeOtherObject.buttonClickMethod

To reference a shared method:

    AddHandler m_button.Click, _
        AddressOf SomeClass.buttonClickMethod

This gives us quite a bit of flexibility when we dynamically wire up our events.

iTextSharp – HTML to PDF - Prerequisites

Before we get into the nitty gritty of parsing the HTML so that we can create PDF code from it, it is important that we develop the concept of how text layout works in iTextSharp.  So today we will cover those basics.

The first type of element we want to deal with when we parse our HTML into a PDF is the Paragraph element.

When we get to actually parsing our HTML to PDF code we will use the Paragraph object for all of our block elements.  This allows us to add other Paragraphs and Chunks into it which we can format.

A Chunk is our second object that we will be using.  The Chunk is the main object that will allow us to format the font.  In fact, even if our block element specifies some sort of specific font, the font doesn’t actually get applied in the code until we add the text.

Typical code to place text into a PDF document would look something like this:

    p = new Paragraph(new Chunk("text that needs a font",
        FontFactory.GetFont("Arial", 10, Font.NORMAL, Color.BLACK)));
    p.Alignment = (Element.ALIGN_CENTER);
    ct.AddElement(p);

where “ct” is an object of type ColumnText that we discussed last week.

The only two other classes we need to discuss are the list classes.  We use the List class for both the OL and UL tags, and the ListItem class for the individual items within the list.  The first parameter of the List constructor, numbered, selects which of the two types of list we are dealing with: true for a numbered list, false for a bulleted one.

I have not yet added the ability to handle tables to my HTML parser mainly because I have not had the need.  I think once I show you how to create tables and how to parse HTML you should be able to handle adding table parsing code yourself.

iTextSharp – HTML to PDF – Positioning Text

The next series of things I’m going to introduce about using iTextSharp are all going to lead toward taking HTML text and placing it on the PDF document.

There are several items we need to cover before we even get to the part about converting the text from HTML to PDF text.  The first is placing the text on the document where it is supposed to be.

Once again, we are building on previous articles about using iTextSharp.  So if you are just jumping in, you might want to go take a look at the other articles.  You can find a list at the bottom of this post.

To place a block of text on the page that is going to have multiple formats in it (bold, underline, etc.) I use the ColumnText class.  This allows me to specify the rectangle or, if I want, some irregular shape, to place the text in.  I handle determining where this rectangle is on the page in the same way that I determine where an image should go.  I have the designer place a form field on the page and then I use that to get my coordinates.

    float[] fieldPosition =
        fields.GetFieldPositions("fieldNameInThePDF");
    left = fieldPosition[1];
    right = fieldPosition[3];
    top = fieldPosition[4];
    bottom = fieldPosition[2];
    if (rotation == 90)
    {
        left = fieldPosition[2];
        right = fieldPosition[4];
        top = pageSize.Right - fieldPosition[1];
        bottom = pageSize.Right - fieldPosition[3];
    }
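The index juggling is easier to follow when it is wrapped in a helper.  The sketch below assumes, as the code above does, that the array holds the page number at index 0 followed by the left, bottom, right, and top coordinates of the field; `ColumnRect` and `FromFieldPosition` are hypothetical names, and `pageRight` stands in for `pageSize.Right`.

```csharp
using System;

// Hypothetical holder for the rectangle handed to the column.
public struct ColumnRect
{
    public float Left;
    public float Bottom;
    public float Right;
    public float Top;
}

public static class FieldRectSketch
{
    public static ColumnRect FromFieldPosition(
        float[] fieldPosition, int rotation, float pageRight)
    {
        if (fieldPosition == null || fieldPosition.Length < 5)
            throw new ArgumentException("Expected page, left, bottom, right, top.");

        ColumnRect rect = new ColumnRect();
        if (rotation == 90)
        {
            // On a rotated page the x and y values swap roles and the
            // vertical ones are measured back from the page's right edge.
            rect.Left = fieldPosition[2];
            rect.Right = fieldPosition[4];
            rect.Top = pageRight - fieldPosition[1];
            rect.Bottom = pageRight - fieldPosition[3];
        }
        else
        {
            rect.Left = fieldPosition[1];
            rect.Bottom = fieldPosition[2];
            rect.Right = fieldPosition[3];
            rect.Top = fieldPosition[4];
        }
        return rect;
    }
}
```

For a field at (100, 200) to (300, 250) on an unrotated page, the rectangle comes back unchanged; with a rotation of 90 and a page edge of 612, the same field maps to left 200, right 250, top 512, bottom 312.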

Once I have the position, the next thing I need to do is to create my ColumnText object.  This requires the same ContentByte object that we used for the images.

    PdfContentByte over = stamp.GetOverContent(1);
    ColumnText ct = new ColumnText(over);

And now I can set the rectangle to print into.

    ct.SetSimpleColumn(left, bottom, right, top,
        15, Element.ALIGN_LEFT);

The 15 represents the leading you want (the vertical distance between lines of text). You may need to adjust that number.

Once you have your rectangle, you can add paragraphs to it.  Paragraphs are composed of smaller units called chunks that can be formatted.  If you want a paragraph that is all formatted the same you can make a call that looks like this:

    Paragraph p = new Paragraph(
        new Chunk("Some Text here",
            FontFactory.GetFont(
                "Arial", 14, Font.BOLD, Color.RED)));

and then add the paragraph to your rectangle:

    ct.AddElement(p);