The last prerequisite before actually converting our HTML into PDF code is to clean up the HTML. The method I use takes advantage of the XML parser in .NET, but to use that we need well-formed, XHTML-compliant markup. For this exercise, what I am most concerned about is that every HTML tag has a matching closing tag, that the tags are properly nested, and that the tags are all lower case. Some of this we will have to rely on the user to provide, like properly nesting the tags. But some of it we can attempt to clean up in our code. If you know you will have complete control over your HTML, you might be able to skip this step, but I think the code is simple enough that you’ll want to add it anyhow.
In my code, I have a function that accepts the HTML string and returns a collection of IElements that my main code will insert into the PDF. The first thing I do in that function is make sure the string starts with an opening paragraph tag and ends with a closing one. This ensures that I have at least one element to work with when I do the translation.
if (!xhtmlString.ToLower().StartsWith("<p>"))
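A minimal sketch of this step might look like the following. The class and method names are hypothetical, and I am assuming the fragment is wrapped as a whole whenever it does not already begin with a bare `<p>`; a paragraph tag that carries attributes would get wrapped again.

```csharp
using System;

public static class ParagraphWrapSketch
{
    // Hypothetical helper: wraps the fragment in <p>...</p> unless it
    // already starts with an opening paragraph tag (in any letter case).
    public static string EnsureParagraph(string xhtmlString)
    {
        if (!xhtmlString.TrimStart().StartsWith("<p>", StringComparison.OrdinalIgnoreCase))
        {
            xhtmlString = "<p>" + xhtmlString + "</p>";
        }
        return xhtmlString;
    }
}
```

Comparing with `StringComparison.OrdinalIgnoreCase` avoids allocating a lower-cased copy of the whole string just to test its first three characters.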
The next thing I do is remove all of the white space that isn’t the space character, such as tabs, carriage returns, and line feeds.
xhtmlString = xhtmlString
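As a rough sketch, assuming that tab, carriage return, and line feed are the only characters that need removing, the chained replacements could be written like this. The class and method names are made up for illustration.

```csharp
public static class WhitespaceSketch
{
    // Hypothetical helper: removes tabs, carriage returns, and line feeds,
    // leaving ordinary space characters untouched.
    public static string RemoveNonSpaceWhitespace(string xhtmlString)
    {
        return xhtmlString
            .Replace("\r", string.Empty)
            .Replace("\n", string.Empty)
            .Replace("\t", string.Empty);
    }
}
```

Note that removing a line break outright joins two words that were separated only by that line break. If your input can look like that, replacing with a single space instead may be the safer choice.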
Then we want to change our BR tags so they self-close. Since I don’t deal with IMG tags in this code, I don’t bother self-closing those. If you decide to extend this code to handle the IMG tag, you’ll want to add code to fix that as well.
xhtmlString = xhtmlString
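One way to sketch this is with a single case-insensitive regular expression rather than a chain of string replacements. That is my substitution for illustration, and the names are hypothetical. It assumes BR tags carry no attributes.

```csharp
using System.Text.RegularExpressions;

public static class BreakTagSketch
{
    // Hypothetical helper: rewrites <br>, <BR>, <br/> and <br /> as <br />.
    // A BR tag with attributes (for example <br clear="all">) is not matched.
    public static string SelfCloseBreaks(string xhtmlString)
    {
        return Regex.Replace(xhtmlString, @"<br\s*/?>", "<br />", RegexOptions.IgnoreCase);
    }
}
```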
Since my code currently ignores any attributes in the SPAN tag, I then remove the span tag’s attributes.
System.Text.RegularExpressions.Regex re = null;
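A minimal sketch of stripping the attributes, assuming no attribute value contains a `>` character. The pattern and the names are my own illustration.

```csharp
using System.Text.RegularExpressions;

public static class SpanAttributeSketch
{
    // Hypothetical helper: replaces every opening span tag, whatever
    // attributes it carries, with a bare <span>. Closing tags are untouched
    // because they start with "</".
    public static string StripSpanAttributes(string xhtmlString)
    {
        return Regex.Replace(xhtmlString, @"<span\b[^>]*>", "<span>", RegexOptions.IgnoreCase);
    }
}
```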
Then I force all my tags to lower case.
re = new System.Text.RegularExpressions.Regex("</?\\w+"); // matches the "<name" or "</name" part of every tag
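Here is a sketch of how the lower-casing could be done with a match evaluator. The names are hypothetical. The pattern uses a greedy `\w+`, because a lazy `\w+?` at the end of a pattern only ever matches the first letter of the tag name, and the optional `/` lets the same expression cover closing tags.

```csharp
using System.Text.RegularExpressions;

public static class TagCaseSketch
{
    // Hypothetical helper: lower-cases the name of every opening and closing
    // tag. Attribute names and text content are left as they are.
    public static string LowerCaseTags(string xhtmlString)
    {
        return Regex.Replace(xhtmlString, @"</?\w+", m => m.Value.ToLowerInvariant());
    }
}
```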
Because the PDF code will treat each white space character as a character and HTML treats a string of white space characters as one space, I strip out any extra white space characters.
while (xhtmlString.Contains("> "))
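A sketch of the idea, under two assumptions of mine: that runs of spaces should collapse to a single space, and that a space directly after a tag's closing `>` should be dropped, as the loop condition above suggests. The names are hypothetical.

```csharp
public static class SpaceCollapseSketch
{
    // Hypothetical helper: collapses runs of spaces and removes any space
    // that directly follows a tag.
    public static string CollapseSpaces(string xhtmlString)
    {
        // Collapse runs of spaces into a single space.
        while (xhtmlString.Contains("  "))
        {
            xhtmlString = xhtmlString.Replace("  ", " ");
        }

        // Drop a space that directly follows the end of a tag.
        while (xhtmlString.Contains("> "))
        {
            xhtmlString = xhtmlString.Replace("> ", ">");
        }

        return xhtmlString;
    }
}
```

Be aware that the second loop also removes the space in text such as `</b> world`, so the word after an inline tag ends up touching it. Whether that matters depends on how your markup is written.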
And then I escape any special characters that the XML parser would otherwise reject. Right now, I only have to deal with the ampersand character.
xhtmlString = xhtmlString.Replace(" & ", " &amp; ");
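Because the cleaned string is headed for an XML parser, a bare ampersand has to be written as `&amp;`. If your input may already contain some entities, a slightly more defensive sketch is to escape only the ampersands that do not start one. The regular expression and the names below are my own illustration.

```csharp
using System.Text.RegularExpressions;

public static class AmpersandSketch
{
    // Hypothetical helper: escapes every ampersand that is not already the
    // start of a named (&amp;) or numeric (&#169;, &#xA9;) entity.
    public static string EscapeBareAmpersands(string xhtmlString)
    {
        return Regex.Replace(
            xhtmlString,
            @"&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+|#x[0-9a-fA-F]+);)",
            "&amp;");
    }
}
```

Keep in mind that XML itself only predefines `amp`, `lt`, `gt`, `quot`, and `apos`. HTML entities such as `&nbsp;` pass through this helper untouched and would still need separate handling before parsing.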
Lastly, to ensure that my HTML string gets parsed correctly, I attempt to quote all my attributes.
int length = 0;
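The following is one possible sketch of attribute quoting, done with two regular expressions instead of a length-tracking loop. This is a different technique, chosen for brevity, and all names are hypothetical. It assumes attribute values contain no `>` character and that an unquoted value is not immediately followed by the `/` of a self-closing tag.

```csharp
using System.Text.RegularExpressions;

public static class AttributeQuoteSketch
{
    // Matches one opening tag, for example <td width=50 class="x">.
    private static readonly Regex Tag = new Regex(@"<\w+[^>]*>");

    // Matches one attribute. Quoted values are consumed whole so their
    // contents are never rewritten; group 2 is set only for unquoted values.
    private static readonly Regex Attribute = new Regex(
        @"(\s[\w:-]+)\s*=\s*(?:""[^""]*""|'[^']*'|([^\s""'>]+))");

    // Hypothetical helper: wraps every unquoted attribute value in
    // double quotes and leaves already-quoted values alone.
    public static string QuoteAttributes(string xhtmlString)
    {
        return Tag.Replace(xhtmlString, tag =>
            Attribute.Replace(tag.Value, attr =>
                attr.Groups[2].Success
                    ? attr.Groups[1].Value + "=\"" + attr.Groups[2].Value + "\""
                    : attr.Value));
    }
}
```

Restricting the attribute pattern to the inside of tags keeps text content such as `x=y` from being rewritten.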
With the exception of some features I’ve already noted that you might want to add, we’ve done all we can to clean up the HTML. Any other problems are user input errors that will need to be corrected manually. Next, we can parse this HTML and convert it into PDF IElements. This process of cleaning up the HTML would all be a lot easier if HTML Tidy were converted to a managed code library. (Yes, I know you can run it from .NET, but so far it is an external EXE, not managed code.)
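To find out whether a cleaned string is ready for parsing, one option is simply to try loading it. The sketch below assumes `XmlDocument` as the parser and wraps the fragment in a made-up `root` element so that more than one top-level tag is allowed. Both choices, and the names, are assumptions for illustration.

```csharp
using System.Xml;

public static class WellFormedCheckSketch
{
    // Hypothetical helper: returns true when the fragment parses as XML.
    public static bool IsWellFormed(string xhtmlString)
    {
        try
        {
            var document = new XmlDocument();
            // The wrapper element exists only for this check.
            document.LoadXml("<root>" + xhtmlString + "</root>");
            return true;
        }
        catch (XmlException)
        {
            // Unclosed tags, bad nesting, bare ampersands, and unquoted
            // attributes all end up here.
            return false;
        }
    }
}
```

A failed check is a reasonable point to report the input back to the user for manual correction.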




