How to read pdf files using C# .NET

including iText, PDFBox, PDF-Excel, etc

A summary of online resources for writing C# software that reads data from files stored in Adobe Portable Document Format (pdf).

Step-by-step instructions and sample C# code are at the bottom of the page.

Firstly, what pdf is

The Adobe PDF Technology Center's PDF Reference webpage links to the definitive pdf specification, including the PDF Reference and Related Documentation (over 15MB). Adobe publishes the full specification to "foster the creation of an ecosystem around the PDF format".

C# Resources for reading PDF files

A PDF Forms Parser by Michael Ganss addresses the problem of filling data into a pdf form programmatically (for example, with generated content or data read from a database). He writes, "The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary." The page links to a 22.3KB source code download. (Oct 2004)
iText is a library that enables you to generate PDF files on the fly. The documentation says, "the iText classes are very useful for people who need to generate read-only, platform independent documents containing text, lists, tables and images; or who want to perform specific manipulations on existing PDF documents." It is written for use in Java systems but there is a .NET port available: iTextSharp (written in C#), implemented as an assembly and downloadable from this page on SourceForge (Nov 2007) – see iTextSharp code example below.
- Linda Fahmy explains how to create Arabic pdf files (May 2007)
- iTextSharp Tutorial Codes(C#) gives 100 examples that teach you how to use iTextSharp (on RubyPDF blog)
PDFBox is a Java library (see sub-bullet for how to use it in C# .NET) that lets you create new PDF documents, manipulate existing documents and extract content from documents. PDFBox also includes several command line utilities. Functionality includes: PDF to text extraction; Merge PDF Documents; PDF Document Encryption/Decryption; Lucene Search Engine Integration; Fill in form data FDF and XFDF; Create a PDF from a text file; Create images from PDF pages; Print a PDF. PDFBox can be downloaded from SourceForge.
- Read from a PDF file using C# on Lucian's Weblog shows you how to use PDFBox with IKVM in a C# .NET project. IKVM is an implementation of Java for Mono and the Microsoft .NET Framework; it includes a Java Virtual Machine implemented in .NET, a .NET implementation of the Java class libraries, and tools that enable Java and .NET interoperability. First download IKVM and PDFBox. Then, in Visual Studio .NET, add two dlls to your project: IKVM.GNU.Classpath.dll and PDFBox-0.7.3.dll. Next, copy FontBox-0.1.0-dev.dll and IKVM.Runtime.dll into your project's bin directory. A good place to start is the simple three-line C# program on Lucian's Weblog that reads text from a pdf file. The comments on that page address many of the common problems and errors that users found (including errors about bcprov-jdk14-132.dll, the error message "The type initializer for 'java.io.File' threw an exception", TypeLoadException, and issues with tables, fonts, images, etc).
Open source PDF libraries in C# include iTextSharp, PDFsharp, Report.NET, SharpPDF, ASP.NET fo PDF, PDF Clown and PDFjet Open Source Edition.
Winnovative Software Solutions produce a number of utilities for sale:
- Winnovative HTML to PDF Converter Library for .NET
- Winnovative PDF Creator Library for .NET
- Winnovative RTF to PDF Library for .NET
- Winnovative PDF Merge Library for .NET
- Winnovative PDF Split Library for .NET
- Winnovative PDF Security Library for .NET
- Winnovative PDF Viewers ASP.NET and Windows Forms
- Winnovative Chart Control for .NET
- Winnovative PDF Tools for .NET
- Winnovative Reporting Tools for .NET
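The PDFBox-with-IKVM route described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from Lucian's Weblog: it assumes the IKVM-compiled PDFBox 0.7.3 assemblies are referenced as described, that the classes live in the org.pdfbox namespaces used by that release (later Apache releases moved to org.apache.pdfbox), and "sample.pdf" is a placeholder file name.

```csharp
using System;
using org.pdfbox.pdmodel;   // PDDocument, from the IKVM-compiled PDFBox-0.7.3.dll
using org.pdfbox.util;      // PDFTextStripper

class PdfBoxTextDemo
{
    static void Main(string[] args)
    {
        // Placeholder path (an assumption); pass a real file as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.pdf";

        // Load the document, strip its text, and always close it afterwards.
        PDDocument doc = PDDocument.load(path);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            Console.WriteLine(stripper.getText(doc));
        }
        finally
        {
            doc.close();
        }
    }
}
```

The method names keep their Java casing (load, getText, close) because IKVM exposes the Java classes unchanged.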

Extracting images from PDF files using C#

Winnovative Software Solutions produce PDF Images Extractor for .NET, a .NET 2.0 library that lets you extract images from a PDF file in formats such as bmp, png, jpeg, etc. It includes samples of C# code. An evaluation version can be downloaded and the full product can be purchased from their website.
If you want to extract image files using a desktop utility instead of writing C# code, FileBuzz feature a shareware product called A-PDF Image Extractor v1.0.0, which can extract image files from a single PDF file or a batch of PDF files. It can save images in TIFF, JPEG, GIF, BMP, PNG, TGA, PCX, ICO, JP2 (JPEG 2000) and DCX format, and supports a variety of image filters used in PDF files including LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode (TIFF), JBIG2Decode (JBig2), DCTDecode (JPEG), and JPXDecode (JPEG 2000).

Writing data to a pdf file:

Chris Hornberger wrote on Jul 2 2003, 6:59 am: "Create a Crystal report with the information you want on it, then simply export it to PDF. The fact that you're using C#, I assume you're also using VS Studio.NET and hence, have Crystal too. This will allow you to create your PDF file. Another choice is to spring for Adobe Pagemill and print to the PDF file format."

The brief article "Microsoft Visual Studio.NET: Crystal Reports" by Mujtaba Khambatti explains the benefits of Crystal Reports, designing a report, and using Crystal Reports in projects you create.

There is comprehensive documentation on .NET Crystal Reports in the Microsoft Developer Network in the section MSDN > MSDN Library > Development Tools and Languages > Visual Studio .NET > Developing with Visual Studio .NET > Designing Distributed Applications > Crystal Reports
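The Crystal Reports approach quoted above might look something like this in code. It is only a sketch: it assumes the Crystal Reports for .NET assemblies (CrystalDecisions.CrystalReports.Engine and CrystalDecisions.Shared) are referenced, and the file names Invoice.rpt and Invoice.pdf are placeholders, not anything from the quoted post.

```csharp
using System;
using CrystalDecisions.CrystalReports.Engine;
using CrystalDecisions.Shared;

class CrystalPdfExport
{
    static void Main(string[] args)
    {
        // Placeholder file names (assumptions for illustration only).
        string reportPath = "Invoice.rpt";
        string pdfPath = "Invoice.pdf";

        ReportDocument report = new ReportDocument();
        try
        {
            // Load a report designed in Visual Studio, then export it as PDF.
            report.Load(reportPath);
            report.ExportToDisk(ExportFormatType.PortableDocFormat, pdfPath);
        }
        finally
        {
            report.Close();
            report.Dispose();
        }
    }
}
```

A report that reads from a database would also need its data source or logon details set before exporting; that part is omitted here.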

Other Assorted PDF Utilities:

Some free utilities are available for download. Instead of writing your own software, one of these may save you the trouble of re-inventing the wheel.

The RubyPDF Blog contains an assortment of free utilities for manipulating pdf files:
- pdf cropper to remove white margins
- BookmarkExtractor to extract all bookmarks from a pdf file
- PDF2PPT converts a pdf to a PowerPoint file
- Pdfrotate rotates every page a chosen multiple of 90°
- pdfselect extracts pages, splits a pdf or reverses it
- PDF N-UP Maker builds a n-up pdf or booklet
WOA PDF-Excel is a free utility that lets you convert pdf files into Excel; customisable, it lets you extract data from a folder of pdf files into a single spreadsheet (eg a folder of invoices, etc). Produced by Wilkie Office Automation (2006)

A worked example – C# code to read a pdf document's properties:

Here are step-by-step instructions for using C# and Visual Studio to read the properties of a pdf file using iTextSharp:

(1) Download the most recent version of iTextSharp from http://sourceforge.net/projects/itextsharp/ and unzip the file
(2) Create a new project in Visual Studio (the screenshots are from VS2005, and I created a console application). Now add a reference to the iTextSharp dll: in the Solution Explorer, right-click the project name and select Add Reference…
(3) Click the Browse tab
(4) … and now navigate to the folder into which you unzipped the iTextSharp.dll file, click it, then click the OK button
(5) You will now see itextsharp listed in the solution explorer under References.

Now you can use any of the techniques illustrated in over 200 tutorial files that you also unzipped from the download. Here's a simple bit of code that reads in the properties of a pdf file that's on the web (chosen at random) at:
http://www.chinehamchat.com/Chineham_Chat_Advertisements.pdf

(note: the sample pdf file may have changed since I ran my program)

using System;
using System.Collections.Generic;
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;

namespace PdfProperties
{
    class Program
    {
        static void Main(string[] args)
        {
            // create a reader (constructor overloaded for path to local file or URL)
            PdfReader reader
                = new PdfReader("http://www.chinehamchat.com/Chineham_Chat_Advertisements.pdf");

            // total number of pages
            int n = reader.NumberOfPages;

            // size of the first page
            Rectangle psize = reader.GetPageSize(1);
            float width = psize.Width;
            float height = psize.Height;
            Console.WriteLine("Size of page 1 of {0} => {1} × {2}", n, width, height);

            // file properties (assumes an iTextSharp release in which
            // Info is a Dictionary<string, string>)
            Dictionary<string, string> infodict = reader.Info;
            foreach (KeyValuePair<string, string> kvp in infodict)
                Console.WriteLine(kvp.Key + " => " + kvp.Value);

            // release the reader's resources
            reader.Close();
        }
    }
}

from which the output (eventually – you need to give time for the pdf to download) is:

Size of page 1 of 24 => 421 × 595
ModDate => D:20120122082532Z
CreationDate => D:20101117141712Z
Title => Chineham Chat Advertisement Supplement
Creator => PScript5.dll Version 5.2.2
Author => Chineham Chat Magazine
Keywords => Chineham Chat, Magazine, Basingstoke, Advertisements
Subject => Adverts from the Chineham Chat magazine, distributed free to all households in Chineham, Basingstoke, Hampshire, UK
Producer => Acrobat Distiller 4.05 for Windows
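
The ModDate and CreationDate values are PDF date strings: "D:" followed by YYYYMMDDHHmmSS and an optional time zone ("Z" for UTC, or an offset such as +01'00'). As a minimal sketch, assuming only the 14-digit form with an optional trailing "Z" as seen in this output, they could be converted to a DateTime like this. The PdfDates class and ParsePdfDate method are hypothetical names of my own, not part of iTextSharp.

```csharp
using System;
using System.Globalization;

static class PdfDates
{
    // Hypothetical helper: converts a PDF date string such as
    // "D:20120122082532Z" to a UTC DateTime. Only the 14-digit form with
    // an optional trailing "Z" is handled; strings carrying an explicit
    // offset (e.g. +01'00') will throw a FormatException.
    public static DateTime ParsePdfDate(string s)
    {
        if (s == null) throw new ArgumentNullException("s");
        if (s.StartsWith("D:", StringComparison.Ordinal)) s = s.Substring(2);
        if (s.EndsWith("Z", StringComparison.Ordinal)) s = s.Substring(0, s.Length - 1);
        return DateTime.ParseExact(s, "yyyyMMddHHmmss",
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);
    }
}

class PdfDateDemo
{
    static void Main()
    {
        DateTime modified = PdfDates.ParsePdfDate("D:20120122082532Z");
        // prints 2012-01-22 08:25:32
        Console.WriteLine(modified.ToString("yyyy-MM-dd HH:mm:ss",
            CultureInfo.InvariantCulture));
    }
}
```

In the worked example this helper would be applied to the values stored under the "ModDate" and "CreationDate" keys of the properties dictionary.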

While you're typing in the code, you'll notice that when you type "reader." IntelliSense offers a long list of methods and properties, which is evidence of the breadth of functionality in this library.
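
As one example of that wider functionality, the same reader can be used to pull the text out of each page. This is a rough sketch rather than part of my original program: it assumes an iTextSharp 5.x release that includes the iTextSharp.text.pdf.parser namespace, and "sample.pdf" is a placeholder file name.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class PdfTextDemo
{
    static void Main(string[] args)
    {
        // Placeholder path (an assumption); pass a real file as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.pdf";

        PdfReader reader = new PdfReader(path);
        try
        {
            // Pages are numbered from 1, as in GetPageSize(1).
            StringBuilder text = new StringBuilder();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            Console.WriteLine(text.ToString());
        }
        finally
        {
            reader.Close();
        }
    }
}
```

How well the extracted text matches what you see on screen depends on how the pdf was produced; scanned pages that contain only images yield no text this way.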

I cannot guarantee any of the quoted, linked information - it was taken on trust from the linked websites - September 2008

The example program code at the bottom is my own work, and I can vouch for that. It was produced in 2010.

Noticed an error, dead link or omission? Please email me (send to webmaster at the domain name jadn.co.uk)
