How to read PDF files using C# .NET
including iText, PDFBox, PDF-Excel, etc.
A summary of some resources available online for programming
in C# to produce software that reads data from files stored in Adobe Portable
Document Format (PDF).
First, what PDF is
- The Adobe PDF Technology Center's PDF Reference webpage links to
the definitive PDF specification, including the PDF Reference and Related
Documentation (over 15MB). Adobe publishes the full specification to
"foster the creation of an ecosystem around the PDF format".
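One detail of the specification is easy to check from C# without any third-party library: a PDF file normally begins with the marker "%PDF-" followed by a version number such as 1.4. The sketch below reads that header. The class and method names are hypothetical, and it assumes the marker sits at the very first byte of the file, which is the usual case but not something every PDF producer guarantees.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of any library listed on this page.
public static class PdfHeader
{
    // Returns the version from a "%PDF-x.y" header (for example "1.4"),
    // or null when the file does not begin with that marker.
    public static string ReadVersion(string path)
    {
        byte[] buffer = new byte[8]; // "%PDF-" plus a three-character version
        int count = 0;

        using (FileStream stream = File.OpenRead(path))
        {
            // Read may return fewer bytes than requested, so loop until
            // the buffer is full or the file ends.
            int n;
            while (count < buffer.Length &&
                   (n = stream.Read(buffer, count, buffer.Length - count)) > 0)
            {
                count += n;
            }
        }

        string header = Encoding.ASCII.GetString(buffer, 0, count);
        if (count < 8 || !header.StartsWith("%PDF-", StringComparison.Ordinal))
        {
            return null;
        }
        return header.Substring(5, 3);
    }
}
```

A caller might write Console.WriteLine(PdfHeader.ReadVersion("sample.pdf") ?? "not a PDF"), where sample.pdf is a placeholder file name.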
C# Resources for reading PDF files
- A PDF Forms Parser by Michael Ganss addresses
the problem of filling data into a pdf form programmatically (for example, with generated content or data read from a database).
He writes, "The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project
where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types
of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary." The
page links to a 22.3KB source code download. (Oct 2004)
- iText is a library that enables you to generate PDF files on the
fly. The documentation says, "the iText classes are very useful for people who need to generate read-only, platform independent
documents containing text, lists, tables and images; or who want to perform specific manipulations on existing PDF documents."
It is written for Java, but a .NET port is available: iTextSharp
(written in C#), packaged as an assembly and downloadable from its page
on SourceForge (Nov 2007).
- PDFBox is a Java library (see sub-bullet for how to use
it in C# .NET) that lets you create new PDF documents, manipulate existing documents and extract content from documents. PDFBox
also includes several command line utilities. Functionality includes PDF to text extraction, merging PDF documents, PDF document
encryption and decryption, Lucene search engine integration, filling in form data (FDF and XFDF), creating a PDF from a text file,
creating images from PDF pages, and printing a PDF. PDFBox can be downloaded from
SourceForge.
- Read from a PDF file using
C# on Lucian's Weblog shows you how to use PDFBox with IKVM in a C# .NET project. IKVM
is an implementation of Java for Mono and the Microsoft .NET Framework. It includes a Java Virtual Machine implemented in .NET,
a .NET implementation of the Java class libraries, and tools that enable Java and .NET interoperability. First download IKVM and
PDFBox. Then, in Visual Studio .NET, add two DLLs to your project: IKVM.GNU.Classpath.dll and PDFBox-0.7.3.dll.
Next, copy FontBox-0.1.0-dev.dll and IKVM.Runtime.dll into your project's bin directory. A good place to start is the
simple three-line C# program for reading text from a PDF file given on Lucian's
Weblog. The comments on that page address many of the common problems and errors that users found (including errors about bcprov-jdk14-132.dll,
the error message "The type initializer for 'java.io.File' threw an exception", TypeLoadException, and issues with tables,
fonts, images, etc.).
- The Open Source PDF Libraries in C# listing includes
iTextSharp, PDFsharp,
Report.NET, SharpPDF,
ASP.NET FO PDF, PDF
Clown and PDFjet Open Source Edition.
- Winnovative Software Solutions produce a number of utilities for sale:
- Winnovative HTML to PDF Converter Library for .NET
- Winnovative PDF Creator Library for .NET
- Winnovative RTF to PDF Library for .NET
- Winnovative PDF Merge Library for .NET
- Winnovative PDF Split Library for .NET
- Winnovative PDF Security Library for .NET
- Winnovative PDF Viewers ASP.NET and Windows Forms
- Winnovative Chart Control for .NET
- Winnovative PDF Tools for .NET
- Winnovative Reporting Tools for .NET
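For the PDFBox with IKVM route described in the list above, text extraction might look like the following sketch. It assumes the IKVM-compiled PDFBox 0.7.3 assemblies named above are in place, which expose the Java packages org.pdfbox.pdmodel and org.pdfbox.util as .NET namespaces. The class name PdfTextReader and the default file name sample.pdf are hypothetical, and this is not a copy of the program on Lucian's Weblog.

```csharp
using System;
using org.pdfbox.pdmodel; // PDDocument, from the IKVM-compiled PDFBox-0.7.3.dll
using org.pdfbox.util;    // PDFTextStripper

// Hypothetical wrapper class for illustration.
public static class PdfTextReader
{
    // Returns the text of every page of the PDF at the given path.
    public static string ExtractText(string path)
    {
        PDDocument doc = PDDocument.load(path);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            doc.close(); // release the underlying file
        }
    }

    public static void Main(string[] args)
    {
        // "sample.pdf" is a placeholder used when no path is supplied.
        string path = args.Length > 0 ? args[0] : "sample.pdf";
        Console.Write(ExtractText(path));
    }
}
```

The method names stay in Java's camelCase (load, getText, close) because IKVM exposes the Java classes as they are. Closing the document in a finally block releases the file even if extraction throws.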
Extracting images from PDF files using C#
- Winnovative
Software Solutions produce PDF
Images Extractor for .NET, a .NET 2.0 library that lets you extract
images from a PDF file in formats such as BMP, PNG and JPEG. It includes
samples of C# code. An evaluation
version can be downloaded and the full
product can be purchased from their website.
- If you want to extract image files using a desktop utility
instead of writing C# code, FileBuzz
features a shareware product called A-PDF
Image Extractor v1.0.0, which can extract image files from a single PDF
file or a batch of PDF files. It can save images in TIFF, JPEG, GIF, BMP,
PNG, TGA, PCX, ICO, JP2 (JPEG 2000) and DCX formats, and supports a variety
of image filters used in PDF files, including LZWDecode, FlateDecode, RunLengthDecode,
CCITTFaxDecode (TIFF), JBIG2Decode (JBig2), DCTDecode (JPEG), and JPXDecode
(JPEG 2000).
Writing data to a PDF file:
Chris
Hornberger wrote on Jul 2 2003, 6:59 am: "Create a Crystal report with the
information you want on it, then simply export it to PDF. The fact that you're
using C#, I assume you're also using VS Studio.NET and hence, have Crystal too.
This will allow you to create your PDF file. Another choice is to spring for
Adobe Pagemill and print to the PDF file format."
The brief article "Microsoft
Visual Studio.NET: Crystal Reports" by Mujtaba Khambatti explains the
benefits of Crystal Reports, designing a report, and using Crystal Reports in
projects you create.
Comprehensive documentation on Crystal Reports for .NET is available
in the Microsoft Developer Network, in the section MSDN
> MSDN Library > Development Tools and Languages > Visual Studio .NET
> Developing with Visual Studio .NET > Designing Distributed Applications
> Crystal Reports.
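The Crystal Reports approach suggested above can also be driven from code rather than from the designer. The sketch below assumes the Crystal Reports assemblies that ship with Visual Studio .NET are referenced (CrystalDecisions.CrystalReports.Engine and CrystalDecisions.Shared) and that a report file has already been designed. The class name ReportPdfExporter and the file names are hypothetical.

```csharp
using System;
using System.IO;
using CrystalDecisions.CrystalReports.Engine; // ReportDocument
using CrystalDecisions.Shared;                // ExportFormatType

// Hypothetical helper class for illustration.
public static class ReportPdfExporter
{
    // Loads an existing .rpt file and writes it out as a PDF.
    public static void ExportToPdf(string reportPath, string pdfPath)
    {
        if (!File.Exists(reportPath))
        {
            throw new FileNotFoundException("Report file not found.", reportPath);
        }

        ReportDocument report = new ReportDocument();
        try
        {
            report.Load(reportPath);
            report.ExportToDisk(ExportFormatType.PortableDocFormat, pdfPath);
        }
        finally
        {
            // Release the report's resources whether or not the export succeeded.
            report.Close();
            report.Dispose();
        }
    }
}
```

A call such as ReportPdfExporter.ExportToPdf("Invoice.rpt", "Invoice.pdf") would then produce the PDF. A report that draws on a database may also need connection details supplied before export, which this sketch leaves out.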
Other Assorted PDF Utilities:
Some free utilities are available for download. Before
writing your own software, check this section; it may save you the trouble of re-inventing
the wheel.
- The RubyPDF
Blog contains an assortment of free utilities for manipulating
pdf files:
- pdf cropper to remove white margins
- BookmarkExtractor to extract all bookmarks from a pdf file
- PDF2PPT converts a pdf to a PowerPoint file
- Pdfrotate rotates every page a chosen multiple of 90°
- pdfselect extracts pages, splits a pdf or reverses it
- PDF N-UP Maker builds an n-up pdf or booklet
- WOA
PDF-Excel is a free utility that lets you convert pdf files to
Excel. It is customisable and lets you extract data from a folder of pdf files into
a single spreadsheet (e.g. a folder of invoices). Produced by Wilkie Office
Automation (2006).
Other useful links:
Some background reading...