I've been a zone leader with DZone since 2008, and I'm crazy about community. Every day I get to work with the best that JavaScript, HTML5, Android and iOS have to offer, creating apps that truly make a difference, as principal front-end architect at Avego. James is a DZone Zone Leader and has posted 639 posts at DZone.

Converting PDF to HTML Using PDFBox

04.07.2010

Over the past few days, while working on another project, I needed to convert PDF documents into HTML. I did the usual searches for tools, but as I'm sure you'll have noticed, the tools available don't give great results. Seeing as I'm a software developer, I decided to see if I could program it myself. My requirements were quite simple: get the text out of the document as HTML, and extract the images at the same time.

My first port of call was iText, as it was a library that I was already familiar with. iText is great for creating documents, and I was able to get some text out, but the image extraction wasn't really working out for me. The following is a code snippet that I was using to get the images from the PDFs in iText, based on a post on the iText mailing list. But when I used it, none of the images I generated were right - mostly just the box outlines/borders of the images in the PDF. I presume I was doing something wrong.

// The iText classes used here (PdfReader, PdfObject, PdfStream, PRStream, PdfName)
// live in com.lowagie.text.pdf in iText 2.x and com.itextpdf.text.pdf in iText 5.x.
import java.awt.Graphics2D;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import javax.imageio.ImageIO;

try {
    PdfReader reader = new PdfReader(new FileInputStream(new File("C:\\test.pdf")));

    // Walk every object in the cross-reference table, looking for image streams
    for (int i = 0; i < reader.getXrefSize(); i++) {
        PdfObject pdfobj = reader.getPdfObject(i);
        if (pdfobj == null || !pdfobj.isStream()) {
            continue; // not a stream
        }

        PdfStream stream = (PdfStream) pdfobj;
        PdfObject pdfsubtype = stream.get(PdfName.SUBTYPE);
        if (pdfsubtype == null
                || !pdfsubtype.toString().equals(PdfName.IMAGE.toString())) {
            continue; // not an image stream
        }

        // now you have a PDF stream object with an image
        byte[] img = PdfReader.getStreamBytesRaw((PRStream) stream);
        // but you don't know anything about the image format.
        // you'll have to get info from the stream dictionary
        System.out.println("----img ------");
        System.out.println("height:" + stream.get(PdfName.HEIGHT));
        System.out.println("width:" + stream.get(PdfName.WIDTH));
        int height = Integer.parseInt(stream.get(PdfName.HEIGHT).toString());
        int width = Integer.parseInt(stream.get(PdfName.WIDTH).toString());
        System.out.println("bitspercomponent:" + stream.get(PdfName.BITSPERCOMPONENT));

        // The raw bytes are still in whatever encoding the PDF used for this image.
        // Note that nothing is drawn onto bi before it is written out, which may be
        // part of why the results looked wrong.
        java.awt.Image image = Toolkit.getDefaultToolkit().createImage(img);
        BufferedImage bi = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g2 = bi.createGraphics();
        ImageIO.write(bi, "PNG", new File("C:\\images\\" + i + ".png"));
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} catch (Exception e) {
    e.printStackTrace();
}

As I was low on time, I moved on to PDFBox, which looked like it had already considered my use cases. I got the latest source code from SVN and tried the org.apache.pdfbox.ExtractText class first. It lets you specify a -html flag instead of using the default text output. I ran into an exception straight away. After some debugging I found that what I had downloaded was missing the resources/glyphlist.txt file. I found a copy on the Adobe site, and with that in place the utility ran.

One other thing to note while using these utilities is that you'll need to have ICU4J, iText and the Apache Commons Logging libraries on your build path. 
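If one of those libraries is missing, the failure tends to show up late, as a ClassNotFoundException or NoClassDefFoundError in the middle of a run. A quick check up front can save some debugging. The following is only a sketch: the ClasspathCheck class is hypothetical, and the three class names it probes are my assumption of a representative class from each library (the iText one is the 2.x package name), so adjust them to the versions you actually use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of PDFBox: reports which libraries are on the classpath. */
final class ClasspathCheck {

    /** Returns true if the named class can be located without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes; change these to match your library versions.
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("ICU4J", "com.ibm.icu.text.Bidi");
        probes.put("iText", "com.lowagie.text.pdf.PdfReader");
        probes.put("Commons Logging", "org.apache.commons.logging.LogFactory");

        for (Map.Entry<String, String> probe : probes.entrySet()) {
            String status = isPresent(probe.getValue()) ? "found" : "MISSING";
            System.out.println(probe.getKey() + ": " + status);
        }
    }
}
```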

The good news was that the utility got all the text out and put it into an HTML format. The generated HTML wasn't that pretty, though: each line it read was terminated with a <br/>. Admittedly, that's an easy thing to change.
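One way to tidy up the line-per-<br/> output is to group lines into paragraphs before writing the HTML. The sketch below is not PDFBox code. It assumes the extracted text is available as a list of lines and that a blank line marks a paragraph break, which may not hold for every PDF. The ParagraphHtml class and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: groups text lines into <p> blocks. */
final class ParagraphHtml {

    /** Escapes the characters that are significant in HTML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Joins consecutive non-blank lines into one paragraph; blank lines separate paragraphs. */
    static String toParagraphs(List<String> lines) {
        StringBuilder html = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                flush(current, html);
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(escape(trimmed));
            }
        }
        flush(current, html);
        return html.toString();
    }

    /** Emits the pending paragraph, if any, and resets the buffer. */
    private static void flush(StringBuilder current, StringBuilder html) {
        if (current.length() > 0) {
            html.append("<p>").append(current).append("</p>\n");
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("First line of a", "paragraph.", "", "Second paragraph.");
        System.out.print(toParagraphs(lines));
    }
}
```

Lines within a paragraph are joined with a single space, so the browser decides where to wrap instead of inheriting the PDF's line breaks.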

Moving on to image extraction, I tried out org.apache.pdfbox.ExtractImages. This class worked perfectly, saving all the images in the PDF as JPEG files. I did make one alteration to PDXObjectImage.write2file so that the images were written to a particular folder.
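The folder change amounts to resolving each output file name against a target directory. This is not the actual modification to PDXObjectImage.write2file, just a standalone sketch of the idea. The ImageOutput class, its target method and the prefix-index naming scheme are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of PDFBox: decides where an extracted image is written. */
final class ImageOutput {

    /** Returns folder/prefix-index.extension, creating the folder if it does not exist yet. */
    static Path target(Path folder, String prefix, int index, String extension) {
        try {
            Files.createDirectories(folder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return folder.resolve(prefix + "-" + index + "." + extension);
    }

    public static void main(String[] args) throws IOException {
        // Use a temporary directory so the demo leaves the working directory untouched.
        Path folder = Files.createTempDirectory("pdf-images").resolve("images");
        System.out.println(target(folder, "page", 1, "jpg"));
    }
}
```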

The PDFBox utilities really impressed me, as I wasn't sure if it was possible to get this information out of the PDF so easily. All the pieces are there for one single utility that would generate better HTML for you along with the images. As far as I know, no solution exists to do all of this in Java (if I'm wrong, please let me know in the comments section). Have any readers tried to do this using iText, PDFBox or any other Java library?
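To give a rough idea of what such a combined utility might produce, here is a sketch that takes text paragraphs and image file names that have already been extracted and writes them into one HTML page. It is not a PDFBox API, and the HtmlPage class is a hypothetical name. Since it assumes no position information is available, the images are simply appended after the text rather than placed where they appeared in the PDF.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not a PDFBox API: combines extracted text and image files into one page. */
final class HtmlPage {

    /** Escapes characters that are significant in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds a minimal HTML document: all paragraphs first, then one <img> per image file. */
    static String build(String title, List<String> paragraphs, List<String> imageFiles) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>").append(escape(title)).append("</title></head><body>\n");
        for (String paragraph : paragraphs) {
            html.append("<p>").append(escape(paragraph)).append("</p>\n");
        }
        for (String imageFile : imageFiles) {
            html.append("<img src=\"").append(escape(imageFile)).append("\" alt=\"\"/>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        String page = build("Converted document",
                Arrays.asList("First paragraph.", "Second paragraph."),
                Arrays.asList("images/page-1.jpg"));
        System.out.print(page);
    }
}
```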


Comments

Joe Gerew replied on Wed, 2010/04/07 - 8:21am

I worked on a project a few years ago that took PDF Acroforms and transformed them into HTML forms using iText. It wasn't a native feature of iText but using the properties from the Acroform fields I was able to generate the HTML as well as the correct order and rows/columns. We were planning on going forward with image extraction as well but ended up not doing that. It sounds like that would have been kind of a nightmare from your post.

Abel Birya replied on Tue, 2011/07/12 - 1:59pm

Hey, I was having a look at your post and I have to say it is quite interesting. I have a pet project where I may want to generate HTML from PDF files. Is there an example that you have done with PDFBox where you have extracted both the HTML text and the images? I have been trying out PDFBox but so far I have been getting the text without any images. The styling also seems to leave quite a bit to be desired. Anyway, I am more interested in getting the text first and working on the rest later. Kindly assist.

Martin Chris replied on Thu, 2012/03/29 - 1:04pm

Hi all,

I have reached upto the stage where i have collected all the data.

i.e.

1.Extracted the text from the pdf.

2.extracted the images

Now the remaining thing is .

a.To retain the formating in the newly converted html page same as that of pdf page.

b. To embed the images into the newly converted html page in the appropriate places as that of pdf.

c. Applying color scheme to html page.

For this purpose i have explored much.    

Any help would be appreciated.

Martin Chris replied on Wed, 2012/04/18 - 4:41am in response to: Martin Chris

Is there any update from you guys.

 

Thanks in advance

Zia Rahman replied on Thu, 2013/01/03 - 2:59am

Hi James Sugrue,

I'm very much pleased to tell you that your link is very much helpful.

It would be more helpful for me if you send the code for converting pdf to Html file of you with an example.


Thanks in Advance

Olivier Duchêne replied on Thu, 2013/01/10 - 4:52am in response to: Zia Rahman

Hi James,

Your code for iText is nearly correct. Just replace the last lines with these ones and all is working well :

PdfImageObject image = new PdfImageObject((PRStream) stream);
BufferedImage bi = image.getBufferedImage();
if(bi == null) continue;
ImageIO.write(bi, "PNG", new File("D:\\Documents\\images"+ i + ".png"));

Cheers, 

 Olivier

Gunasilan Muniandy replied on Sun, 2014/01/05 - 9:38pm

Can you share your code to convert pdf text to html.

Thanks
