PDF Automation Testing

In automation testing we have to cover many different user flows. Most user actions can be covered with Selenium or a similar automation tool. Some scenarios are harder, though, such as a user flow that ends with document generation. For example, if the system generates a PDF file at the end of the flow, Selenium alone cannot verify the content of that file. In such cases we need to combine our main tool with a third-party library.

In the example above, once the file is generated we need to compare it with a base file that represents the expected outcome, and check whether the two differ. At this level we need to check fonts, colors, margins, other styles, content, images and other attachments. This is hard to do even manually, because small differences in color contrast or margins are easy to miss with the naked eye. So it is better to use an automated tool for this.

Several third-party libraries and approaches make this possible. Some of them are:

Apache PDFBox (https://pdfbox.apache.org/) : free, but can only be used to verify text. Images, graphics and styles cannot be verified
PDFUnit (http://www.pdfunit.com/) : free, and can be used to verify text, images and graphics. But it is not available in a central Maven repository, so it has to be installed locally
PdfCompare (https://github.com/red6/pdfcompare) : not a product, but a solution built on Apache PDFBox. It renders the PDF files into bitmap images and compares the two images pixel by pixel
Java difference library (https://releases.groupdocs.com/comparison/java/) : free, and can be used to compare images, graphics and styles too. (But its online tool was not able to identify the difference between the files I used for testing)
Aspose (https://products.aspose.com/) : license required, and can be used to compare images, graphics and styles too. (Its online tool identified the difference between the files I used for testing)
Convert the PDF to HTML and do assertions with Selenium in the normal way : this will be really hard if you have a big file with a large number of pages

pdfcompare library

https://github.com/red6/pdfcompare

This library renders each PDF file to an image and then compares the two images pixel by pixel to identify mismatches between the files. So for PDF testing we first do a manual round of testing to produce a correct PDF file, then include it in the automated test case (TC) as the expected outcome. Every time the TC runs, it generates a new file and compares it with the stored expected file using the mechanism described above. You can add the dependency to your Java project with the following Maven snippet:

<dependencies>
  <dependency>
    <groupId>de.redsix</groupId>
    <artifactId>pdfcompare</artifactId>
    <version>...</version> <!-- see the current version on Maven Central -->
  </dependency>
</dependencies>

Then the comparison itself is a single line of code:

new PdfComparator("expected.pdf", "actual.pdf").compare().writeTo("/path/diffOutput");

The code above creates a new file at the specified path with the differences marked. If you only want to check the result, you can leave out the writeTo() call. The comparison then runs and returns the result without creating a new file.

final CompareResult result = new PdfComparator("expected.pdf", "actual.pdf").compare();
if (result.isNotEqual()) {
    System.out.println("Differences found!");
}
if (result.isEqual()) {
    System.out.println("No Differences found!");
}
if (result.hasDifferenceInExclusion()) {
    System.out.println("Differences in excluded areas found!");
}
result.getDifferences(); // returns the page areas where differences were found
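To make the idea of a reported difference area more concrete, here is a minimal sketch that uses only the Java standard library. It is not pdfcompare's own implementation. The names `DiffAreaSketch` and `boundingBoxOfDifferences` are hypothetical, and the sketch assumes both pages have already been rendered to images of the same size.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.Optional;

/** Hypothetical helper: locates the region in which two rendered pages differ. */
final class DiffAreaSketch {

    /**
     * Returns the smallest rectangle that encloses every differing pixel,
     * or an empty Optional when the two images are identical.
     */
    static Optional<Rectangle> boundingBoxOfDifferences(BufferedImage expected, BufferedImage actual) {
        if (expected.getWidth() != actual.getWidth() || expected.getHeight() != actual.getHeight()) {
            throw new IllegalArgumentException("Images must have the same dimensions");
        }
        int minX = Integer.MAX_VALUE;
        int minY = Integer.MAX_VALUE;
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                // Compare the packed ARGB value of each pixel pair.
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return Optional.empty(); // no differing pixel was found
        }
        return Optional.of(new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1));
    }
}
```

A single bounding box is the simplest possible report. A real tool may report several areas per page, which is more helpful when the differences are far apart.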

Some parts of a PDF file may be generated dynamically at TC run time, such as dates, times or unique IDs. These parts differ on every run, so they need to be excluded from the comparison. The pdfcompare library supports that too. Since the library renders the PDF to an image and compares pixel by pixel, we can pass the areas of a page that should be skipped. In the example below, the first call ignores a rectangular area on page 1 and the second ignores the whole of page 2.

new PdfComparator("expected.pdf", "actual.pdf")
	.withIgnore(new PageArea(1, 230, 350, 450, 420))
	.withIgnore(new PageArea(2))
	.compare();
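The effect of an ignored area on a pixel-by-pixel comparison can be sketched with the Java standard library alone. This is an illustration under stated assumptions, not how pdfcompare is implemented. `PixelDiffSketch` and `countDifferences` are hypothetical names, and the ignored areas are plain `java.awt.Rectangle` values (x, y, width, height) rather than the library's `PageArea` type.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.List;

/** Hypothetical helper: pixel-by-pixel comparison that skips ignored rectangles. */
final class PixelDiffSketch {

    /** Counts the pixels that differ between two same-sized images, skipping ignored areas. */
    static int countDifferences(BufferedImage expected, BufferedImage actual, List<Rectangle> ignored) {
        if (expected.getWidth() != actual.getWidth() || expected.getHeight() != actual.getHeight()) {
            throw new IllegalArgumentException("Images must have the same dimensions");
        }
        int differences = 0;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                if (isIgnored(x, y, ignored)) {
                    continue; // dynamic content such as a date or ID lives here
                }
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    differences++;
                }
            }
        }
        return differences;
    }

    /** Returns true when the pixel lies inside any of the ignored rectangles. */
    private static boolean isIgnored(int x, int y, List<Rectangle> ignored) {
        for (Rectangle area : ignored) {
            if (area.contains(x, y)) {
                return true;
            }
        }
        return false;
    }
}
```

With an empty list every pixel is compared. Adding a rectangle that covers a timestamp, for example, means changes inside it no longer count as differences.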

Below are the sample files I used to test this library.

File One (the base file, used as the expected outcome) : fileone

File Two (let's say this was the file generated by the TC run, the actual outcome) : filetwo

File Three (the diff file generated by the pdfcompare library) : pdfcompareresults

If you open the diff file above, you will see the differences marked in different colors, so it is easy to spot all of them at a glance. I hope this is useful in your day-to-day testing. Now you know how to automate a PDF file comparison easily. Thank you.