Moritz Schepp

A PDF Forms workflow … the open way

Creating PDF forms, shipping them to your users and collecting the data in an automated way … all with open tools and formats.

I was recently looking for a way to create PDF forms. This seemed like an “enterprise” endeavor, so I first looked at commercial options. It turns out that LibreOffice Writer handles this very comfortably: once the PDF file is created, it can be shipped to users, who fill it out. Once the forms are returned, tools such as pdftk can retrieve the data automatically.

Creating the PDF file

The only tool needed here is LibreOffice Writer itself. Go ahead and open it. In order to add fields, buttons and so on, you will need to enable the Form Controls toolbar. Find it in the menu at View > Toolbars > Form Controls.

Now add a text box: click the Text Box button and then drag on the document to create it. Then double-click the box to open the properties dialog for that control.

There are three tabs there and a lot of configuration options, but make sure you do at least two things for each control you add to your form:

  • On the General tab, for Font, select something that you expect all of your form users to have installed … like Arial or Times New Roman.
  • On the Data tab under Data Field, give the field a unique name within that PDF form. This ensures that data entered by users can be cleanly retrieved.

When you are satisfied with your form, it's time to export it to an actual PDF. From the menu, select File > Export as PDF... In the dialog, check Create PDF form on the General tab and select FDF as the Submit format. You can now export the file and distribute the resulting PDF to your users.
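Before distributing the form, it can be worth checking that every field really did get a unique, non-empty name. The following is a minimal sketch, not part of the original workflow: it assumes the exported form has been run through pdftk's dump_data_fields_utf8 operation (described in the next section) and that each field is reported on a line starting with "FieldName:". The helper name problem_field_names is made up for illustration.

```ruby
# Hypothetical helper: inspects the text printed by
#   pdftk myform.pdf dump_data_fields_utf8
# and reports field names that are duplicated or empty.
def problem_field_names(dump_text)
  # Collect the value of every "FieldName:" line (other lines yield nil and are dropped)
  names = dump_text.lines.filter_map do |line|
    line.chomp[/\AFieldName:\s*(.*)\z/, 1]
  end

  {
    duplicates: names.tally.select { |_, count| count > 1 }.keys,
    empty: names.count(&:empty?)
  }
end

# Possible usage, assuming pdftk is installed and on the PATH:
#   report = problem_field_names(`pdftk myform.pdf dump_data_fields_utf8`)
#   warn "Duplicate field names: #{report[:duplicates].join(', ')}" if report[:duplicates].any?
```

If the report lists duplicates, renaming the affected controls in Writer and exporting again should be enough.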

Users can fill out the form and save the PDF file together with the values they entered. Once they have sent it back to you, you can mass-process the results.

Extracting FDF data from filled out PDF forms

Of course, you can just open the filled-out PDF files with a reader and use the values in whatever way you see fit. But if you handed out the forms to a lot of people and received a lot of data that way, a more automated process is in order. Please note that some scripting skills are required to make full use of the procedure.

This involves the excellent PDFtk command line utility. After installing it, first make sure that all the PDF files are together in one directory. Then go there with a command prompt and issue the following command to read the form field data:

C:\Users\jdoe>cd Desktop\my_filled_in_pdfs ↵
C:\Users\jdoe\Desktop\my_filled_in_pdfs>pdftk myform.pdf dump_data_fields_utf8 ↵
---
FieldType: Text
FieldName: Name
FieldFlags: 0
FieldValue: John Doe
FieldJustification: Left
---
FieldType: Text
FieldName: Phone
FieldFlags: 0
FieldValue: +33-12345678
FieldJustification: Left
---
FieldType: Text
FieldName: Mail
FieldFlags: 0
FieldValue: jdoe@example.com
FieldJustification: Left

As you can see, pdftk prints the form’s field data as plain text. With a simple script you can iterate over a large batch of files and collect all the data, and this plain-text format is easy to process in any scripting language. Here is an example written in Ruby:

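require 'open3'

# Directory with the filled-in PDFs (defaults to the current directory)
base_dir = ARGV[0] || '.'
# Dir[] treats backslashes as escapes, so use forward slashes on Windows
base_dir = base_dir.gsub(File::ALT_SEPARATOR, File::SEPARATOR) if File::ALT_SEPARATOR

data = []

Dir[File.join(base_dir, '*.pdf')].each do |pdf_file|
  # Pass the arguments as a list so file names need no shell quoting
  raw, status = Open3.capture2('pdftk', pdf_file, 'dump_data_fields_utf8')
  next unless status.success?

  pdf_data = {}
  # Every field record starts with a line containing only "---"
  raw.split(/^---\r?\n/).each do |field|
    field_data = {}
    field.split(/\r?\n/).each do |row|
      # Split at the first ": " only, so values may contain colons
      k, v = row.split(': ', 2)
      field_data[k] = v
    end
    name = field_data['FieldName']
    pdf_data[name] = field_data['FieldValue'] if name
  end
  data << pdf_data
end

abort "No readable PDF forms found in #{base_dir}" if data.empty?

# Header row from the first form, then one row per form in the same column order
headers = data.first.keys
puts headers.join(';')

data.each do |record|
  puts record.values_at(*headers).join(';')
end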

That outputs semicolon-separated, CSV-style text which you can import into a spreadsheet. Call the script like this to write its output to a file:

ruby sample.rb C:\Users\jdoe\Desktop\my_filled_in_pdfs > data.csv
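Joining values with a plain semicolon works as long as no value contains a semicolon, a quote or a line break, and as long as every form reports the same fields. As a hedged sketch of one way around both limits, the following uses Ruby's csv library (a bundled gem in recent Ruby versions) to quote values where needed and builds the header from the union of all field names. The helper name records_to_csv and the second sample record are made up for illustration.

```ruby
require 'csv'

# Hypothetical helper: turns an array of { field name => value } hashes
# (one hash per PDF form) into semicolon-separated CSV text.
def records_to_csv(records, col_sep: ';')
  # Union of all field names, in order of first appearance
  headers = records.flat_map(&:keys).uniq

  CSV.generate(col_sep: col_sep) do |csv|
    csv << headers
    # values_at keeps the column order fixed; missing fields become empty cells
    records.each { |record| csv << record.values_at(*headers) }
  end
end

# Illustrative input only: the second record is invented sample data
records = [
  { 'Name' => 'John Doe', 'Phone' => '+33-12345678', 'Mail' => 'jdoe@example.com' },
  { 'Name' => 'Doe; Jane', 'Mail' => 'jane@example.com' }
]

puts records_to_csv(records)
```

The value containing a semicolon is wrapped in double quotes in the output, which spreadsheet importers generally read back as a single cell.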

Note: Although the commands shown are taken from a Windows 7 Enterprise command prompt, they also work on other platforms such as Linux and Mac OS X. The platform just needs to support PDFtk.