Bruce Lawson’s personal site

Making and generating accessible PDFs


Joe Clark wrote a canonical list of document types that may be delivered as PDF. There is a simple way to ensure that these documents are as accessible as can be: start from a structured Microsoft Office document.

From Word to accessible PDF

Most organisations require certain branding on documents – logos, copyright notices, fonts, whatever. Most content producers in an organisation understand the need for consistency, but hate having to do it: remembering what point size headings should be, and so on. So I always advocate the distribution of a Word document template that has all the rules defined in it:

The great thing about this template is that giving it to staff means that they can double click on it, and be presented with a new, blank document, with all the logos and branding pre-defined; all they need to do is choose the right heading from the Word formatting toolbar and the styles are applied automatically. (There’s more on making good source documents at WebAIM)

What it means for you is that when you use Adobe Acrobat Professional on a Windows machine to make your PDF, all of this structural information (the level of heading, the fact that you’re in a list) gets passed to the resulting PDF document. Unfortunately, most Assistive Technologies can’t access this structural information. (Even Adobe’s own “Read Out Loud” facility can’t use it, although it’s rumoured that a JAWS beta can navigate headings in PDF.)

PDFs that shouldn’t be PDFs

As well as the categories of PDFs that Joe lists as being “good” PDFs, there are two further categories that I’d add for purely pragmatic reasons, not because they should really be PDF, but because in the insane world of bureaucracy, they have to be. Think of these not as categories, but crapegories.

The first crapegory is the massive amount of shite documents that people have “put on the web” by turning them into PDF format, because it’s easy to do and because somebody, somewhere might just want to read the Chief Exec’s 1998 Christmas message again. It’s indicative of the sheer rubbishness of Content Management Systems that it’s almost impossible for a non-technical user to publish an accessible html document, but a piece of cake to upload a PDF via a CMS and “put it on the Web”. This category is the “dark matter” of the Web; a huge weight of uselessness that no-one ever sees and which interacts with nothing else.

The second crapegory of “bad” PDFs is made up of documents that are often published in the format for one or both of these reasons:

I used to mock these reasons (being Mr Web Standards smartarse purist, and all). But I’ve come to realise that they’re very good reasons for wanting a non-html format. The first makes sense because websites traditionally haven’t printed well; they usually won’t fit properly on a portrait-oriented page, and are full of extraneous crap (advertising, navigation etc).

Now, modern web designers will deliver a print stylesheet, but the majority of users don’t know that. (As an aside, wouldn’t it be jolly if browsers showed a large print button on the toolbar if they found a print stylesheet on a site? That way, users would be alerted to the fact that you can print this webpage, because the designer was a professional).
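For what it’s worth, spotting a print stylesheet is not hard work for a program. Here is a minimal sketch in Python, assuming the stylesheet is declared with a link element whose media attribute mentions print. The names PrintStylesheetFinder and has_print_stylesheet are made up for illustration, and the sketch ignores print rules tucked away inside other stylesheets.

```python
from html.parser import HTMLParser


class PrintStylesheetFinder(HTMLParser):
    """Records whether a page links to a stylesheet meant for print."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Only <link> elements can declare an external stylesheet.
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel = attributes.get("rel", "").lower().split()
        # The media attribute is a comma-separated list, e.g. "screen, print".
        media = [m.strip() for m in attributes.get("media", "").lower().split(",")]
        if "stylesheet" in rel and "print" in media:
            self.found = True


def has_print_stylesheet(html_text):
    """Return True if the page declares a stylesheet with media="print"."""
    finder = PrintStylesheetFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```

A browser (or a bookmarklet, or a CMS checker) could run a test like this and decide whether to show that large print button.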

Similarly, wanting to save a webpage is common, but browsers don’t have a nice “floppy disc” save icon that people are used to from other desktop applications. Even if you know how to save a webpage, it generally gets saved in two lumps: as a .html and a folder of subfiles, which makes it difficult to attach to an email. (Microsoft’s “bagged-up” .mht format is a notable exception, but it requires that you know to choose “File” then “Save as”, then “Web Archive, single file” in Internet Explorer. People don’t know that.)

In such a case, the most accessible method of delivering the content is as html, but the most usable method may be to offer a PDF. It’s version control hell to publish twice, though, so a great way to square the circle is to generate the PDF on demand from the xhtml.

Generating PDF

There are lots of ways to generate PDF, but I’m going to focus on one: Extensible Stylesheet Language Formatting Objects, or XSL-FO. This allows you to take xml (which is why your content should be valid xhtml, as that is xml) and manipulate and format it any blumming way you please: you can make PDFs with footers, page counts, headers etc on the fly, from your content. (Example)
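To give a flavour of what that looks like, here is a minimal sketch in Python that turns a simple xhtml page into an XSL-FO document with a page-number footer. It is an illustration under assumptions, not a production converter: the function name xhtml_to_fo and the HEADING_SIZES house style are invented for the example, only top-level headings and paragraphs in the body are handled, and named entities such as &nbsp; will trip the standard-library parser.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

# Hypothetical house style: a point size for each heading level.
HEADING_SIZES = {"h1": "24pt", "h2": "18pt", "h3": "14pt"}


def fo(tag):
    """Build a namespace-qualified XSL-FO element name."""
    return "{%s}%s" % (FO_NS, tag)


def xhtml_to_fo(xhtml_text):
    """Convert well-formed xhtml into an XSL-FO document, returned as a string."""
    # fromstring raises ET.ParseError if the markup is not well-formed XML.
    page = ET.fromstring(xhtml_text)
    body = page.find("{%s}body" % XHTML_NS)
    if body is None:
        raise ValueError("no xhtml body element found")

    # Page geometry: A4 with room at the bottom for a footer.
    root = ET.Element(fo("root"))
    layout = ET.SubElement(root, fo("layout-master-set"))
    master = ET.SubElement(layout, fo("simple-page-master"), {
        "master-name": "A4", "page-height": "29.7cm",
        "page-width": "21cm", "margin": "2cm"})
    ET.SubElement(master, fo("region-body"), {"margin-bottom": "1.5cm"})
    ET.SubElement(master, fo("region-after"), {"extent": "1cm"})

    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "A4"})

    # Running footer: "Page N" repeated on every page.
    footer = ET.SubElement(sequence, fo("static-content"),
                           {"flow-name": "xsl-region-after"})
    footer_block = ET.SubElement(footer, fo("block"),
                                 {"text-align": "center", "font-size": "9pt"})
    footer_block.text = "Page "
    ET.SubElement(footer_block, fo("page-number"))

    # Main content: one fo:block per heading or paragraph, in document order.
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    for element in body:
        name = element.tag.split("}")[-1]
        if name in HEADING_SIZES:
            attrs = {"font-size": HEADING_SIZES[name], "font-weight": "bold",
                     "space-after": "6pt"}
        elif name == "p":
            attrs = {"font-size": "11pt", "space-after": "6pt"}
        else:
            continue  # anything else is skipped in this sketch
        block = ET.SubElement(flow, fo("block"), attrs)
        # Inline markup such as <em> is flattened to plain text here.
        block.text = "".join(element.itertext()).strip()

    return ET.tostring(root, encoding="unicode")
```

The string that comes back can be saved to a file and handed to an XSL-FO processor such as Apache FOP to produce the PDF. In practice you would more likely write the mapping as an XSLT stylesheet, but the shape of the output is the same.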

What this won’t (yet) do is produce a fully-tagged PDF, to the same level of semantic sexiness that you’d do yourself from your carefully-crafted Word document. (There’s no real reason why a good program shouldn’t be able to tag a PDF from the structure of the source xml; after all, all Adobe Acrobat does is translate the structural markup from Word markup to PDF markup. It’s a mechanical transformation job.)

Even so, that doesn’t mean the document is totally inaccessible, as Alastair Campbell notes in his article The four levels of PDF accessibility. Listening to an on-the-fly generated PDF in a screen reader isn’t much worse than listening to a PDF “hand-made” from a Word document; essentially, you can’t navigate either, but can only listen to it linearly. The “hand-made” version, however, does carry with it the “alt text” if you added it in Word.

But, the main question is, does it matter? If you’re offering the content primarily as accessible, well-structured, semantic xhtml, does it matter if your icing-on-the-cake, printable and saveable PDF isn’t perfectly accessible?

I’d like to thank Jim O’Donnell and others on the Accessify forum for conversations that led to my writing this. If there are any errors or typos, blame those bastards. Not me.

20 Responses to “ Making and generating accessible PDFs ”

Comment by AlastairC

“Listening to an on-the-fly generated PDF in a screen reader isn’t much worse than listening to a PDF “hand-made” from a Word document; essentially, you can’t navigate either, but can only listen to it linearly.”

JAWS (and I believe Window-Eyes) has had similar navigation mechanisms for PDF as for HTML since about version 6. The tags / structure are important for that to work.

The main problem I’ve had recently is when a perfectly ‘marked up’ document still gets mangled by Acrobat (7). I.e. content is missing or in the wrong place.

However, if it is provided as a print version of an accessible online document, you’d be hard pressed to claim accessibility issues. Especially since there isn’t a server-side product that automatically creates tagged PDFs (that I know of), which seems strange, as you would think that is easier.

Comment by Bruce

Perhaps I laid it on a bit thick there, Alastair. I’m looking at information by Andrew Kirkpatrick of Adobe (in the PDF chapter of the accessibility book), which states that the ability to navigate by headings and lists was added in JAWS 7 and, at the time of writing, was not available in Window-Eyes or Home Page Reader 3.04.

You’re right that the big screenreaders all cope with links, tables, forms and paged navigation.

Comment by Jim

which is why your content should be valid xhtml, as that is xml

Tiny pedantic point – your content should be well-formed xhtml, as that is XML. XML doesn’t require validity.

Comment by Bob Esaton

Very fine recommendations. Now, if you could tell us how to avoid a common “reading order” problem, some PDFs will be better.

I frequently find that JAWS does a good job interpreting an untagged PDF when I allow it to use “inferred reading order.” However, some of these PDF files have a running footer on the bottom of every page (containing date and page number), and a logo graphic at the top of every page. So, we’re reading along and run into a sentence that spans a page boundary. In the midst of the sentence we hear “…zero six dash one four dash two thousand and six page three graphic …” It would be a real good thing to avoid that stuff.

I’m guessing there are two ways: don’t include such matter, or mark up the PDF correctly. Other thoughts?

Comment by bruce

don’t include such matter, or mark up the PDF correctly. Other thoughts?

None, I’m afraid Bob. I’d leave out such matter from generated PDFs.

Comment by Alastair Campbell

Using Word (Office 2003) and Acrobat 7, I’ve found that the headers and footers are not read out; they are (automatically) made into artifacts. That is in a tagged document.

In older versions we used to make a version of the Word document without the header & footer. If you were auto-generating from XML, I guess you’d leave it out? The reader has the page numbering built in anyway…

Comment by Grant Broome

I’d strongly disagree with Joe’s statement about creating interactive forms in PDF (number 2 on his list). Using Adobe Acrobat 7.0 Professional, it is IMPOSSIBLE to confidently create radio buttons that will be read by the latest versions of JAWS regardless of the source document (not just talking Quark here). You would need to use a not-very-well-known JavaScript hack for check boxes instead. There are other aspects of Acrobat that make it a nightmare to work with. Disappearing text and a read order panel with a mind of its own make complex forms in PDF not worth the effort IMO. Adobe have made a good start, to be fair, but they still have a lot of work to do.

Comment by Caz Mockett

Some great tips here, Bruce. Thanks for the article.
I’ve been looking for some info on accessible PDFs for a while now, so this was very timely.

As for doing PDF forms – well, I didn’t know you could! So another can of worms to open up (when I have a few spare months to get to grips with them…)

Comment by Jim

Another pedantic point – dark matter in galaxies, and galaxy clusters, interacts with everything. That’s how we infer its existence, even though we can’t see anything there.

I like ‘dark matter of the web’ as a description of PDF, though.

Comment by Bruce

Gah! You choose a smartarse cosmology metaphor, and there’s always a fucking astrophysicist waiting in the wings to pick holes in it, eh. And to think I was at this moment making your CD, Jim….
