How Google ranks PDFs: final conclusions

On 26 October I started a test to monitor how PDF documents are crawled and indexed by search engines.

The test aimed to answer the following questions about PDF documents:

  • Do Microsoft Word® headers (normally used in a professional document) make a difference?
  • How much does the keyword density of the document affect its ranking?
  • How much do document properties (Title, Author, Comments and Keywords) influence indexing?

After just one week, Google started to show some concrete results. The other search engines are still looking around; only Ask returned a couple of results, but nothing worth talking about.

The scope of this document is to highlight the day-by-day fluctuations of the different PDF documents (13 in total).

These are the first, and most interesting, results I collected over the past weeks for the main document returned by a SERP generated using the URK “seiunamicone”.

pdf results

They are followed by the hidden results, the ones that can be seen by expanding the hidden menu (click on the plus symbol).

PDF document indexed

I’ve been monitoring the SERP for a while and, apart from the first days, when there were continuous fluctuations, the results now seem to have stabilized.

Showing a lot of images would probably have been confusing; however, I can assure you that a lot of changes took place, and I presume that looking at the SERP tomorrow could reveal something new.

Just to give you an example outside the above picture, on 3 November the third result was a document called PDF-test-without-headers-KD43.pdf, my test no. 11. To be honest, seeing it there confused me, and I wasn’t able to figure out how it was possible.
That’s the reason why I included a graph collecting the different SERP changes.

This is the full graph.

All the results of my pdf test

Whilst this is a graph with only the documents that appeared in the SERP during the period in which I monitored it.

The documents that have been indexed in Google

Let’s analyze it together, but first let me remind you of something about the documents generated. I assumed a keyword density (KWD) for the URK “seiunamicone” split between the page (42%) and the document properties (56%). I call headers “fake” when H1 and H2 were created using pure emphasis instead of Word styles.
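As a rough illustration of what a keyword density figure means, here is a minimal Python sketch. It assumes the simplest possible definition (the share of words in the text that are exactly the keyword); the function name `keyword_density` and the sample sentence are hypothetical and are not the tool or the exact formula used for the test documents.

```python
import re


def keyword_density(text: str, keyword: str) -> float:
    """Return the percentage of words in `text` that equal `keyword`.

    Assumption: density = keyword occurrences / total words * 100.
    Matching is case-insensitive and works on whole words only.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word == keyword.lower())
    return 100.0 * hits / len(words)


# Hypothetical sample text: 10 words, 2 of which are the keyword.
sample = "seiunamicone is a test word and seiunamicone appears twice here"
print(keyword_density(sample, "seiunamicone"))
```

Under this definition, a document with a KD of 100% would consist of nothing but the keyword repeated.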

The first PDF to be indexed was a document called Test 7 (PDF-test-without-header2-KD100.pdf). This document contains an H1 made using Word styles and a fake H2 (just emphasized text), with a KD of 100%. After just a few days, this document was completely dropped from the Google SERP. Today it is in the index but ranks nowhere.
Test number 5, with 28% KD and one header, got a good result, whilst tests 3, 10 and 13, for example, were not indexed at all.

If we analyze only the first three results (the first one and its aggregates) plus the first result shown when expanding the hidden results, we get the following picture.
Positive results were collected for test number 12, which has always been present in the SERP and has now been stable in position 1 for about one week; test number 1.1, which showed some fluctuation but is now stable in position 2; and finally test 11, which, apart from some daily disappearances, has always been in position 3.

So what makes the difference for these documents?

I am almost sure Google is able to interpret the RTF code contained in a PDF document (most probably by doing a sort of reverse engineering). This sounds like a strong assertion (and maybe it is, so please take it just as my personal opinion), but it’s the only explanation I was able to find when I tried to answer the question “Why these?”

Analyzing the SERPs, I saw that, after the KWD factor, the headers have their own importance. So, today, what could be the answer to the following question?

What are the predominant factors that influence the indexing of a PDF in Google?

If I had to answer this question frankly today, I would probably go with the following points:

  1. Document properties usage. Add the keyword(s) to the document properties (Title, Subject and Keywords; Comments are ignored. We can even use the Author field, but it looks like it is used for different purposes, doesn’t it?)
  2. Keyword density. A sufficient number of keywords in strategic parts of the document, as with HTML pages, results in a better-optimized document, especially when headers are used. But we should remember another important aspect: document length and size, which beyond a certain dimension (100k) result in text that is not crawled.
  3. Header usage. Inserting keywords into header 1 (made with Word styles, not by emphasizing the text) boosts the document and helps it get indexed better. Optionally using an H2 sounds good, but during the tests I noticed that using both of them doesn’t give any extra advantage.
  4. Keyword proximity. Whenever headers are not used, keyword proximity plays an important role.
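
To make the first two points easier to apply, here is a small Python checklist sketch. It is only an illustration under stated assumptions: the document properties are passed in as a plain dictionary (extracting them from a real PDF is left out), the function name `check_pdf_factors` is hypothetical, and “100k” is read as 100 KiB, which is my interpretation rather than a measured threshold.

```python
# Fields that appeared to matter in the test; Comments seemed to be ignored.
COUNTED_FIELDS = ("Title", "Subject", "Keywords")

# Assumption: "100k" is interpreted here as 100 KiB.
SIZE_LIMIT_BYTES = 100 * 1024


def check_pdf_factors(properties: dict, size_bytes: int, keyword: str) -> dict:
    """Report which counted properties contain the keyword and whether
    the file size stays within the assumed crawl limit."""
    needle = keyword.lower()
    found = [
        field for field in COUNTED_FIELDS
        if needle in properties.get(field, "").lower()
    ]
    return {
        "fields_with_keyword": found,
        "missing_fields": [f for f in COUNTED_FIELDS if f not in found],
        "within_size_limit": size_bytes <= SIZE_LIMIT_BYTES,
    }


# Hypothetical example document.
report = check_pdf_factors(
    {"Title": "seiunamicone test", "Keywords": "seiunamicone", "Comments": "seiunamicone"},
    size_bytes=48_000,
    keyword="seiunamicone",
)
print(report)
```

In this example the Subject field is reported as missing, and the keyword in Comments is not counted at all.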

I believe I haven’t forgotten anything, and I hope you enjoyed reading this post.
