Structured data and Google
Domain-specific markup for fun and profit
It didn’t come as a surprise to Dull Old Web Farts (DOWFs) like me to learn last month that Google gives a search boost to sites that use structured data (as well as rewarding sites for being performant and mobile-friendly). Google has brilliant heuristics for analysing the content of sites, but when developers are explicit and mark up their content using subject-specific vocabularies, the results are more robust.
For the first time (to my knowledge), Google has published some numbers on how structured data affects business. The headlines:
- Jobrapido’s overall organic traffic grew by 115%, and they have seen a 270% increase in new user registrations from organic traffic
- After the launch of job posting structured data, Google organic traffic to ZipRecruiter job pages converted at a rate three times higher than organic traffic from other search engines. The Google organic conversion rate on job pages was also more than 4.5 times higher than it had been previously, and the bounce rate for Google visitors to job pages dropped by over 10%.
- In the month following implementation, Eventbrite saw roughly a 100-percent increase in the typical year-over-year growth of traffic from Google Search
- Traffic to all Rakuten Recipe pages from search engines soared 2.7 times, and the average session duration was now 1.5 times longer than before.
Impressive, indeed. So how do you do it? For this site, I chose a vocabulary from schema.org:
These vocabularies cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model. Over 10 million sites use Schema.org to markup their web pages and email messages. Many applications from Google, Microsoft, Pinterest, Yandex and others already use these vocabularies to power rich, extensible experiences.
Because this is a blog, I chose the BlogPosting schema, and I use the HTML5 microdata syntax. So each article is marked up like this:
<article itemscope itemtype="http://schema.org/BlogPosting">
<header>
<h2 itemprop="headline" id="post-11378">The HTML Treasure Hunt</h2>
<time itemprop="dateCreated pubdate datePublished"
datetime="2019-05-20">Monday 20 May 2019</time>
</header>
...
</article>
The values for the microdata attributes are specified in the schema vocabulary, except the pubdate value on itemprop, which isn’t from schema.org but is required by Apple for WatchOS because, well, Apple likes to be different.
And that’s basically it. All of this, of course, is taken care of by one WordPress template, so it’s automatic.
Metadata partial copy-paste necrosis for misery and loss
One thing puzzles me, however: Google’s documentation says that Google Search supports structured data in any of three formats (JSON-LD, RDFa and microdata), but notes “Google recommends using JSON-LD for structured data whenever possible”.
However, no reason is given for preferring JSON-LD except “Google can read JSON-LD data when it is dynamically injected into the page’s contents, such as by JavaScript code or embedded widgets in your content management system”. I guess this could be an advantage, but one of the other “features” of JSON-LD is, in my opinion, a bug:
The markup is not interleaved with the user-visible text
I strongly feel that metadata that is separated from the user-visible data associated with it is highly susceptible to metadata partial copy-paste necrosis. User-visible text is also developer-visible text. When devs copy/paste that, it’s very easy to forget to copy any associated metadata that’s not interleaved, leading to errors. (And Google will penalise errors: structured data will not show up in search results if “The structured data is not representative of the main content of the page, or is potentially misleading”.)
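To make the worry concrete, here is a rough sketch of how the BlogPosting example above might look as JSON-LD. It is an illustration based on the schema.org vocabulary, not markup taken from this site, and the property values are simply copied from the microdata version. The headline and date now live in two places, and nothing in the article element points at the script element:

```html
<!-- Metadata block: often placed in <head>, far from the content it describes -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "The HTML Treasure Hunt",
  "datePublished": "2019-05-20",
  "dateCreated": "2019-05-20"
}
</script>

<!-- Visible content: copy only this part and the metadata above is left behind or goes stale -->
<article>
<header>
<h2 id="post-11378">The HTML Treasure Hunt</h2>
<time datetime="2019-05-20">Monday 20 May 2019</time>
</header>
...
</article>
```

If someone later edits the visible headline or date and forgets the script block, the two quietly disagree, which is exactly the sort of mismatch that could be judged “not representative of the main content of the page”.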
An example of metadata partial copy-paste necrosis can be seen in the commonly-recommended accessible form pattern:
<label for="my-input">Your name:</label>
<input id="my-input"/>
I’ve lost track how many times I found broken ids, duplicate id/for, ids with two or more values and much more, so I prefer the implicit / wrapped variant.
— Klar Name (@tcaspers) May 5, 2019
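The “implicit / wrapped variant” mentioned in that tweet nests the control inside its label, so there is no id/for pair to fall out of sync. A minimal sketch (the name attribute value is made up for illustration):

```html
<!-- The label wraps the input, so the association survives copy/paste without any id or for attribute -->
<label>
Your name:
<input name="your-name"/>
</label>
```

Copying the label element brings its input along with it, in the same way that copying interleaved microdata brings the metadata along with the content.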
I’ve contacted chums in Google to ask why JSON-LD is preferred, but had no reply. (I may go as far as trying to “reach out” next time.)
I’m pretty sure Google prefers JSON-LD over microdata because it’s easier for them to steal, er, borrow the data for their own use in that format. When I was working on a screen-scraping project a few years ago, I found that to be the case. Since then, I’ve come to believe that schema.org is really about making it easier for the big guys to profit from data collection instead of helping site owners improve their SEO. But I’m probably just being a conspiracy theorist.
Speculation and conspiracy theories aside, until there’s a clear reason why I should use JSON-LD over interleaved microdata, I’m keeping it as it is.
Google replies
Updated 23 May: Dan Brickley, a Google employee who is Lord of Schema.org, wrote this thread on Twitter:
1st conversations about https://t.co/ooIuC1elTy in JSON came up via Gmail's "smart mail" features, e.g. Flight boarding passes being marked up to show up in Google Now, smart watches etc.
— Dan Brickley (@danbri) May 22, 2019
6 Responses to “Structured data and Google”
“Microdata” solves nothing and introduces its own problems; in fact, in my opinion, it reintroduces some of the problems that HTML5 set out to solve through EAV (Entity Attribute Values).
In contrast, JSON-LD can be validated, loaded directly into a database and queried immediately, while microdata will have to be detected and harvested from the DOM, presumably as JSON, before any processing can happen.
Anything that sticks meaningful data in HTML attributes that are invisible in the browser is inviting necrosis, not least because there is no validation step. I still can’t get over the fact that we never got CSS/locale support for time tags.
So will I be penalized for saying the following:
JSON-LD: the X information is on the page
HTML: show the X information on :hover in CSS dropdown menu
The JSON-LD-specified information is visible to the user only after user interaction, so it is possible for Google Bot to “discover” you’re lying.
On the other hand I could do:
JSON-LD: the X information is on the page
HTML: show the X information, transparency: 95%
and this way I could cheat SEO.
Don’t Repeat Yourself seems more robust here.
FWIW, I prefer the interleaved nature of microdata or RDFa (and between those two I prefer RDFa) to the separated nature of JSON-LD. To my mind it’s like an extra enhancement on top of my semantic HTML, so it feels logical to have them close to each other in the code. 🙂
That being said, I can appreciate some of the reasons others like JSON-LD.
I believe all have their uses and should be supported and I hope Google continue to do so. So, it does concern me a tad when Google expresses a preference for one of them – given their history of discontinuing products and services…
I’m curious, what about Microformats? Why does it seem like this open standard is never considered for this type of application? It is in the same vein of structured data markup options, isn’t it? I’ve never understood why it has always been left out of the discussion. I’m curious to know if anyone can explain why.
My preference is for Microdata, for many of the reasons outlined in this article. However, when checking a site on Google’s Rich Results today, I realised it no longer seems to be supported by Google. I can’t find this announced anywhere, so I guess this leaves me no choice but to redo everything in JSON-LD.
We ended up using JSON-LD for the An Event Apart web site because everything is being spat out of a CMS, so it was easier to output the structured data in separate blocks than try to interleave it in the markup. In more manually-maintained scenarios, I’d very likely go the interleaving route for the reasons you describe.
Probably unrelated: your RSS feed is only including the article paths instead of absolute URLs, like this: <link>/2019/the-html-treasure-hunt/</link>. Which confuses my RSS reader (NetNewsWire 3!) to the point that it passes just “/2019/the-html-treasure-hunt/” to the browser, which understandably doesn’t know what to do next.