Ideas for improving metaformats parsing
Why metaformats
I first became interested in the metaformats spec when I wanted to generate link previews on my site when bookmarking or replying to pages. Metaformats is an extension to the microformats2 parsing algorithm that provides a fallback for sites that don’t support microformats. It works by translating the meta tags into microformats properties. This is great, because it makes microformats useful as the vocabulary for parsing, even if a page doesn’t use microformats itself. The challenge is in how to interpret meta tags, which are not consistently used and often cater to SEO or social media site engagement rather than accurately representing the content.
Deviations from the spec
When I began work on implementing metaformats for the JavaScript microformats-parser library, I quickly began to question whether the spec was taking the best approach. Because metaformats is an experimental extension, the library maintainer and I decided to take some liberties interpreting the spec. So far this has seemed to work fine, but ultimately I’d like to see multiple, consistent implementations, so I want to discuss some possible spec changes.
It’s also worth noting that the spec is a living document and subject to change. For reference, most of my initial work was done based off of an April 2022 revision.
Only parse as a fallback
The first question we addressed was when to parse for metaformats.
At the time I implemented it, the spec only skipped metaformats parsing if the html element had a root class.
However, adding metaformats to a page with microformats could lead to multiple h-entry items in a document, confusing any downstream microformats logic.
There’s a more recent proposal to parse metaformats as a separate property from the true microformats on the page.
I suggest only parsing meta tags if there are no microformats found on the page.
This avoids the “multiple entries” problem from the original spec and fits with the goal of metaformats being a fallback definition.
This design could potentially have issues if sites have incomplete microformats, such as those from a WordPress template.
If this issue is common, a quality heuristic, such as checking that an h-entry has properties, could be defined to weed out bad microformats, but I wasn’t sure what to check for.
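As a rough sketch of the fallback idea, assuming microformats items in the usual parsed shape ({ type, properties }): the function names and the "at least one non-empty property" heuristic below are hypothetical, chosen only for illustration, and are not taken from the spec or the library.

```javascript
// Hypothetical heuristic: an item is "usable" if at least one property has
// a non-empty list of values. This threshold is an assumption.
function hasProperties(item) {
  return Object.values(item.properties || {}).some(
    (values) => Array.isArray(values) && values.length > 0
  );
}

// Keep the page's own microformats when any usable item exists; otherwise
// call the metaformats parser, passed in as a function so that it only
// runs when it is actually needed.
function itemsWithFallback(microformatsItems, parseMetaformats) {
  return microformatsItems.some(hasProperties)
    ? microformatsItems
    : parseMetaformats();
}
```

With this shape, a page whose only h-entry is an empty template shell would still get the metaformats result, while a page with even one populated item would not.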
Add additional fallback options
The meta elements checked by the metaformats spec are limited to Twitter cards (like [name="twitter:description"]) and Open Graph Protocol.
However, other parts of the HTML document <head>, like the <title> element and meta tags like [name="description"], are even more common than Open Graph Protocol and could be used as well.
Some of these may suffer from SEO keyword stuffing, but they are better than having no value.
Imply h-entry on all pages
When I implemented metaformats parsing, I used h-entry as a fallback root class for all pages that aren’t a more specific type (like an og:type of “profile”).
While the spec will usually use h-entry, it does require that some Open Graph Protocol or Twitter card tags exist to parse anything.
With additional fallback tags like the <title> element, this logic seemed overly complex.
The spec could be simplified to use h-entry by default if a more suitable type isn’t found, and then check for no properties to handle the rare case that no relevant meta tags are used.
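The simplified rule could look something like the sketch below. The function names are hypothetical, and matching dotted types such as "video.movie" by their prefix is an assumption made for this example rather than something the spec states.

```javascript
// Map an og:type value to a root class, defaulting to h-entry.
function inferRootClass(ogType) {
  if (ogType === "profile") return "h-card";
  // Assumption: treat "music", "video", and dotted subtypes as h-cite.
  if (ogType && /^(music|video)(\.|$)/.test(ogType)) return "h-cite";
  return "h-entry";
}

// Build the item, or return null in the rare case that no relevant
// meta tags produced any properties.
function buildItem(ogType, properties) {
  if (Object.keys(properties).length === 0) return null;
  return { type: [inferRootClass(ogType)], properties };
}
```

The empty-properties check replaces the requirement that specific Open Graph Protocol or Twitter card tags exist before anything is parsed.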
Clarify impact of parsing order
It could be my misunderstanding, but the spec was not clear about how order impacts parsing.
I think the intention is that the tags are looked at together, so if both are found, only the Open Graph Protocol og:title is parsed, not the twitter:title.
My approach was to collect all meta tags into a map before trying to determine which one to use.
This removes the issue of order where possible.
However, I did learn that the Open Graph Protocol depends on order in some cases, such as complex properties that can be defined multiple times (array types) or with media tags.
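To illustrate the collect-first approach, here is a minimal sketch that works on plain objects standing in for <meta> elements, so it runs without a DOM. The names collectMetaTags and firstValue are hypothetical and this is not the code from the library.

```javascript
// Collect every meta tag into a Map of key -> values in document order.
// Each tag is a plain object like { property, name, content }.
function collectMetaTags(tags) {
  const collected = new Map();
  for (const tag of tags) {
    if (typeof tag.content !== "string" || tag.content === "") continue;
    // A tag may carry property, name, or both; index it under each
    // distinct key so later lookups don't care which attribute was used.
    for (const key of new Set([tag.property, tag.name])) {
      if (!key) continue;
      const values = collected.get(key) || [];
      values.push(tag.content);
      collected.set(key, values);
    }
  }
  return collected;
}

// Return the first value of the first key that has one. Priority comes
// from the order of `keys`, not from the order of tags in the document.
function firstValue(collected, keys) {
  for (const key of keys) {
    const values = collected.get(key);
    if (values && values.length > 0) return values[0];
  }
  return undefined;
}
```

Calling firstValue(collected, ["og:title", "twitter:title"]) prefers og:title even when the Twitter tag appears earlier in the document, while repeated keys such as og:image keep their document order within the stored array.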
Handle variations of media tags
The media tags seemed like one of the hairier parts of metaformats parsing, mainly because of how flexibly the Open Graph Protocol defines them.
The OGP says that og:image, og:image:url, and og:image:secure_url are all valid ways to define an image URL and gives examples where og:image:secure_url is used alongside the others.
Images can also have alt text via the og:image:alt property.
I’ve created a spec GitHub issue specifically to add alt text parsing.
Further complicating things, tags like og:image can be defined multiple times to create a list of object definitions, so a sub-property like alt text needs to be tied to the last og:image, not just any image.
The spec currently only recognizes og:image as a u-photo and therefore misses both alt text and situations where a site might only use og:image:secure_url but not the simpler og:image (not sure if this is common).
The implementation I used only deduplicated by URL, but many sites like Wikimedia appear to have multiple og:image tags for the same photo at different sizes, which the current parser would consider different images.
This area in particular warrants further investigation to see what patterns are actually in use across websites.
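One possible way to tie sub-properties to the right image is to walk the tags in document order and attach each sub-property to the most recently opened image. The sketch below is only an illustration under several assumptions: groupOgImages is a hypothetical name, the input is a list of plain { property, content } objects, and a lone og:image:secure_url is treated as opening a new image. It does not try to merge the same photo published at different sizes.

```javascript
// Group og:image tags and their sub-properties into image objects of the
// form { value, alt?, secureUrl? }, relying on document order.
function groupOgImages(tags) {
  const images = [];
  let current = null;
  for (const { property, content } of tags) {
    if (!content) continue;
    if (property === "og:image" || property === "og:image:url") {
      // A repeat of the URL for the image just opened is not a new image.
      if (current && current.value === content) continue;
      current = { value: content };
      images.push(current);
    } else if (property === "og:image:secure_url") {
      if (current && current.secureUrl === undefined) {
        current.secureUrl = content;
      } else {
        // Assumption: with no image open (or one that already has a
        // secure URL), the secure URL stands in as the image itself.
        current = { value: content, secureUrl: content };
        images.push(current);
      }
    } else if (property === "og:image:alt" && current) {
      // Alt text belongs to the last image seen, not to any image.
      current.alt = content;
    }
  }
  return images;
}
```

A site that lists og:image:secure_url before its og:image would still produce two entries here, which is the kind of variation that needs real-world data to settle.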
Parse images as u-featured
The spec says to parse meta[property="og:image"] tags (and the Twitter card alternative) as u-photo properties.
However, the post-type-discovery algorithm says that the presence of a “photo” property implies that it is a photo post.
In practice, a large part of the internet includes og:image to improve the display of links to their site on social media.
For example, GitHub generates images for pages on its site, even though you wouldn’t consider a GitHub issue a photo post.
As an alternative, I suggest that these image properties be parsed as u-featured properties instead.
This would be more in line with how the meta tags are used.
This is also how the OpenGraph to Microformats2 package implemented its parsing.
While writing this post, I realized that for root classes other than h-entry (particularly h-card), parsing as u-photo might still be preferable.
The implementation
With the above context described, here is a simplified description of how the JavaScript metaformats parser was implemented. This is written more as illustrative pseudocode than as a spec, so there are some details left out. To see the exact code, check out the microformats-parser PR.
- Attempt to parse microformats as normal
- If items are found, stop. If not, parse the metaformats as follows
  - Find the <head> element and parse all its children into an object mapping key/value pairs as follows
    - If it is a <meta> tag with a content attribute, store the content value as the value with the following key
      - If it has a property attribute (as used by Open Graph Protocol), use that as the key
      - (And/or) If it has a name attribute, use that as the key
      - If the key starts with (og|twitter):(image|video|audio)
        - If the key ends with alt, look for a value matching the prefix (e.g. og:image) and change that value to an object of the form {value: existingValue, alt: contentOfAltTag}
        - Else if the key ends with url, simplify the key to just the prefix (e.g. og:image:secure_url becomes og:image)
      - If a value exists for that key and the new value is different (also checking for image object form), then store both values in an array
    - If it is a <title> element, store it with a unique key and use its inner value as the value
    - If it is a <link> element with rel="canonical", store it with a unique key and use its href attribute value as the value
  - Determine the root class as follows
    - If the og:type is “profile”, then use h-card
    - If the og:type is “music” or “video”, then use h-cite
    - Note that there’s a suggestion to infer h-card if URL is the root of the domain
    - Else use h-entry
  - Parse the properties as follows
    - Set name to the first value found for the keys: og:title, twitter:title, the inner text of the <title> element
    - Set summary to the first value found for the keys: og:description, twitter:description, or description
    - Set featured to the first value found for the keys og:image or twitter:image
      - Note: featured should probably be photo for anything other than h-entry
    - Set video to the first value found for og:video or twitter:video
    - Set audio to the first value found for og:audio or twitter:audio
    - Set published to the first value found for og:published_time or date
    - Set updated to the first value found for og:modified_time
    - Set author to the first value found for article:author or author
    - Set url to the first value found for og:url or the value of the canonical link tag
    - Set publication to the first value found for og:site_name or publisher
    - If the inferred type is h-card
      - Set given-name to the value found for profile:first_name
      - Set family-name to the value found for profile:last_name
      - Note: profile:username might be worth parsing as p-nickname. It’s used on mastodon profiles
  - If any properties are found, return the object as the parsed microformats item
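The property-mapping step above lends itself to a small lookup table. The following is a sketch rather than the code from the PR: PROPERTY_SOURCES and mapProperties are hypothetical names, `collected` is assumed to be a plain object of key to value (or array of values), and "title" and "canonical" are made-up keys standing in for the <title> text and the canonical link href.

```javascript
// Microformats property -> meta keys to try, in priority order.
// "title" and "canonical" are hypothetical keys for the <title> text
// and the canonical link href.
const PROPERTY_SOURCES = {
  name: ["og:title", "twitter:title", "title"],
  summary: ["og:description", "twitter:description", "description"],
  featured: ["og:image", "twitter:image"],
  video: ["og:video", "twitter:video"],
  audio: ["og:audio", "twitter:audio"],
  published: ["og:published_time", "date"],
  updated: ["og:modified_time"],
  author: ["article:author", "author"],
  url: ["og:url", "canonical"],
  publication: ["og:site_name", "publisher"],
};

function mapProperties(collected, rootType) {
  const properties = {};
  // First non-empty value among the given keys, or undefined.
  const lookup = (keys) =>
    keys.map((key) => collected[key]).find((v) => v !== undefined && v !== "");

  for (const [property, keys] of Object.entries(PROPERTY_SOURCES)) {
    const value = lookup(keys);
    if (value === undefined) continue;
    // Per the note above: emit images as photo for anything but h-entry.
    const target =
      property === "featured" && rootType !== "h-entry" ? "photo" : property;
    // Microformats property values are always arrays.
    properties[target] = [].concat(value);
  }

  if (rootType === "h-card") {
    const given = lookup(["profile:first_name"]);
    const family = lookup(["profile:last_name"]);
    if (given !== undefined) properties["given-name"] = [given];
    if (family !== undefined) properties["family-name"] = [family];
  }
  return properties;
}
```

Keeping the priorities in one table makes it easy to experiment with dropping the Twitter keys or adding other fallbacks without touching the mapping logic.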
More analysis needed
To settle on a metaformats algorithm that works with a broad portion of the internet, more analysis of meta tag usage is needed. Looking at the details of how sites implement metadata has given me lots of questions. The definition of the Open Graph Protocol left a lot of room for interpretation, but the most common use of OGP for social media only revolves around a few properties. A more complex algorithm could provide a more accurate representation of sites, but with questionable value and more possibilities for misinterpretation. A simpler algorithm might work well enough for a simpler use case.
Some questions for further analysis:
- Do we need to include Twitter Cards? How often are they used without the related Open Graph Protocol tags? If they are both used, how do they differ in value?
- How often do sites use name and property attributes on the same meta tag and are they ever describing different properties? (e.g. <meta name="og:title" property="twitter:description" content="Why not both?" />)
- How usable are some of the more generic metadata properties (<title>, description, author, etc.)? Are they worthwhile fallbacks?
- What variations on media tags are actually used in the wild (multiple photos, videos, audio, secure urls with different paths, alt text, different tag names)?
- What other metadata formats are actually used and do they provide additional value? How would you even interpret JSON-LD metadata?
- How are authors used in meta tags? Do sites define multiple authors?
- How are canonical urls usually defined? Do og:url and link[rel=canonical] ever differ?
- If metaformats are used primarily for link previews, does it make sense to focus parsing around h-cite properties like publication?
I hope to do further analysis of some common sites to answer some of these questions. My goal would be to provide some clarity on how meta tags are actually being used so parsing works well outside of the microformats internet bubble. I’d also like to hear what use cases others see for metaformats. What else is missing?