Sunday, May 4, 2014

Facebook scraping Blogger post descriptions

It has been a very long time since I posted here. Really, this blog is reserved for truly nerdy "ahah" moments, and I've just had one.

When you plug a blog URL into a Facebook post, FB goes away and "scrapes" the URL, looking for certain information that it then uses in the post. Two items matter:

  1. A picture taken from the blog post, with arrows to choose between pictures if there is more than one, and the option to upload something completely different.
  2. Some text, which is highlighted and can be changed. In theory this text comes from the blog post.

The problem with a Blogger blog is that this text is the same for every post: it is simply the blog's description text.

Of course I read around this. If the blog contains Open Graph tags then the og:description tag content will be used, but this means that the HTML of every post would need to be edited by hand to add the tag with an appropriate value. Not good - we want something that applies automatically to every post.

In the absence of og tags the scraper looks for the first occurrence of a <p> tag and uses the text it finds there.
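
As a rough illustration of the lookup order described above, here is a minimal Python sketch. It is not Facebook's actual scraper, only a model of the behaviour as this post describes it: use og:description if present, otherwise the text of the first <p>. The names DescriptionFinder and guess_description are made up for the example.

```python
from html.parser import HTMLParser


class DescriptionFinder(HTMLParser):
    """Records the og:description content and the text of the first <p>."""

    def __init__(self):
        super().__init__()
        self.og_description = None
        self.first_p_text = None
        self._in_first_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:description":
            # Keep only the first og:description found.
            if self.og_description is None:
                self.og_description = attrs.get("content")
        elif tag == "p" and self.first_p_text is None and not self._in_first_p:
            self._in_first_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_first_p:
            self._in_first_p = False
            self.first_p_text = "".join(self._buffer).strip()

    def handle_data(self, data):
        # Text inside nested tags such as <span> is collected too.
        if self._in_first_p:
            self._buffer.append(data)


def guess_description(html):
    """Assumed rule: og:description wins, else the first <p> text."""
    finder = DescriptionFinder()
    finder.feed(html)
    finder.close()
    return finder.og_description or finder.first_p_text
```

Fed a page with no og tags, guess_description returns whatever sits in the first <p>, which is how the blog description can end up in every Facebook post.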

That leads to the first, smaller "ahah" moment. The reason the blog description keeps getting found is that it is within a <p> tag. At the foot of this post you will find some suggestions about working with Blogger HTML, but for now take it on trust. This fragment, using a <p> tag, causes the scraper always and only to find the description text:
  <div class='descriptionwrapper'>
    <p class='description'><span><data:description/></span></p>
If the <p> tags are changed to <div> tags this is fixed:

  <div class='descriptionwrapper'>
    <div class='description'><span><data:description/></span></div>

So far so good, but the next bit of text enclosed in <p> tags is in the comments header, and we don't want that either. What we want is the first paragraph or so from the post body. Start by locating this piece of HTML:

      <div class='post-body entry-content' expr:id='&quot;post-body-&quot; +' itemprop='articleBody'>

The post.body that follows this line is the occurrence that actually appears online. If you wrap it in <p> tags as below, all posts on the blog will be scraped appropriately:

      <div class='post-body entry-content' expr:id='&quot;post-body-&quot; +' itemprop='articleBody'>
       <p> <data:post.body/></p>
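
To see why both template changes are needed, here is a small Python sketch that applies the "first <p> wins" rule from this post to a heavily simplified page. The render function, the sample text and the comments markup are all invented for illustration; a real Blogger page has far more in it.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Captures the text of the first <p> element only."""

    def __init__(self):
        super().__init__()
        self.text = None
        self._open = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.text is None and not self._open:
            self._open = True

    def handle_endtag(self, tag):
        if tag == "p" and self._open:
            self._open = False
            self.text = "".join(self._parts).strip()

    def handle_data(self, data):
        if self._open:
            self._parts.append(data)


def first_paragraph(html):
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return parser.text


def render(description_tag, wrap_body):
    """Build a toy page; description_tag is 'p' or 'div' (hypothetical layout)."""
    body = "First lines of the post."
    if wrap_body:
        body = "<p>" + body + "</p>"
    return (
        "<div class='descriptionwrapper'>"
        f"<{description_tag} class='description'>"
        "<span>My blog description</span>"
        f"</{description_tag}></div>"
        "<div class='post-body entry-content'>" + body + "</div>"
        "<div class='comments'><p>No comments yet</p></div>"
    )


original = render("p", wrap_body=False)
description_fixed = render("div", wrap_body=False)
both_fixed = render("div", wrap_body=True)

print(first_paragraph(original))           # My blog description
print(first_paragraph(description_fixed))  # No comments yet
print(first_paragraph(both_fixed))         # First lines of the post.
```

With only the description changed to a <div>, the comments text becomes the first <p>. Once the post body is wrapped as well, the post text is found first.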

Working with Blogger HTML

  1. Use Template/Backup to back up your existing template in case you blow it
  2. Use Template/Edit HTML
  3. Note that sections are collapsed which makes searching for stuff difficult - but there are line numbers
  4. Use an editor with line numbers (I use Context) to open the backup XML file and find things. Note the line number. Do not modify this file.
  5. Find the line number in the Blogger editor, make changes, and save the template
  6. Try starting a FB post for one of your posts - use the full post URL
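
If you would rather not hunt through the backup file by eye, step 4 can also be done with a few lines of Python. This is only a sketch: find_line_numbers and find_in_file are names made up here, the path you pass in is whatever you called your backup file, and the file is only read, never written.

```python
from pathlib import Path


def find_line_numbers(template_text, needle):
    """Return the 1-based numbers of the lines that contain needle."""
    return [
        number
        for number, line in enumerate(template_text.splitlines(), start=1)
        if needle in line
    ]


def find_in_file(path, needle):
    """Read the backup template (read-only) and search it."""
    text = Path(path).read_text(encoding="utf-8")
    return find_line_numbers(text, needle)


# A tiny stand-in for a template, just to show the output format.
sample = """<div class='descriptionwrapper'>
<p class='description'><span><data:description/></span></p>
</div>
<div class='post-body entry-content'>
<data:post.body/>
</div>"""

print(find_line_numbers(sample, "<data:post.body/>"))    # [5]
print(find_line_numbers(sample, "class='description'"))  # [2]
```

A real template is likely to contain post.body more than once, so expect several line numbers and check each one in the Blogger editor.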
