Asset Scraping
By default, Assets in Talk have their metadata scraped when they are loaded. This is the simplest way for newsrooms to integrate their CMSs with Talk. Talk reads the meta tags listed in the table below from the target page to extract the corresponding Asset properties.
Asset scraping is performed by the scraper job, which is enabled by default when you launch Talk. If your production site is behind a paywall or otherwise prevents scraping, you might need to configure a TALK_SCRAPER_PROXY_URL or custom TALK_SCRAPER_HEADERS.
| Asset Property | Selector |
|---|---|
| title | See metascraper-title |
| description | See metascraper-description |
| image | See metascraper-image |
| author | See metascraper-author |
| publication_date | See metascraper-date |
| modified_date | meta[property="article:modified"] |
| section | meta[property="article:section"] |
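
To make the two explicit selectors concrete, here is a minimal Python sketch of what a page needs to expose for modified_date and section to be populated. This is not Talk's scraper code. The sample page, the `extract_meta` helper, and the choice to keep only the first matching tag are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Meta "property" values from the selector table, mapped to the
# Asset property each one fills.
WANTED = {
    "article:modified": "modified_date",
    "article:section": "section",
}


class MetaPropertyParser(HTMLParser):
    """Collects the content of <meta property="..."> tags listed in WANTED."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = WANTED.get(attributes.get("property"))
        # Assumption: keep only the first matching tag for each property.
        if key and key not in self.found:
            self.found[key] = attributes.get("content") or ""


def extract_meta(html):
    """Hypothetical helper: return the matched Asset properties as a dict."""
    parser = MetaPropertyParser()
    parser.feed(html)
    return parser.found


# Hypothetical article page used only for this example.
SAMPLE_PAGE = """
<html><head>
  <meta property="article:modified" content="2018-10-30T12:00:00.000Z">
  <meta property="article:section" content="Technology">
</head><body></body></html>
"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE_PAGE))
```

A page that omits these tags yields an empty result for both properties, which matches the blank modified_date and section values that the debug command can report.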
You can use the ./bin/cli assets debug <url> command to print the metadata scraped from that URL. For example:
$ ./bin/cli assets debug https://www.washingtonpost.com/technology/2018/10/30/apple-event-october-ipad-pro-macbook-air/
┌──────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Property │ Value │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ title │ Apple redesigns the iPad Pro, breathes new life in the MacBook Air │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ description │ Apple is unveiling new iPads and MacBooks at an event in New York starting at 10 a.m. Fowler is there and will report in with the news and hands-on analysis throughout the day. │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ image │ https://www.washingtonpost.com/resizer/JAwNQE2alL2JjiWrbXeJ46wZHqA=/1484x0/arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/G5TWBFW4LAI6RC5MX7QB7TODUY.jpg │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ author │ Geoffrey A. Fowler │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ publication_date │ 2018-10-30T10:40:00.000Z │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ modified_date │ │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ section │ │
└──────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
You can use the ./bin/cli assets refresh [age] command to trigger scraping, or to rescrape assets where the scraper job was unsuccessful.