外链美化实践：自动生成网站预览卡片

NextJS NextJS实战

🧑‍💻

推荐全栈学习资源：

Next.js 中文文档：样式和官网一样的中文文档，创造沉浸式Next.js中文学习体验。

《Chrome插件全栈开发》：真实出海项目的实战教学课，讲解Chrome插件和Next.js端的全栈开发，帮助你半个月内成为全栈出海工程师。

群友问：输入一个网址，获取网站标题、描述和卡片的功能怎么做？

question

正好最近我也在考虑给落地页添加一个「Showcases」模块，用于展示使用了我的落地页模板的网站。问题对口，开始思考！

学过《爬虫从入门到入狱精通》的开发者都知道，网站的所有基本信息可以在 metadata 信息里提取。这样目标就很明确了，只要写出一个工具类，实现抓取 metadata 信息，再分析提炼，返回页面所需的信息就可以自定义渲染样式了。

本文基于 Next.js app router 实现，读完本文你能学到：

预览卡片需要展示哪些信息
模拟浏览器请求读取 dom
分析 metadata 并读取关键信息

知识点很少，实现单一功能的工具类就是这么简单。

预览卡片需要展示哪些信息

metadata 对于网站和搜索引擎（或者说爬虫）来说是双向奔赴的元素，更完善的 metadata 可以让搜索引擎获取更多有价值的信息，反过来，有价值的信息越多，搜索引擎就越可能把网站标记为高价值的网站，让更多人搜索到，提升网站流量。

也就是说，如果一个网站希望被更多人看到、如果你想获取这个网站的信息，你只要去看 metadata 就够了。

对于用于生成预览卡片，metadata 可能提供的信息会有这么多：

标题（Title）：网页的主标题
描述（Description）：简短的网页内容概述
图片（OG Image）：metadata 里的 open graph image 或者 twitter card image
URL：网站地址
favicon：网站的logo
发布日期：对于新闻或博客文章特别有用
作者信息：对于个人博客或新闻网站很有帮助
阅读时间估计（对于文章类网页）：给读者一个大致的时间预期
主题或分类：帮助用户快速了解内容类型
社交媒体互动数据：如点赞数、分享数等
简短的内容预览：比描述更详细一些的内容摘要

但绝大部分预览卡片不会这么复杂，我们抓取这几个信息就够了：

标题
描述
OG Image
favicon

本文的分享也围绕获取这几个关键信息来实现工具类。

实现工具类

重申一遍这个工具类解决的需求：传入网址 URL，返回预览卡片所需的信息。

实现这个工具类有几个步骤：

判断传入的 URL 是否符合规则
模拟浏览器请求，抓取 DOM
分别从 DOM 信息分析并读取 title、description、favicon 和 OG image
返回抓取到的关键信息

先写一个方法的入口：

// @lib/metaScraper.ts
 
  async function scrape(url: string): Promise<ScrapedData> {
    // 根据 url 抓取信息
    // ……
    return {
      url,
      title: 'TODO',
      description: 'TODO',
      logo: 'TODO',
      og: 'TODO',
    };
  }

判断传入的 URL 是否符合规则

写一个方法，判断传入的 URL 是否符合规则：

function normalizeUrl(url: string): string {
  // 匹配有效的URL格式
  const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i;
 
  if (!urlPattern.test(url)) {
    throw new Error("Invalid URL format");
  }
 
  // 如果URL没有协议,添加 'https://'
  if (!url.startsWith('http://') && !url.startsWith('https://')) {
    url = 'http://' + url;
  }
 
  return url;
}

在 scrape 里调用 normalizeUrl：

  async function scrape(url: string): Promise<ScrapedData> {
    // 获取 normalizedUrl
    let normalizedUrl;
    try {
      normalizedUrl = normalizeUrl(url);
    } catch (error) {
      console.error(`Error normalizing URL ${url}: ${(error as Error).message}`);
      throw error;
    }
 
    // 根据 url 抓取信息
    // ……
 
    // 返回预览信息
    return {
      url: normalizedUrl,
      title: 'TODO',
      description: 'TODO',
      logo: 'TODO',
      og: 'TODO',
    };
  }

模拟浏览器请求，抓取 DOM

模拟浏览器请求可以直接用 fetch；抓取 DOM 建议使用 jsdom 依赖包，它可以在 Node 环境模拟浏览器访问网站，然后解析 HTML 并创建虚拟 DOM 树。

  async function scrape(url: string): Promise<ScrapedData> {
    // 上一步获取 normalizedUrl 的代码
    // ……
 
    // 根据 url 抓取信息
    const response = await fetch(normalizedUrl, {
      method: 'GET',
      headers: {
        'User-Agent': 'ModernMetaScraper/1.0 (+https://landingpage.weijunext.com/; weijunext@gmail.com)'
      }
      redirect: 'follow',
    });
 
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }
    const html = await response.text();
    const dom = new JSDOM(html);
    const document = dom.window.document;
    // ……
 
    // 返回预览信息
    return {
      url: normalizedUrl,
      title: 'TODO',
      description: 'TODO',
      logo: 'TODO',
      og: 'TODO',
    };
  }

在请求代码里，有两个要点：

我们添加了 User-Agent 头，按照最佳实践留下了联系信息，这样如果网站方不愿意被抓取信息可以联系到我们。
我们设置了redirect: 'follow'，这表示抓取代码会自动跟随网站重定向

如果要确认获取的 document 对象是否和页面对应，可以通过 console.log(console.log(document.documentElement.outerHTML)) 打印出来的信息做对比。

分析并读取 title、description、favicon 和 OG image

在任意网站里，title、description、favicon 和 OG image 都可能存在于 head 里的多个位置。

以 title 为例，除了可能在 <title> 标签里，还可能在 meta 的 og 信息里，或者 twitter card 信息里，甚至可能就是一个 h1 标签。基于这样的分析，我们可以实现一个抓取 title 的方法：

function extractTitle(document: Document): string {
  return document.querySelector('title')?.textContent ||
    document.querySelector('meta[property="og:title"]')?.getAttribute('content') ||
    document.querySelector('meta[name="twitter:title"]')?.getAttribute('content') ||
    document.querySelector('h1')?.textContent ||
    '';
}

类似的，可以得到抓取 description、favicon 和 OG image 的方法：

function extractDescription(document: Document): string {
  return document.querySelector('meta[name="description"]')?.getAttribute('content') ||
    document.querySelector('meta[property="og:description"]')?.getAttribute('content') ||
    document.querySelector('meta[name="twitter:description"]')?.getAttribute('content') ||
    document.querySelector('p')?.textContent ||
    '';
}
 
function extractLogo(document: Document, baseUrl: string): string {
  const logoUrl = document.querySelector('link[rel="icon"]')?.getAttribute('href') ||
    document.querySelector('link[rel="shortcut icon"]')?.getAttribute('href') ||
    document.querySelector('meta[property="og:image"]')?.getAttribute('content') ||
    '/favicon.ico';
 
  return new URL(logoUrl, baseUrl).href;
}
 
function extractOgImage(document: Document, baseUrl: string): string {
  const ogImage = document.querySelector('meta[property="og:image"]')?.getAttribute('content') ||
    document.querySelector('meta[name="twitter:image"]')?.getAttribute('content') ||
    '';
 
  return ogImage ? new URL(ogImage, baseUrl).href : '';
}

这样就能返回预览信息了：

  async function scrape(url: string): Promise<ScrapedData> {
    // 上一步获取 normalizedUrl 的代码
    // ……
 
    // 根据 url 抓取信息
    // ……
 
    // 返回预览信息
    return {
      url: normalizedUrl,
      title: extractTitle(document),
      description: extractDescription(document),
      logo: extractLogo(document, normalizedUrl),
      og: extractOgImage(document, normalizedUrl),
    };
  }

一个完整的 scrape 方法就完成了。

页面使用

因为我们是在 Next.js 项目里开发，而且上面的工具类需要运行在 Node 环境，所以我们可以在 Next.js 的服务端组件里调用：

 
const metaScraper = scrape();
 
const Showcase = async ({ id, locale }: { id: string; locale: any }) => {
 
  const previewInfo = await metaScraper.scrape('https://landingpage.weijunext.com/');
 
  console.log(previewInfo)
 
  // ……
})

这里拿到 previewInfo 就可以在页面上渲染了。

升级版本

验证成功后，我觉得这样还不够方便，因为我需要展示很多使用案例，我想增加两点需求：

feature

传入多个网址，可以返回预览信息的数组
支持自定义 metadata 信息，即如果传入的信息里存在自定义的 metadata 信息，则程序不去抓取，而是使用自定义值

实现起来也很简单，第一个需求，只要使用 promise.all 去批量处理 map 出来的 promise 方法，再修改一下返回结构就可以；第二个需求只要在抓取信息前判断是否已有自定义值就可以。

是闲出来的完整代码如下：

import { JSDOM } from 'jsdom';
 
interface ScraperOptions {
  timeout?: number;
  maxRedirects?: number;
  headers?: Record<string, string>;
}
 
interface UrlInput {
  url: string;
  title?: string;
  description?: string;
  logo?: string;
  og?: string;
}
 
interface ScrapedData {
  url: string;
  title: string;
  description: string;
  logo: string;
  og: string;
}
 
const defaultOptions: Required<ScraperOptions> = {
  timeout: 10000,
  maxRedirects: 5,
  headers: {
    'User-Agent': 'ModernMetaScraper/1.0 (+https://landingpage.weijunext.com/; weijunext@gmail.com)'
  },
};
 
export function createModernMetaScraper(options: ScraperOptions = {}) {
  const scrapeOptions: Required<ScraperOptions> = { ...defaultOptions, ...options };
 
  async function scrapeMultiple(inputs: UrlInput[]): Promise<ScrapedData[]> {
    return Promise.all(inputs.map(input => scrapeOrUseProvided(input)));
  }
 
  async function scrapeOrUseProvided(input: UrlInput): Promise<ScrapedData> {
    const { url, title, description, logo, og } = input;
    const needsScraping = !title || !description || !logo || !og;
 
    if (!needsScraping) {
      return { url, title, description, logo, og } as ScrapedData;
    }
 
    const scrapedData = await scrape(url);
 
    return {
      url,
      title: title || scrapedData.title,
      description: description || scrapedData.description,
      logo: logo || scrapedData.logo,
      og: og || scrapedData.og,
    };
  }
 
  async function scrape(url: string): Promise<ScrapedData> {
    let normalizedUrl;
    try {
      normalizedUrl = await normalizeUrl(url);
    } catch (error) {
      console.error(`Error normalizing URL ${url}: ${(error as Error).message}`);
      throw error;
    }
 
    try {
      const response = await fetch(normalizedUrl, {
        method: 'GET',
        headers: scrapeOptions.headers,
        redirect: 'follow',
      });
 
      if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
      }
      const html = await response.text();
      const dom = new JSDOM(html);
      const document = dom.window.document;
      // console.log(console.log(document.documentElement.outerHTML));
 
      return {
        url: normalizedUrl,
        title: extractTitle(document),
        description: extractDescription(document),
        logo: extractLogo(document, normalizedUrl),
        og: extractOgImage(document, normalizedUrl),
      };
    } catch (error) {
      console.error(`Error scraping ${normalizedUrl}: ${(error as Error).message}`);
      throw error;
    }
  }
 
  return {
    scrapeMultiple,
    scrape,
  };
}
 
function normalizeUrl(url: string): string {
  // 匹配有效的URL格式
  const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i;
 
  if (!urlPattern.test(url)) {
    throw new Error("Invalid URL format");
  }
 
  // 如果URL没有协议,添加 'https://'
  if (!url.startsWith('http://') && !url.startsWith('https://')) {
    url = 'http://' + url;
  }
 
  return url;
}
 
function extractTitle(document: Document): string {
  return document.querySelector('title')?.textContent ||
    document.querySelector('meta[property="og:title"]')?.getAttribute('content') ||
    document.querySelector('meta[name="twitter:title"]')?.getAttribute('content') ||
    document.querySelector('h1')?.textContent ||
    '';
}
 
function extractDescription(document: Document): string {
  return document.querySelector('meta[name="description"]')?.getAttribute('content') ||
    document.querySelector('meta[property="og:description"]')?.getAttribute('content') ||
    document.querySelector('meta[name="twitter:description"]')?.getAttribute('content') ||
    document.querySelector('p')?.textContent ||
    '';
}
 
function extractLogo(document: Document, baseUrl: string): string {
  const logoUrl = document.querySelector('link[rel="icon"]')?.getAttribute('href') ||
    document.querySelector('link[rel="shortcut icon"]')?.getAttribute('href') ||
    document.querySelector('meta[property="og:image"]')?.getAttribute('content') ||
    '/favicon.ico';
 
  return new URL(logoUrl, baseUrl).href;
}
 
function extractOgImage(document: Document, baseUrl: string): string {
  const ogImage = document.querySelector('meta[property="og:image"]')?.getAttribute('content') ||
    document.querySelector('meta[name="twitter:image"]')?.getAttribute('content') ||
    '';
 
  return ogImage ? new URL(ogImage, baseUrl).href : '';
}
 
export type ModernMetaScraper = ReturnType<typeof createModernMetaScraper>;

页面调用方法：

import { createModernMetaScraper, ModernMetaScraper } from "@/lib/metaScraper";
 
const scraper: ModernMetaScraper = createModernMetaScraper();
 
export const showcases = [
  {
    title: 'Landing Page Boilerplate',
    description: 'A free, open-source, and powerful landing page boilerplate, ideal for various projects, enabling you to create a landing page in under an hour.',
    url: 'https://landingpage.weijunext.com',
    logo: '',
    og: 'https://landingpage.weijunext.com/og.png'
  },
  {
    url: 'https://PHCopilot.ai/'
  }
]
 
const Showcase = async ({ id, locale }: { id: string; locale: any }) => {
 
  const sites = await scraper.scrapeMultiple(showcases);
 
 
  console.log(sites)
 
  // ……
})

实际效果可以到 Landing Page Boilerplate 查看。

结语

本文的实现效果主要是用于解决自己项目的需求，如果你对展示网站预览卡片需求、场景研究比较深入，可以基于本文实现的工具类二次开发一个普适版本发布 npm 包，相信会有不少下载。

关于我

我是一名全栈工程师，Next.js 开源手艺人，AI降临派。

今年致力于 Next.js 和 Node.js 领域的开源项目开发和知识分享。

欢迎在以下平台关注我：

Twitter(中文): @weijunext
Twitter(English): @wayne_dev
Github: Github
Blog: J实验室
掘金: 程普
知乎: 程普
即刻: BigYe程普
微信公众号: 「BigYe程普」
微信交流群: 全栈交流群