Two pitfalls when scraping a GB2312-encoded website with Node.js

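First, parsing the fetched HTML. Node.js decodes text as UTF-8 by default, so the fetched HTML has to be transcoded from GB2312; iconv.decode(body, "gb2312") from iconv-lite does this. For the decoding to work, body must still be a raw Buffer, which is why the request sets encoding: null.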
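To illustrate why the raw bytes matter, here is a minimal sketch with no dependencies. It uses Node's built-in TextDecoder as a stand-in for iconv-lite, which assumes a Node build with full ICU support (the default for official releases); the "gbk" label is used because GBK is a superset of GB2312. The four bytes are the GB2312 encoding of "中文" and stand in for a real response body.

```javascript
// GB2312 bytes for "中文", standing in for a raw response body (a Buffer).
const rawBody = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// Wrong: reading the bytes as UTF-8 produces replacement characters (U+FFFD),
// and the original bytes cannot be recovered from that string afterwards.
const asUtf8 = rawBody.toString("utf8");
const lossy = !Buffer.from(asUtf8, "utf8").equals(rawBody);

// Right: decode the untouched bytes with the page's real encoding.
// Assumption: TextDecoder supports "gbk" here (requires full ICU).
const decoded = new TextDecoder("gbk").decode(rawBody);

console.log(asUtf8.includes("\uFFFD"), lossy, decoded);
```

If the body has already been turned into a UTF-8 string by the time it reaches your callback, no decoder can repair it, so the Buffer has to be preserved first.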

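Second, because the site uses GB2312, POST data must also be URL-encoded as GB2312. I recommend the urlencode2 library, which can URL-encode in encodings other than UTF-8. That alone is not enough: request must not encode the data a second time, so encode the value manually and send it as raw data through the body option.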
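As a rough sketch of what URL-encoding in GB2312 means at the byte level, the helper below percent-encodes raw bytes. The function name percentEncodeBytes is hypothetical and is not part of urlencode2; the hard-coded bytes are the GB2312 encoding of "中文", which in real code could come from iconv.encode("中文", "gb2312").

```javascript
// Percent-encode raw bytes: unreserved characters (A-Z a-z 0-9 - _ . ~)
// pass through unchanged, every other byte becomes %XX.
// Hypothetical helper for illustration only.
function percentEncodeBytes(bytes) {
  let out = "";
  for (const byte of bytes) {
    const ch = String.fromCharCode(byte);
    if (/[A-Za-z0-9\-_.~]/.test(ch)) {
      out += ch;
    } else {
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

// GB2312 bytes for "中文" (two bytes per character).
const gbBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

const encodedGb = percentEncodeBytes(gbBytes);  // GB2312 form of the value
const encodedUtf8 = encodeURIComponent("中文"); // what UTF-8 encoding would send

// The body string is already fully encoded, so it must be sent as-is.
const body = "key=" + encodedGb;

console.log(encodedGb, encodedUtf8, body);
```

The two encoded strings differ, which is why a GB2312 server cannot read a value that was encoded as UTF-8. Note that this sketch encodes a space as %20 rather than +.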

The code snippet:

javascript

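const request = require("request");
const urlencode = require("urlencode2");
const iconv = require("iconv-lite");

// Placeholder form value; replace it with the text you actually need to submit.
const value = "中文";

const post_request = request.post(
  {
    // encoding: null makes request hand back the body as a raw Buffer
    // instead of a UTF-8 string.
    encoding: null,
    url: "http://xxx",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
    },
    // The body is already encoded by hand, so request sends it unchanged.
    body: "key=" + urlencode(value, "gb2312"),
  },
  function (error, response, body) {
    if (error) {
      console.error(error);
      return;
    }
    // Decode the GB2312 bytes into a JavaScript string.
    console.log(iconv.decode(body, "gb2312"));
  }
);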

