Collecting Product Data from Alibaba E-commerce Platforms: Approach and Practice

Notes and lessons learned from crawling Taobao.

Alibaba platforms

Heavy anti-crawling measures: Taobao, Tmall, 1688, ...

Light anti-crawling measures: Xianyu, eTao, and other sub-platforms

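Crawling approaches

Simulated requests: requests. Pros: fast. Cons: does not work against sites with very strong anti-crawling measures.

Browser automation: webdriver. Pros: requests are complete, so the chance of being blocked is lower. Cons: crawling is slow.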

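Anti-crawling techniques

IP, request rate, device (code-level restrictions, pseudo-WAF)
Account, CAPTCHA (cookie)
JS (detection scripts)
Font-face (Dianping)
Background (Maoyan data)
Interleaved characters (crawl results contain dirty data)
Element overlay (hidden data)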

How to get past anti-crawling measures

Proxies
More complete request disguise (UA, Referer, Data)
API endpoints
Mobile site (https://m.xx.com/)
Reverse-engineering the JS
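
The first few items above (a proxy, fuller headers, the mobile site) can be sketched with the standard library alone. This is a minimal sketch under assumptions: the proxy address, the header values, and the helper names `build_request` and `build_opener` are hypothetical, and `https://m.xx.com/` is the placeholder host from the list, not a real endpoint.

```python
import urllib.request

# Hypothetical values for illustration only.
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 "
    "Mobile/15E148 Safari/604.1"
)
PROXY = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}


def build_request(url, referer):
    """Build a GET request that carries browser-like headers."""
    headers = {
        "User-Agent": MOBILE_UA,
        "Referer": referer,
        "Accept-Language": "zh-CN,zh;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)


def build_opener(proxies=PROXY):
    """Return an opener that routes traffic through the given proxy."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


req = build_request("https://m.xx.com/", referer="https://m.xx.com/")
opener = build_opener()
# opener.open(req, timeout=10) would send the request through the proxy.
```

The same headers and proxy mapping can be passed to requests as `requests.get(url, headers=..., proxies=...)`.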

Choosing a crawling approach for Alibaba platforms

requests?

webdriver?

Reverse the Xsign algorithm? Filtering through the app yields 2,000 records.

Browser automation (webdriver)

Two main drivers: Chrome and Firefox (IE is also available)

Speed differs: Chrome starts up far faster than Firefox

Browser fingerprint:

from selenium import webdriver
options = webdriver.ChromeOptions()
browser = webdriver.Chrome(options=options)
browser.get('https://www.zut.edu.cn/')

# browser.close()

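Automated browser

window.navigator.webdriver
true

Real browser

window.navigator.webdriver
undefined

Hiding the property (on newer Chrome versions this switch alone may not be enough; the --disable-blink-features=AutomationControlled argument is commonly added as well)

options.add_experimental_option('excludeSwitches', ['enable-automation'])

Other properties that Alibaba checks (less obvious)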

'selenium-evaluate', 'webdriverCommand', 'webdriver-evaluate-response', '__webdriverFunc',
'__webdriver_script_fn', '__$webdriverAsyncExecutor', '__lastWatirAlert',
'__lastWatirConfirm', '__lastWatirPrompt', '$chrome_asyncScriptInfo',
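
To see which of these markers a given session actually exposes, a small probe can be run inside the page. This is a sketch, not the detection logic Alibaba uses: the helper `build_probe_script` is hypothetical, the marker list is shortened, and the Selenium call is shown only as a comment because it needs a live browser.

```python
import json

# Marker names taken from the list above (shortened for the example).
MARKERS = [
    "__webdriver_script_fn",
    "__$webdriverAsyncExecutor",
    "__lastWatirAlert",
    "$chrome_asyncScriptInfo",
]


def build_probe_script(markers):
    """Return JavaScript that lists which marker names exist on window or document."""
    names = json.dumps(list(markers))
    return (
        "var names = " + names + ";"
        "return names.filter(function (n) {"
        " return (n in window) || (n in document); });"
    )


script = build_probe_script(MARKERS)
# With a live Selenium session (not started here):
#     found = browser.execute_script(script)
#     print(found)  # marker names that a page script could also see
```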

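Crawling Taobao data

0x01 Login

Launch the automated browser -> blank session

Manual login
Simulated login [Taobao's own login form, third-party Weibo account login]
Inject cookies into the automated browser
Load a profile into the automated browser

Simulated login

Pause the program and type the credentials by hand
Locate the input elements and let the program fill them in

Problem: Taobao login frequently triggers slider verification.

Inject post-login cookies from the program to bypass login
Load a profile to bypass login

Drawback: the cookies expire after a while.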

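Dealing with slider verification

from selenium.webdriver import ActionChains

# self.browser is the webdriver instance; getcache is the slider element located earlier
action = ActionChains(self.browser)
action.click_and_hold(getcache).perform()
action.move_to_element_with_offset(to_element=getcache, xoffset=288, yoffset=0).perform()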
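
A single 288-pixel jump looks nothing like a hand-driven drag, so one common variation is to split the distance into many small steps. The sketch below is an assumption on my part, not something the original code does: `build_track` is a hypothetical helper, the ease-out curve and jitter are arbitrary choices, and nothing here guarantees that the verification passes.

```python
import random


def build_track(distance, steps=20, seed=None):
    """Split `distance` pixels into ease-out steps with small jitter.

    The offsets always sum to exactly `distance`.
    """
    rng = random.Random(seed)
    positions = []
    for i in range(1, steps + 1):
        t = i / steps
        eased = 1 - (1 - t) ** 2  # fast start, slow finish
        positions.append(round(distance * eased))
    track, previous = [], 0
    for pos in positions:
        track.append(pos - previous)
        previous = pos
    # Shift one pixel between neighbouring steps; the total is unchanged.
    for i in range(len(track) - 1):
        shift = rng.choice([-1, 0, 1])
        if track[i] + shift >= 0 and track[i + 1] - shift >= 0:
            track[i] += shift
            track[i + 1] -= shift
    return track


track = build_track(288, seed=1)
# With Selenium (assumed live session `browser` and slider element `slider`):
#     ActionChains(browser).click_and_hold(slider).perform()
#     for dx in track:
#         ActionChains(browser).move_by_offset(dx, 0).perform()
#     ActionChains(browser).release().perform()
```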

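Why does the slider always fail under automation?

Symptom: in a real browser one swipe passes; in the automated browser it always fails.

Cause: this is one of Alibaba's anti-crawling mechanisms. It checks whether the visitor is a normal user by inspecting browser properties. So how can the automated browser avoid being detected?

These properties are built into the automated browser, and they are hard to remove from the code side.

Use a man-in-the-middle proxy to strip the properties

Mitmproxy

Man-in-the-middle proxy

Modifies requests and responses

Integrates easily with Python

Built-in display of captured traffic

Installation:
Download from the official site https://百度一下.com
Install the Python package: pip install mitmproxy

Strip the JS that Alibaba uses to detect browser properties, and modify the headless browser's properties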

# -*- coding: utf-8 -*-
from mitmproxy import ctx


def response(flow):
    # Append an override to the scripts that run the detection.
    if 'um.json' in flow.request.url or 'xxx.js' in flow.request.url or '/sufei_data/3.6.12/index.js' in flow.request.url:
        # Block Selenium detection
        flow.response.text = flow.response.text + 'Object.defineProperties(navigator,{webdriver:{get:() => false}}); '

    # Rewrite every quoted marker name so the page's checks look for a property that never exists.
    for webdriver_key in ['webdriver', '__driver_evaluate', '__webdriver_evaluate', '__selenium_evaluate',
                          '__fxdriver_evaluate', '__driver_unwrapped', '__webdriver_unwrapped', '__selenium_unwrapped',
                          '__fxdriver_unwrapped', '_Selenium_IDE_Recorder', '_selenium', 'calledSelenium',
                          '_WEBDRIVER_ELEM_CACHE', 'FirefoxDriverw', 'driver-evaluate', 'webdriver-evaluate',
                          'selenium-evaluate', 'webdriverCommand', 'webdriver-evaluate-response', '__webdriverFunc',
                          '__webdriver_script_fn', '__$webdriverAsyncExecutor', '__lastWatirAlert',
                          '__lastWatirConfirm', '__lastWatirPrompt', '$chrome_asyncScriptInfo',
                          '$cdc_asdjflasutopfhvcZLmcfl_']:
        ctx.log.info('Remove "{}" from {}.'.format(webdriver_key, flow.request.url))
        flow.response.text = flow.response.text.replace('"{}"'.format(webdriver_key), '"NO-SUCH-ATTR"')
    flow.response.text = flow.response.text.replace('t.webdriver', 'false')
    flow.response.text = flow.response.text.replace('FirefoxDriver', '')
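
The string replacement at the heart of the script can be pulled out into a pure function, which makes it testable without running a proxy. This is a sketch: `scrub` is a hypothetical helper, the key list is shortened, and unlike the script above it reports only the names that were actually found.

```python
# Shortened marker list; the full list is in the script above.
WEBDRIVER_KEYS = [
    "__webdriver_evaluate",
    "__selenium_evaluate",
    "$cdc_asdjflasutopfhvcZLmcfl_",
]


def scrub(text, keys=WEBDRIVER_KEYS, placeholder="NO-SUCH-ATTR"):
    """Replace quoted marker names and report which ones were present."""
    removed = []
    for key in keys:
        quoted = '"{}"'.format(key)
        if quoted in text:
            text = text.replace(quoted, '"{}"'.format(placeholder))
            removed.append(key)
    return text, removed


sample = 'var k = ["__webdriver_evaluate", "other"];'
cleaned, removed = scrub(sample)
print(cleaned)
print(removed)
```

Inside the response hook this could be called as `flow.response.text, removed = scrub(flow.response.text)`, logging only when `removed` is non-empty. The script itself is typically loaded with `mitmdump -s`, and the browser is pointed at the proxy's listening address.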

Injecting cookies

Simple injection

for cookie in cookies:
    self.browser.add_cookie({k: v for k, v in cookie.items()})
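
Cookies saved from an earlier session often contain expired entries or extra keys from export tools, so it can help to clean them before injection. The following is a sketch under assumptions: `sanitize_cookies` is a hypothetical helper, and the sample cookies and the `.example.com` domain are made up.

```python
import time

# Keys that Selenium's add_cookie accepts; everything else is dropped.
ALLOWED_KEYS = {"name", "value", "path", "domain", "secure", "httpOnly", "expiry", "sameSite"}


def sanitize_cookies(cookies, now=None):
    """Drop expired cookies and unknown keys; coerce expiry to int."""
    now = time.time() if now is None else now
    cleaned = []
    for cookie in cookies:
        expiry = cookie.get("expiry")
        if expiry is not None and expiry <= now:
            continue  # already expired, injecting it would not help
        item = {k: v for k, v in cookie.items() if k in ALLOWED_KEYS}
        if "expiry" in item:
            item["expiry"] = int(item["expiry"])
        cleaned.append(item)
    return cleaned


# Made-up sample data.
saved = [
    {"name": "a", "value": "1", "domain": ".example.com", "expiry": 2000000000.5},
    {"name": "b", "value": "2", "domain": ".example.com", "expiry": 100},
    {"name": "c", "value": "3", "domain": ".example.com", "storeId": "0"},
]
usable = sanitize_cookies(saved, now=1700000000)
# With Selenium, open a page on the cookies' domain first, then:
#     for cookie in usable:
#         browser.add_cookie(cookie)
```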

Loading a profile (Firefox as an example)

Cmd > firefox.exe -p C://1

Locate the default browser profile

\Users\Administrator\AppData\Roaming\Mozilla\Firefox\Profiles

Have the automated browser load the ua5yumsf.default folder

webdriver.FirefoxProfile(r"xxx\firefoxprofile")

The automated browser now no longer starts blank.

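0x02 Data collection: API

Request the API through the automated browser; it returns JSON. Requesting too quickly, however, triggers Xsign verification.

Ways around the verification: wait between requests, or work out the Xsign algorithm.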
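
The waiting approach can be sketched as a retry loop whose delay grows after each blocked response. Everything here is an assumption for illustration: the helper names, the delay parameters, and the `"xsign"` string that stands in for a blocked response are hypothetical, and the demo uses canned responses rather than a real endpoint.

```python
import random
import time


def backoff_delays(base=2.0, factor=2.0, cap=60.0, retries=5, seed=None):
    """Yield wait times that grow after each failed attempt, with jitter."""
    rng = random.Random(seed)
    delay = base
    for _ in range(retries):
        # Jitter of +/-20% so requests do not arrive on a fixed rhythm.
        yield min(cap, delay) * rng.uniform(0.8, 1.2)
        delay *= factor


def fetch_with_wait(fetch, is_blocked, sleep=time.sleep, **backoff):
    """Call `fetch` until `is_blocked(result)` is false or retries run out."""
    result = fetch()
    for delay in backoff_delays(**backoff):
        if not is_blocked(result):
            return result
        sleep(delay)
        result = fetch()
    return None if is_blocked(result) else result


# Demo with canned responses instead of a real endpoint.
responses = iter(["xsign", "xsign", '{"items": []}'])
waits = []
result = fetch_with_wait(
    fetch=lambda: next(responses),
    is_blocked=lambda body: body == "xsign",
    sleep=waits.append,  # record the delays instead of sleeping
    seed=0,
)
```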

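0x03 Data collection: verification

Have the automated browser imitate a user visiting pages to collect the data.

Tips:

Disable image loading to shorten response time
Route traffic through mitmproxy to strip the automated browser's properties
Install the certificate in the automated browser
Configure a proxy for the automated browser
Set an interval between requests
Set timeouts for waits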
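
Several of these tips map onto Chrome options. The sketch below assumes a proxy listening on 127.0.0.1:8080 and uses hypothetical helper names; `make_browser` needs selenium and a chromedriver and is not called here. Ignoring certificate errors is used as a shortcut in place of installing the proxy's certificate.

```python
def chrome_settings(proxy="127.0.0.1:8080", load_images=False):
    """Return (arguments, prefs) for a Chrome session."""
    arguments = [
        "--proxy-server=http://{}".format(proxy),
        # Accept the proxy's certificate without installing it; installing
        # the mitmproxy CA certificate is the stricter alternative.
        "--ignore-certificate-errors",
    ]
    prefs = {}
    if not load_images:
        # 2 means "block" for this Chrome content setting.
        prefs["profile.managed_default_content_settings.images"] = 2
    return arguments, prefs


def make_browser(page_timeout=30, wait_timeout=10):
    """Start Chrome with the settings above (requires selenium and chromedriver)."""
    from selenium import webdriver

    arguments, prefs = chrome_settings()
    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    options.add_experimental_option("prefs", prefs)
    browser = webdriver.Chrome(options=options)
    browser.set_page_load_timeout(page_timeout)  # timeout for page loads
    browser.implicitly_wait(wait_timeout)        # timeout for element lookups
    return browser


arguments, prefs = chrome_settings()
```

The interval between requests is left to the crawl loop, for example a randomized `time.sleep` between page visits.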

Slider CAPTCHA

After stripping the automated browser's properties with mitmproxy, the slider pass rate is 70%+

0x04 Data collection: storage

Data storage
Excel
MySQL
MongoDB
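
For the Excel option, the simplest route is often a CSV file that Excel can open. This sketch uses the standard library's csv module rather than an Excel library, and the column names and sample row are hypothetical.

```python
import csv
import io

FIELDS = ["title", "price", "shop"]  # hypothetical columns


def write_rows(rows, stream):
    """Write dict rows as CSV, ignoring keys that are not in FIELDS."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Demo: write to memory. For a file that Excel opens with Chinese text intact,
# use open(path, "w", newline="", encoding="utf-8-sig") as the stream.
buffer = io.StringIO()
write_rows(
    [{"title": "demo item", "price": "9.90", "shop": "demo shop", "raw": "x"}],
    buffer,
)
```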

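0x05 Data collection: architecture

How should a reasonably good collection architecture be designed?

Problems observed:

A single account crawling for a long time gets flagged for frequent access (the "Xiao'er" slider page)
The slider verification pass rate is not high
After being flagged, switching cookies restores normal access
The crawl rate cannot be too high
Product API endpoints can be found, but they break easily and trigger the Xiao'er page

Crawler requirements:

Crawl fully automatically
Log in to Taobao accounts automatically
Detect whether the current user's cookie has expired
Detect a verification prompt and perform the slide
Retry when verification fails
Store and display the data
Maximize crawler throughput where possible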

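Plan 1: single account

Crawl with the automated browser. Check the user's cookie; if it is missing or expired, clear the cookies and log in again. Go to the target keyword or page and crawl. Check whether slider verification appears and perform the slide; if the slide fails, slide again. Once the page data is obtained, parse it with XPath, store it, and display it.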
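
The control flow of this plan can be sketched as one function over a session object. The `session` interface (`cookie_valid`, `login`, `has_slider`, and so on) is hypothetical, as is the retry limit; `FakeSession` exists only to exercise the flow without a browser.

```python
def crawl_page(session, max_slides=3):
    """Run one pass of the single-account flow; return parsed rows or None."""
    if not session.cookie_valid():
        session.clear_cookies()
        session.login()
    session.open_target()
    attempts = 0
    while session.has_slider():
        if attempts >= max_slides:
            return None  # give up on this page; the caller may retry later
        session.slide()
        attempts += 1
    return session.parse()


class FakeSession:
    """Stand-in used only to exercise the control flow."""

    def __init__(self, sliders_before_pass):
        self.remaining = sliders_before_pass
        self.log = []

    def cookie_valid(self):
        return False

    def clear_cookies(self):
        self.log.append("clear")

    def login(self):
        self.log.append("login")

    def open_target(self):
        self.log.append("open")

    def has_slider(self):
        return self.remaining > 0

    def slide(self):
        self.remaining -= 1
        self.log.append("slide")

    def parse(self):
        return [{"title": "demo"}]


session = FakeSession(sliders_before_pass=2)
rows = crawl_page(session)
```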

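Plan 2: multiple accounts

Use multiple Taobao accounts, with a user database and a cookie database. Build two systems: a crawler system and a login system. The login system logs users in and saves each user's cookie to a cookie pool. The crawler system picks a random cookie from the pool, crawls, and saves the data. When a cookie expires, the crawler hands the corresponding user to the login system, which logs in again and replaces the old cookie with a new one. The crawler system fetches data from the API endpoints with requests; the automated browser is used only by the login system. This greatly improves crawling throughput.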
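
The cookie pool that connects the two systems can be sketched as an in-memory class. This is an illustration under assumptions, not the structure used in the linked repository: the class and method names are hypothetical, and a real deployment would back the pool with the user and cookie databases mentioned above.

```python
import random
from collections import deque


class CookiePool:
    """Maps account name to cookies and queues accounts that need re-login."""

    def __init__(self, seed=None):
        self._cookies = {}
        self._relogin = deque()
        self._rng = random.Random(seed)

    def put(self, account, cookies):
        """Called by the login system after a successful login."""
        self._cookies[account] = cookies

    def pick(self):
        """Called by the crawler: return a random (account, cookies) pair."""
        if not self._cookies:
            raise LookupError("cookie pool is empty")
        account = self._rng.choice(sorted(self._cookies))
        return account, self._cookies[account]

    def report_invalid(self, account):
        """Called by the crawler when a cookie stops working."""
        if self._cookies.pop(account, None) is not None:
            self._relogin.append(account)

    def next_relogin(self):
        """Called by the login system: next account to log in again, or None."""
        return self._relogin.popleft() if self._relogin else None


# Made-up accounts and cookie values.
pool = CookiePool(seed=0)
pool.put("user_a", {"sid": "1"})
pool.put("user_b", {"sid": "2"})
account, cookies = pool.pick()
pool.report_invalid(account)
```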

Code:

https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/TaobaoCrawler