Scraping Weibo Comments and Reposts

The Weibo crawler that had been bothering me for a long time finally has most of its pieces in place.

When it comes to scraping Weibo, the 50-page limit is unavoidable: the regular web API only returns 50 pages of data, which is a real headache.

After patiently digging through a few more examples, I found that they fetch comment and repost data through the mobile API instead.

Collecting the posts themselves is easy with existing GitHub projects, but those projects give little guidance on collecting comments. This post records the core knowledge needed for comment collection.

A Weibo post link on the mobile side looks like https://m.weibo.cn/detail/4474368756963230, on the web side like https://weibo.com/1891858172/IvcFBtdsa, and on the weibo.cn domain like https://weibo.cn/comment/Iqq5GC1JF.

The links we collect give us the string form, IvcFBtdsa, while the program works with the numeric form, 4474368756963230. Both identify the same post, so we need a way to convert between them. I first tried to find the numeric id somewhere in the web page source and parse it out, but that turned out to be slow and messy. Then I came across articles explaining that Weibo ids are simply a base-10 / base-62 conversion, where in base 62 the value 10 is written as lowercase 'a', 36 as uppercase 'A', and so on up to 61 as uppercase 'Z'.

First, split the string IvcFBtdsa into groups of four characters, working from the end backwards, which gives three groups: I / vcFB / tdsa.

Decoding each group from base 62 gives 44, 7436875, and 6963230. Concatenating them (every group except the leading one is zero-padded to seven digits) yields the numeric id 4474368756963230.
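As a quick sanity check, here is the per-group arithmetic, using the 0-9a-zA-Z alphabet defined in the code below (so 'a' maps to 10, 'A' to 36 and 'Z' to 61):

# Base-62 character values: '0'-'9' -> 0-9, 'a'-'z' -> 10-35, 'A'-'Z' -> 36-61
# 'I'    -> 44
assert 31 * 62**3 + 12 * 62**2 + 41 * 62 + 37 == 7436875   # 'vcFB'
assert 29 * 62**3 + 13 * 62**2 + 28 * 62 + 10 == 6963230   # 'tdsa'
print('44' + '7436875' + '6963230')                         # 4474368756963230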

The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__author__ = 'AJay'
__mtime__ = '2020/2/25 0025'

"""
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_encode(num, alphabet=ALPHABET):
    """Encode a number in Base X

    `num`: The number to encode
    `alphabet`: The alphabet to use for encoding
    """
    if num == 0:
        return alphabet[0]
    arr = []
    base = len(alphabet)
    while num:
        rem = num % base
        num = num // base
        arr.append(alphabet[rem])
    arr.reverse()
    return ''.join(arr)


def base62_decode(string, alphabet=ALPHABET):
    """Decode a Base X encoded string into the number

    Arguments:
    - `string`: The encoded string
    - `alphabet`: The alphabet to use for encoding
    """
    base = len(alphabet)
    strlen = len(string)
    num = 0

    idx = 0
    for char in string:
        power = (strlen - (idx + 1))
        num += alphabet.index(char) * (base ** power)
        idx += 1

    return num


def url_to_mid(url):
    # bid -> numeric mid: reverse the string, take 4-character chunks,
    # base62-decode each chunk and zero-pad intermediate chunks to 7 digits.
    url = str(url)[::-1]
    size = len(url) / 4 if len(url) % 4 == 0 else len(url) / 4 + 1
    result = []
    for i in range(int(size)):
        s = url[i * 4: (i + 1) * 4][::-1]
        s = str(base62_decode(str(s)))
        s_len = len(s)
        if i < size - 1 and s_len < 7:
            s = (7 - s_len) * '0' + s
        result.append(s)
    result.reverse()
    return int(''.join(result))


def mid_to_url(midint):
    # numeric mid -> bid: reverse the digits, take 7-digit chunks,
    # base62-encode each chunk and zero-pad intermediate chunks to 4 characters.
    midint = str(midint)[::-1]
    size = len(midint) / 7 if len(midint) % 7 == 0 else len(midint) / 7 + 1
    result = []
    for i in range(int(size)):
        s = midint[i * 7: (i + 1) * 7][::-1]
        s = base62_encode(int(s))
        s_len = len(s)
        if i < size - 1 and len(s) < 4:
            s = '0' * (4 - s_len) + s
        result.append(s)
    result.reverse()
    return ''.join(result)


if __name__ == '__main__':
    print(url_to_mid('IvcFBtdsa'))
    print(mid_to_url('4474368756963230'))

# 4474368756963230
# 000IvcFBtdsa  (the leading zeros are harmless and can be stripped)

With this sorted out, we can batch-convert ids, as sketched below. Next, let's look at collection through the mobile API.
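A minimal batch-conversion sketch, assuming the script above is saved as tools/mid_to_url.py (matching the import used in the scrapers below) and that you have a plain text file with one bid per line; the file name here is just a placeholder:

from tools.mid_to_url import url_to_mid

# bids.txt is a placeholder: one bid string (e.g. IvcFBtdsa) per line
with open('bids.txt') as f:
    for bid in f.read().splitlines():
        if bid:
            print(bid, url_to_mid(bid))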

Collecting Comments via the Mobile API

Comment collection is slightly more involved. Open the developer tools (F12) and you can see an Ajax request whose response contains the comments; the main payload sits in the data field.

The request URL is https://m.weibo.cn/comments/hotflow?id=4461824747382445&mid=4461824747382445&max_id=&max_id=303080471307086&max_id_type=0. Replaying this URL in the browser, surprisingly, returns empty data. Scrolling further down and comparing the requests, the only parameter that changes is max_id. Conclusion: the URL cannot simply be replayed, and which slice of comments you get is determined by max_id.

So where does max_id come from? Looking closely at the returned data, there is a max_id field in it. Confirmed: the max_id for the current request comes from the previous response, which also means each max_id works only once and then expires.
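For reference, the part of the JSON response that matters for pagination looks roughly like this (the field names are the ones the script below reads; the values are illustrative and the comment objects are trimmed):

# Rough shape of one hotflow response (trimmed to the fields used below)
{
    "data": {
        "data": [
            {"text": "...", "created_at": "...", "like_count": 0,
             "floor_number": 1, "source": "...", "user": {"screen_name": "..."}}
        ],
        "max": 50,                  # total number of pages
        "max_id": 303080471307086,  # feed this into the next request's max_id
        "max_id_type": 0
    }
}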

In practice, if you have already reached page 5, requesting page 2 again is unnecessary and unreasonable: page 2 has already been loaded into the page. Refreshing the page issues the first request again and also returns the max_id for the next (second) page. My guess is that the first page can be requested any number of times, but the max_id it returns never repeats.

The rough idea, then: request the first page with an empty max_id, read max_id and max_id_type from each response, and feed them into the next request until the reported number of pages (max) has been fetched.

Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__author__ = 'AJay'
__mtime__ = '2020/2/25 0025'

"""

import random
import time

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

from tools.mid_to_url import url_to_mid

# Base url of the mobile hot-comment endpoint
url = 'https://m.weibo.cn/comments/hotflow?id={id}&mid={mid}&max_id='
headers = {
    'Cookie': 'cookie',
    'Referer': 'https://m.weibo.cn/detail/4281013208904762',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# Save the results to an Excel workbook
wb = Workbook()
ws = wb.active


def get_page(max_id, id_type, mid):
    # Request one page of comments; pagination is driven by max_id / max_id_type
    params = {
        'max_id': max_id,
        'max_id_type': id_type
    }
    try:
        r = requests.get(url.format(id=mid, mid=mid), params=params, headers=headers)
        print(r.url)
        if r.status_code == 200:
            return r.json()
    except requests.ConnectionError as e:
        print('error', e.args)


def parse_page(jsondata):
    # Extract the pagination fields needed for the next request
    if jsondata:
        items = jsondata.get('data')
        item_max_id = {}
        item_max_id['max_id'] = items['max_id']
        item_max_id['max_id_type'] = items['max_id_type']
        item_max_id['max'] = items['max']
        return item_max_id


def write_csv(jsondata):
    # Append each comment of the current page to the worksheet
    datas = jsondata.get('data').get('data')
    for data in datas:
        created_at = data.get("created_at")
        like_count = data.get("like_count")
        source = data.get("source")
        floor_number = data.get("floor_number")
        username = data.get("user").get("screen_name")
        comment = data.get("text")
        comment = BeautifulSoup(comment, 'html.parser').get_text()
        print('floor {}, comment {}'.format(floor_number, comment))
        ws.append([username, created_at, like_count, floor_number, source, comment])


def run(url):
    # Input is the bid taken from a collected post url; all comments are saved
    # to an xlsx named after it
    m_id = 0
    id_type = 0
    mid = url_to_mid(url=url)
    jsondata = get_page(m_id, id_type, mid=mid)
    results = parse_page(jsondata)
    maxpage = results['max']
    for page in range(maxpage):
        print('collecting page {}'.format(page))
        jsondata = get_page(m_id, id_type, mid)
        print(jsondata)
        write_csv(jsondata)
        results = parse_page(jsondata)
        time.sleep(random.randint(2, 5))
        if page % 30 == 0:
            time.sleep(60)
        m_id = results['max_id']
        id_type = results['max_id_type']
    wb.save('xlsx/{}.xlsx'.format(url))


if __name__ == '__main__':
    with open('gugong.txt', 'r+') as f:
        for i in f.read().split('\n'):
            print(i)
            run(url=i)

Collecting Weibo Reposts

By comparison, repost content is easy to obtain, again by capturing the mobile API traffic. https://m.weibo.cn/api/statuses/repostTimeline takes two parameters, id and page, and the total page count is also present in the data of the first page. There is not much more to say about this kind of collection.
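Since the script below simply hard-codes a generous page limit, here is a minimal sketch of reading the total page count from the first page instead. It assumes the count is exposed as data['max'], analogous to the comments API above; I have not verified the exact field name:

import requests

def get_total_pages(mid, headers):
    # Assumption: the repostTimeline response reports the total page count
    # as data['max'], like the hotflow comments API does.
    r = requests.get('https://m.weibo.cn/api/statuses/repostTimeline',
                     params={'id': mid, 'page': 1}, headers=headers)
    return r.json().get('data', {}).get('max', 0)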

Code


#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__author__ = 'AJay'
__mtime__ = '2020/2/25 0025'

"""

import random
import time

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

from tools.mid_to_url import url_to_mid

# Repost timeline endpoint of the mobile API
url = 'https://m.weibo.cn/api/statuses/repostTimeline'
headers = {
    'Cookie': 'cookie',
    'Referer': 'https://m.weibo.cn/detail/4281013208904762',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# Save the results to an Excel workbook
wb = Workbook()
ws = wb.active


def get_page(id, page):
    # Request one page of reposts; pagination here is a plain page number
    params = {
        'id': id,
        'page': page
    }
    try:
        r = requests.get(url, params=params, headers=headers)
        if r.status_code == 200:
            return r.json()
    except requests.ConnectionError as e:
        print('error', e.args)


def write_csv(jsondata):
    # Append each repost of the current page to the worksheet
    datas = jsondata.get('data').get('data')
    for data in datas:
        created_at = data.get("created_at")
        source = data.get("source")
        username = data.get("user").get("screen_name")
        comment = data.get("text")
        comment = BeautifulSoup(comment, 'html.parser').get_text()
        ws.append([username, created_at, source, comment])


def run(url):
    # Page through the repost timeline until a page can no longer be parsed
    maxpage = 1000
    mid = url_to_mid(url=url)
    for page in range(1, maxpage):
        print(page)
        jsondata = get_page(mid, page)
        try:
            write_csv(jsondata)
        except:
            break
        time.sleep(random.randint(2, 5))
        if page % 30 == 0:
            time.sleep(60)
    wb.save('xlsx/zhuanfa/{}.xlsx'.format(url))


if __name__ == '__main__':
    with open('gugong.txt', 'r+') as f:
        for i in f.read().split('\n'):
            print(i)
            run(url=i)