
Writing a video crawler for a live-streaming site

pyppeteer pyquery jsonloads

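I really like one uploader's Dragon Ball commentary videos, but neither youtube-dl nor you-get supports the site, so I wrote this. It is not rigorous, just an outline of the approach.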

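First, crawl the streamer's home page: iterate over the paginated video list, wait for the elements to load, and collect every video link (the same technique covered earlier when discussing pyppeteer). The results are saved to a file.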

parse.py
# coding=utf-8
import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

# Pages 1 to 5 of the uploader's video list
urls = ["https://v.huya.com/u/146501201/video.html?sort=news&p={}".format(i) for i in range(1, 6)]

async def main():
    browser = await launch()
    page = await browser.newPage()
    with open("/Users/ming/projects/huyaPyDwon/down.txt", 'a') as f:
        for url in urls:
            await page.goto(url)
            # Wait until the video entries have been rendered
            await page.waitForSelector('.content-list .statpid')
            doc = pq(await page.content())
            pink_link = "https://v.huya.com"
            # hrefs are site-relative, so prefix them with the host
            names = [pink_link + item.attr('href') for item in doc('.content-list .statpid').items()]
            for name in names:
                f.write(name + '\n')

            # print('Names:', names)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
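
Because down.txt is opened in append mode, running parse.py a second time writes every link again. Below is a minimal sketch of one way to build the absolute links while skipping duplicates. `build_links` is a hypothetical helper that is not part of the original script, and the sample paths are made up; the `/play/<id>.html` shape is only an assumption about what the hrefs look like.

```python
from urllib.parse import urljoin

BASE_URL = "https://v.huya.com"


def build_links(hrefs, seen=None):
    """Turn site-relative hrefs into absolute URLs, skipping repeats.

    `seen` is an optional set of links that were already written
    (for example, the current contents of down.txt).
    """
    seen = set() if seen is None else seen
    links = []
    for href in hrefs:
        link = urljoin(BASE_URL, href)
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


# Made-up sample paths, for illustration only.
sample = ["/play/100000001.html", "/play/100000002.html", "/play/100000001.html"]
print(build_links(sample))
```

In parse.py the `seen` set could be filled from the existing file before the loop, so that only new links are appended.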

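Next, iterate over the links saved in the file, call the video API for each one, parse the JSON response to get the m3u8 playlist URL, and finally hand it to youtube_dl for download.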

down.py
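import json
import time

import requests
import youtube_dl

# API_URL (the video-info endpoint, with a {} placeholder for the video id)
# and headers (the request headers) are used below but are not shown in this
# listing; define them here before running.

# youtube_dl wrapper
def down(data, name):
    ydl_opts = {
        'nooverwrites': True,
        'ignoreerrors': True,
        'retries': True,
        'outtmpl': name,
    }
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download([data])

# Read, parse, download
with open("/Users/ming/projects/huyaPyDwon/down.txt", 'r') as f:
    lines = f.readlines()
# print(lines)
# Characters 24..32 of each saved link hold the video id
video_ids = [line.strip()[24:33] for line in lines]
api_urls = [API_URL.format(video_id) for video_id in video_ids]
for api_url in api_urls:
    data = json.loads(requests.get(api_url, headers=headers).text)
    m3u8Link = data['data']['moment']['videoInfo']['definitions'][0]['m3u8']
    name = data['data']['moment']['videoInfo']['videoTitle']
    down(m3u8Link, name)

    # Pause between videos to avoid hammering the site
    time.sleep(5)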
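
The fixed slice `[24:33]` and the hard-coded `definitions[0]` both fail silently or crash if the link shape or the response changes. Here is a hedged sketch of a more defensive version of those two steps. The `/play/<digits>.html` pattern is inferred from the slice offsets rather than confirmed, the sample payload is invented to mirror the key path used above, and `extract_video_id` and `pick_m3u8` are hypothetical names.

```python
import re

# Assumed link shape: https://v.huya.com/play/<digits>.html
VIDEO_ID_RE = re.compile(r"/play/(\d+)\.html")


def extract_video_id(link):
    """Return the numeric id from a video link, or None if it does not match."""
    match = VIDEO_ID_RE.search(link)
    return match.group(1) if match else None


def pick_m3u8(data):
    """Return (title, m3u8_url) for the first definition that has a URL.

    Returns None when the expected keys are missing or no URL is present.
    """
    try:
        info = data["data"]["moment"]["videoInfo"]
        definitions = info["definitions"]
    except (KeyError, TypeError):
        return None
    for definition in definitions or []:
        url = definition.get("m3u8")
        if url:
            return info.get("videoTitle"), url
    return None


# Invented payload that only mirrors the key path used in down.py.
sample = {
    "data": {
        "moment": {
            "videoInfo": {
                "videoTitle": "demo",
                "definitions": [
                    {"m3u8": ""},
                    {"m3u8": "https://example.com/v.m3u8"},
                ],
            }
        }
    }
}

print(extract_video_id("https://v.huya.com/play/123456789.html"))
print(pick_m3u8(sample))
```

With these helpers, the download loop can skip a link whose id or playlist URL is missing instead of stopping on a `KeyError`.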