Quickly Scraping All the Sysadmin Casts Episodes
This morning I got an email from Sysadmin Casts: the author's personal situation has changed, and all episodes are now free. Paying from here was always inconvenient, so I never bought the PRO plan and missed quite a few episodes, which was a pity. Now I can binge to my heart's content.
Hey, https://sysadmincasts.com/ here. My personal situation has changed. Site is back on a "free for all" model. All PRO episodes unlocked. Happy watching!! https://sysadmincasts.com/ Cheers, Justin -- You are receiving this email because you opted in at my website https://sysadmincasts.com/
While watching, it occurred to me: what if the author shuts the site down someday and the episodes vanish? Time to fall back on my day job and write a crawler to back them up.
Of course, sticking to my principle of trying a new wheel every time, I took this as a chance to experiment with kennethreitz's new project requests-html. No Scrapy, and no requests + Beautiful Soup combo either, and it still gets the whole job done. Pretty sweet, right?
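For anyone new to it, here is a quick, self-contained taste of the two requests-html features this post leans on (the URL and selectors are just for illustration):

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://sysadmincasts.com/')

# CSS selectors: find() returns a list, or a single element with first=True
title = r.html.find('title', first=True).text

# every element exposes its link targets, already resolved to absolute URLs
links = {link for a in r.html.find('a') for link in a.absolute_links}
print(title, len(links))
```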
The scrape itself is simple: open "All Episodes", page through it to collect the link to each episode's content page, then fetch the content.
Without further ado, here's the code:
```python
# sysadmin_casts.py
import json
import logging

import ipdb
from requests_html import HTMLSession

logger = logging.getLogger(__name__)


class SysAdminCasts(object):

    def __init__(self):
        self.session = HTMLSession()

    def run(self):
        self.page_links = set()      # pagination links discovered so far
        self.urls = set()            # episode content-page links
        self.asset_urls = set()      # videos, posters, inline images
        self.visited_links = set()
        self.episodes = []
        first_url = 'https://sysadmincasts.com/episodes?page=0'
        # walk the "All Episodes" pagination until no unvisited page remains
        while True:
            if not self.visited_links:
                url = first_url
            else:
                need_visit = self.page_links - self.visited_links
                if not need_visit:
                    break
                url = need_visit.pop()
            logger.debug(f'download links from {url}')
            self.visited_links.add(url)
            response = self.session.get(url)
            html = response.html
            self.page_links |= {_.absolute_links.pop() for _ in html.find('a.page-link')}
            self.urls |= {_.absolute_links.pop() for _ in html.find('.card a.mb-1')}
        list(map(self.download_episode, sorted(self.urls)))
        logger.debug('saving episodes')
        with open('episodes.json', 'w') as f:
            f.write(json.dumps(self.episodes))
        list(map(self.download_assets, self.asset_urls))

    def download_assets(self, url):
        pass

    def download_episode(self, url):
        logger.debug(f'download episode meta from {url}')
        post = {
            'url': url,
            'poster': '',
            'sources': [],
        }
        response = self.session.get(url)
        html = response.html
        try:
            post['title'] = html.find('h2.display-5')[0].text
            el_vjs = html.find('#my_video.video-js', first=True)
            if el_vjs:
                post['poster'] = el_vjs.attrs['poster']
                post['sources'] = [_.attrs['src'] for _ in el_vjs.find('source')]
            el_article = html.find('section.space-sm article', first=True)
            post['html_content'] = el_article.html
            post['text_content'] = el_article.text
            el_meta = html.find('.card ul.list-group-flush', first=True)
            post['meta'] = {
                _.find('div>div', first=True).text: _.find('div>span', first=True).text
                for _ in el_meta.find('li>div')
            }
            # collect inline images here, while el_article is guaranteed to exist
            self.asset_urls |= {_.attrs['src'] for _ in el_article.find('img')}
        except Exception as e:
            logger.error(f'parse episode error {e}', exc_info=True)
            ipdb.set_trace()
        self.episodes.append(post)
        self.asset_urls |= set(post['sources'])


if __name__ == '__main__':
    logging.basicConfig(
        format='%(asctime)s - %(name)s:%(lineno)d - %(levelname)s - %(message)s',
        level=logging.DEBUG)
    spider = SysAdminCasts()
    spider.run()
```
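Note that download_assets is deliberately left as a stub: the videos get downloaded with wget below. If you also want to mirror the posters and inline images, here is a minimal sketch of one way to fill it in (my own addition, assuming the last path segment of each URL is an acceptable filename):

```python
import os
from urllib.parse import urlparse

# drop-in replacement for SysAdminCasts.download_assets
def download_assets(self, url):
    # derive a local filename from the URL path; skip files we already have
    filename = os.path.basename(urlparse(url).path)
    if not filename or os.path.exists(filename):
        return
    logger.debug(f'download asset from {url}')
    # HTMLSession subclasses requests.Session, so streaming works as usual
    response = self.session.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=64 * 1024):
            f.write(chunk)
```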
Save the file, install the dependencies, and run:
```bash
pip install requests_html ipdb
python3 sysadmin_casts.py
```
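(Both requests-html and the f-strings in the script require Python 3.6 or newer.)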
While it runs, any unexpected error stops the script so you can debug it at your leisure; parsing errors drop you into ipdb so you can poke at the selectors interactively. When it finishes, an episodes.json file appears in the current directory, holding the content metadata for every episode. Of course, what we really want are the videos, so let's write one more snippet that extracts the download URLs and hands them to wget.
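For reference, each entry in episodes.json carries the fields that download_episode builds above (values shortened here for illustration):

```python
episode = {
    'url': 'https://sysadmincasts.com/episodes/...',
    'title': '...',
    'poster': '...',
    'sources': ['...'],  # direct video URLs, exactly what we feed to wget
    'html_content': '...',
    'text_content': '...',
    'meta': {'...': '...'},
}
```

The extraction script only needs the sources field: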
```python
# episodes2downloads.py
import json
from functools import reduce

j = json.load(open('episodes.json'))


def add_episode_sources(old, new):
    # fold one episode's video URLs into the accumulated set
    old |= set(new.get('sources', []))
    return old


sources = reduce(add_episode_sources, j, set())

with open('videos.txt', 'w') as f:
    f.write('\n'.join(sources))
```
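Strictly speaking, reduce is overkill here; if you prefer comprehensions, an equivalent one-liner would be:

```python
sources = {src for episode in j for src in episode.get('sources', [])}
```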
Save the file and run:
```bash
python3 episodes2downloads.py
wget -i videos.txt -o download.log -P ev
```
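Here -i reads the URL list from videos.txt, -o writes wget's log to download.log, and -P puts the downloads under the ev directory; add -c if you need to resume interrupted transfers.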
With that, all the video episodes end up in the ev directory. This is admittedly a rough-and-ready approach, but being able to watch the shows is enough; taking it further is left as an exercise.
Finally, thanks again to the author, Justin Weissig, for producing so many excellent episodes.