Quickly Scraping All Sysadmin Casts Episodes

This morning I got an email from Sysadmin Casts: the author's circumstances have changed, and all episodes are now free. Paying from here was inconvenient, so I never bought the PRO version and missed quite a few episodes, which was a pity. Now I can finally binge them all.

Hey,

https://sysadmincasts.com/ here.

My personal situation has changed. Site is back on a "free for all" model. All PRO episodes unlocked. Happy watching!! https://sysadmincasts.com/

Cheers,
Justin

--

You are receiving this email because you opted in at my website https://sysadmincasts.com/

While watching, though, it occurred to me: wouldn't it be a shame if the author shut the site down some day and the episodes disappeared? Better fall back on my trade and write a crawler to back them up. Of course, following my try-a-new-wheel-every-time principle, this time I experimented with kennethreitz's new project requests-html. No Scrapy, no requests + Beautiful Soup combo, and it still gets the job done. Not bad, right?

The scraping itself is simple: open "All Episodes", walk through the pagination to collect every episode's detail-page link, then fetch the content of each page.
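The pagination walk boils down to two sets: the pagination links discovered so far and the ones already visited; keep fetching until their difference is empty. Here is that bookkeeping as a tiny pure-Python sketch, where FAKE_SITE is a made-up in-memory link graph standing in for real HTTP requests:

```python
# Hypothetical link graph: each "page" lists its pagination links and episodes.
FAKE_SITE = {
    'page0': {'pages': {'page1', 'page2'}, 'episodes': {'ep1', 'ep2'}},
    'page1': {'pages': {'page0', 'page2'}, 'episodes': {'ep3'}},
    'page2': {'pages': {'page0', 'page1'}, 'episodes': {'ep4', 'ep5'}},
}

def crawl(start):
    page_links, urls, visited = set(), set(), set()
    while True:
        if not visited:
            url = start
        else:
            need_visit = page_links - visited
            if not need_visit:
                break            # every pagination page has been fetched
            url = need_visit.pop()
        visited.add(url)
        page = FAKE_SITE[url]
        page_links |= page['pages']   # discover more pagination links
        urls |= page['episodes']      # collect episode detail-page links
    return urls

print(sorted(crawl('page0')))  # → ['ep1', 'ep2', 'ep3', 'ep4', 'ep5']
```

The real script below does exactly this, except each "fetch" is an HTTP GET parsed with requests-html selectors.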

Without further ado, here's the code:

# sysadmin_casts.py
from requests_html import HTMLSession  # pip install requests-html
import json
import ipdb  # used to debug parse failures interactively
import logging
logger = logging.getLogger(__name__)


class SysAdminCasts(object):

    def __init__(self):
        self.session = HTMLSession()

    def run(self):
        self.page_links = set()
        self.urls = set()
        self.asset_urls = set()
        self.visited_links = set()
        self.episodes = []
        first_url = 'https://sysadmincasts.com/episodes?page=0'
        while True:
            if not self.visited_links:
                url = first_url
            else:
                need_visit = self.page_links - self.visited_links
                if not need_visit:
                    break
                url = need_visit.pop()
            logger.debug(f'download links from {url}')
            self.visited_links.add(url)
            response = self.session.get(url)
            html = response.html
            # Collect pagination links plus every episode's detail-page link.
            self.page_links |= {_.absolute_links.pop() for _ in html.find('a.page-link')}
            self.urls |= {_.absolute_links.pop() for _ in html.find('.card a.mb-1')}
        # Fetch every episode page (sorted for a stable order).
        for url in sorted(self.urls):
            self.download_episode(url)
        logger.debug('saving episodes')
        with open('episodes.json', 'w') as f:
            json.dump(self.episodes, f)
        for url in self.asset_urls:
            self.download_assets(url)

    def download_assets(self, url):
        # Left as a stub; the videos are fetched separately with wget below.
        pass

    def download_episode(self, url):
        logger.debug(f'download episode meta from {url}')
        post = {
            'url': url,
            'poster': '',
            'sources': [],
        }
        response = self.session.get(url)
        html = response.html
        try:
            post['title'] = html.find('h2.display-5')[0].text
            el_vjs = html.find('#my_video.video-js', first=True)
            if el_vjs:
                post['poster'] = el_vjs.attrs['poster']
                post['sources'] = [_.attrs['src'] for _ in el_vjs.find('source')]
            el_article = html.find('section.space-sm article', first=True)
            post['html_content'] = el_article.html
            post['text_content'] = el_article.text
            el_meta = html.find('.card ul.list-group-flush', first=True)
            post['meta'] = {_.find('div>div', first=True).text: _.find('div>span', first=True).text for _ in el_meta.find('li>div')}
        except Exception as e:
            logger.error(f"parse episode error {e}", exc_info=True)
            ipdb.set_trace()  # drop into the debugger to inspect the page
            return  # skip episodes that fail to parse
        self.episodes.append(post)
        self.asset_urls |= set(post['sources'])
        # Also queue any images embedded in the article body.
        self.asset_urls |= {_.attrs['src'] for _ in el_article.find('img')}


if __name__ == '__main__':
    logging.basicConfig(
        format='%(asctime)s - %(name)s:%(lineno)d - %(levelname)s - %(message)s',
        level=logging.DEBUG)
    spider = SysAdminCasts()
    spider.run()
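For reference, each entry the script appends to episodes.json has roughly this shape (all field values below are hypothetical placeholders, not real episode data):

```json
{
  "url": "https://sysadmincasts.com/episodes/...",
  "title": "...",
  "poster": "...",
  "sources": ["..."],
  "html_content": "<article>...</article>",
  "text_content": "...",
  "meta": {"...": "..."}
}
```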

Save the file, install the dependencies, and run:

pip install requests_html ipdb
python3 sysadmin_casts.py

If something unrelated goes wrong the run stops and you can debug at leisure; on parse errors it drops into ipdb so you can test the selectors interactively. When it finishes, an episodes.json file appears in the current directory containing the metadata for every episode. Of course what we really want are the videos, so hold on: let's write another short script to extract the download URLs and fetch them with wget:

# episodes2downloads.py
from functools import reduce
import json

# Load the metadata written by sysadmin_casts.py.
j = json.load(open('episodes.json'))


def add_episode_sources(old, new):
    old |= set(new.get('sources', []))
    return old

sources = reduce(add_episode_sources, j, set())
with open('videos.txt', 'w') as f:
    f.write("\n".join(sources))

Save the file and run:

python3 episodes2downloads.py
wget -i videos.txt -o download.log -P ev

That puts all the video episodes under the ev directory. This is admittedly a rough approach, but being able to watch the shows is enough; refine it yourself if you want more.
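For instance, the download_assets method left as a stub above could be filled in to fetch the posters and article images from Python instead of wget. A minimal sketch using plain requests streaming, where the function names and the assets/ directory are my own choices rather than part of the original script:

```python
import os
from urllib.parse import urlparse

import requests  # third-party: pip install requests

def asset_filename(url, dest='assets'):
    """Derive a local path under dest/ from the asset URL."""
    name = os.path.basename(urlparse(url).path) or 'index'
    return os.path.join(dest, name)

def download_asset(url, dest='assets'):
    """Stream one asset to disk so large files never sit in memory."""
    os.makedirs(dest, exist_ok=True)
    path = asset_filename(url, dest)
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 16):
                f.write(chunk)
    return path
```

Wiring download_asset into the spider's download_assets method would make the whole backup a single Python run, at the cost of losing wget's resume and logging conveniences.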

Finally, thanks again to the author, Justin Weissig, for producing so many excellent episodes.
