import requests
from bs4 import BeautifulSoup
# Output targets: a plain-text dump, one combined markdown file, and one
# markdown post per article rendered from a template.
text_file_name = "files/缠中说禅.txt"
md_file_name = "files/缠中说禅.md"
markdown_template = "files/markdown/template.md"
markdown_post_dir = "files/markdown/posts/"
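# Added guard (not in the original script): the open() calls below fail if
# the files/ directories are missing, so create the tree up front.
import os

os.makedirs(markdown_post_dir, exist_ok=True)
os.makedirs(os.path.dirname(text_file_name), exist_ok=True)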
def write_to_md_file(context):
    # Append a chunk to the combined markdown dump.
    with open(md_file_name, "a+", encoding="utf-8") as f:
        f.write(context)
        f.flush()
def write_to_text(context):
    # Append a chunk to the plain-text dump.
    with open(text_file_name, "a+", encoding="utf-8") as f:
        f.write(context)
        f.flush()
def write_to_markdown_post(article_body_url, title_name, category, create_time, content_body):
    # Break the body into one sentence per line at the Chinese full stop,
    # and replace "/" in the title so it is a valid file name.
    content_body = '。\n'.join(content_body.split("。"))
    title_name = title_name.replace("/", "-")
    with open(markdown_template, encoding="utf-8") as f:
        context_template = f.read()
    full_content_body = context_template.format(
        title_name, category, category, title_name,
        title_name, create_time, article_body_url, content_body)
    with open(markdown_post_dir + title_name + ".md", "w", encoding="utf-8") as f:
        f.write(full_content_body)
        f.flush()
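# Added note (not in the original): template.md itself is not shown here, but
# the format() call above fills eight positional {} slots in this order:
# title, category, category, title, title, create_time, source URL, body.
# A plausible Hexo/Jekyll-style template would be:
#
#     ---
#     title: {}
#     categories: {}
#     tags: {}
#     ---
#     ## {}
#
#     > {}（{}）
#     > 原文: {}
#
#     {}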
def download_article_body(article_body_url):
    # Fetch one article page and pull out the title, timestamp, category,
    # and body text. (The "artical*" class names are the site's own.)
    r = requests.get(article_body_url)
    r.raise_for_status()
    soup = BeautifulSoup(r.content.decode('utf-8'), 'html.parser')
    title_name = soup.find(class_="articalTitle").find(class_="titName SG_txta").text.strip()
    create_time = soup.find(class_="articalTitle").find(class_="time SG_txtc").text.strip()
    category = "无分类"  # default: "uncategorized"
    if soup.find(class_="blog_class").find('a') is not None:
        category = soup.find(class_="blog_class").find('a').text.strip()
    content_body = soup.find(class_="articalContent").text.strip()
    # Write the article to the plain-text dump, the combined markdown file,
    # and its own per-post markdown file.
    full_context = "\n{} - {} - {} \n{} \n".format(title_name, category, create_time, content_body)
    write_to_text(full_context)
    content_body_md = '。\n'.join(content_body.split("。"))
    md_full_context = "\n## {} \n### {} \n### {} \n {}".format(title_name, category, create_time, content_body_md)
    write_to_md_file(md_full_context)
    write_to_markdown_post(article_body_url, title_name, category, create_time, content_body)
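# Optional addition (not in the original): pausing between requests keeps the
# scraper polite to the server. One way is a short sleep after each article
# download, e.g. at the end of download_article_body():
#
#     import time
#     time.sleep(1)  # throttle to roughly one request per second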
def worker(page_url):
    # Download every article linked from this list page, then recurse into
    # the next page until pagination runs out.
    print("Starting process:", page_url)
    r = requests.get(page_url)
    r.raise_for_status()
    soup = BeautifulSoup(r.content.decode('utf-8'), 'html.parser')
    for article_list in soup.find_all(class_="atc_title"):
        article_body_url = article_list.find('a', href=True).attrs['href']
        download_article_body(article_body_url)
    if soup.find(class_="SG_pgprev") is None:
        print("This is the first page.")
    if soup.find(class_="SG_pgnext") is not None:
        next_url = soup.find(class_="SG_pgnext").find('a', href=True).attrs['href']
        print("Fetch the next page:", next_url)
        worker(next_url)
    else:
        print("This is the last page.")
    print("End the worker:", page_url)
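# Alternative sketch (an addition, not part of the original script): the
# recursion in worker() grows one stack frame per list page and can hit
# Python's recursion limit on very long blogs. An equivalent loop:
def worker_iterative(start_url):
    page_url = start_url
    while page_url:
        r = requests.get(page_url)
        r.raise_for_status()
        soup = BeautifulSoup(r.content.decode('utf-8'), 'html.parser')
        for article_list in soup.find_all(class_="atc_title"):
            download_article_body(article_list.find('a', href=True).attrs['href'])
        next_tag = soup.find(class_="SG_pgnext")
        page_url = next_tag.find('a', href=True).attrs['href'] if next_tag else None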
if __name__ == '__main__':
    url = "http://blog.sina.com.cn/s/articlelist_1215172700_0_1.html"
    worker(url)
    print("All workers are done.")