lisy

RPA member No. 19
Function Coding Design • 3 replies • 729 views • 2019-09-06 00:09:43

Simple forum post title scraper

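The forum's search sometimes fails to return the post you are looking for: a post only shows up when the query matches its title exactly.
Keeping a local copy of the forum's post titles, refreshed manually whenever needed, makes it easy to browse the titles and then search for the target post.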
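
As a rough sketch of what searching such a local copy could look like: the snippet below assumes the titles are stored in a UTF-8 text file with one title per line. The function name `search_titles` and the file layout are my own assumptions for illustration, not part of the forum or of the crawler.

```python
def search_titles(path, keyword):
    """Return the saved titles that contain `keyword`, ignoring case.

    Assumes `path` is a UTF-8 text file with one title per line
    (a hypothetical layout chosen for this sketch).
    """
    needle = keyword.casefold()
    matches = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            title = line.strip()
            # Skip blank lines; keep any title that contains the keyword
            if title and needle in title.casefold():
                matches.append(title)
    return matches
```

For example, `search_titles("titles.txt", "excel")` would list every saved title containing "excel" in any letter case, which is the partial match the forum search does not provide. The file name `titles.txt` is only a placeholder.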

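The following simple crawler collects the titles of every post on the forum. The code is reusable: set the output path, run it, and all forum titles are saved locally: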

import requests
from lxml import etree


class RpaBbs:
    start_url = 'http://support.i-search.com.cn/recent?p=1'

    # Parse the post titles on one page and append them to a local file
    def parse_topic(self, url):
        try:
            response = requests.get(url, timeout=10)
            content = response.text
            selector = etree.HTML(content)
            topics = selector.xpath("//ul/li/h2/a/text()")
            if topics:
                # Change "path" to the location where you want to save the titles
                with open(r"path", mode="a", encoding="utf-8") as f:
                    for x in topics:
                        f.write(x.strip() + "\n")  # one title per line
                    print("done")
            else:
                print("nothing to write")
        except Exception as e:
            print(e)

    # Determine the number of pages on the forum
    def findMaxPage(self):
        response = requests.get(self.start_url, timeout=10)
        content = response.text
        selector = etree.HTML(content)
        # Take the text of the 10th pagination link as the last page number;
        # [:2] keeps only its first two characters, so more than 99 pages
        # would be truncated
        maxPage = selector.xpath('//*[@class="pagination"]/a[10]/text()')[0][:2]
        return int(maxPage)

    # Generate the URL of every page to crawl
    def generate_url(self):
        for x in range(1, self.findMaxPage() + 1):
            yield 'http://support.i-search.com.cn/recent?p={}'.format(x)


if __name__ == "__main__":
    aim = RpaBbs()
    for x in aim.generate_url():
        aim.parse_topic(x)
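
One thing to watch when refreshing the local copy by hand: the file is opened in append mode, so every rerun adds all titles again. A possible way around this, as a minimal sketch only, is to write just the titles that are not in the file yet. The helper `append_new_titles` is a hypothetical name and is not part of the code above; it assumes one title per line in a UTF-8 file.

```python
import os


def append_new_titles(path, titles):
    """Append only titles not already in the file; return how many were added.

    Hypothetical helper for this sketch. Assumes a UTF-8 text file with
    one title per line.
    """
    seen = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            seen = {line.strip() for line in f if line.strip()}

    added = 0
    with open(path, mode="a", encoding="utf-8") as f:
        for title in titles:
            title = title.strip()
            # Skip empty strings and anything already stored
            if title and title not in seen:
                f.write(title + "\n")
                seen.add(title)
                added += 1
    return added
```

Inside `parse_topic`, the write loop could then be swapped for a call such as `append_new_titles(r"path", topics)`. Note the trade-off: two different posts with identical titles would be stored only once.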

If you have further needs for scraping the forum, feel free to discuss them anytime.