Crawling Baidu Baike

Hopping deeper and deeper through linked pages sometimes raises an error, and the exact cause is unclear (the request may fail, or a page may be missing the expected tags).

  • Import the required packages
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random
  • Set up the URLs
base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
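The entry stored in `his` is a percent-encoded path; decoding it shows which page the crawl starts from. A quick check using only the standard library:

```python
from urllib.parse import unquote

# The seed path stored in his, percent-encoded UTF-8.
seed = "/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"

# Decoding reveals the human-readable entry name and its numeric disambiguation id.
print(unquote(seed))  # → /item/网络爬虫/5162711
```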
  • Parse with lxml; use find to pick out a match
url = base_url + his[-1]
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features="lxml")
print(soup.find('h1').get_text(), "url:", his[-1])  # the <h1> tag holds the entry title

  • Filter matching links with a regular expression
sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
if sub_urls:  # at least one matching link was found
    his.append(random.sample(sub_urls, 1)[0]['href'])  # follow one matching link at random
else:
    his.pop()  # dead end: back up to the previous page
print(his)
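The pattern only accepts hrefs made entirely of percent-encoded bytes after `/item/`. A small sanity check (the sample hrefs below are illustrative, not taken from a live page):

```python
import re

# Same pattern as above: one or more percent-encoded bytes after /item/,
# anchored at the end of the href.
pattern = re.compile("/item/(%.{2})+$")

# Purely percent-encoded entry paths match.
assert pattern.search("/item/%E4%B9%89%E9%A1%B9") is not None
# Paths with a trailing numeric disambiguation id do not (the $ anchor rejects them).
assert pattern.search("/item/%E7%88%AC%E8%99%AB/5162711") is None
# Plain-ASCII entry names do not match either.
assert pattern.search("/item/Python") is None
```

Note that the seed URL itself ends in a numeric id, so it would be filtered out by this pattern.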
  • DFS-style hopping between pages
base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

for i in range(20):
    url = base_url + his[-1]
    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), 'url:', his[-1])  # visit number, entry title, path

    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
    if sub_urls:
        his.append(random.sample(sub_urls, 1)[0]['href'])  # random.sample returns a one-element list; [0] unpacks it
    else:
        his.pop()  # dead end: back up to the previous page
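The errors mentioned at the top typically surface in this loop: `urlopen` can fail, a page without an `<h1>` makes `get_text()` blow up on `None`, and if `his` ever empties after a `pop`, `his[-1]` raises an `IndexError`. The stack logic itself can be exercised offline. Below is a sketch with a toy link graph standing in for live pages (`fake_graph` and `walk` are illustrative names, not part of the original code):

```python
import random

# Toy link graph standing in for live Baidu Baike pages: each path maps to
# the matching sub-urls found on that page (possibly none).
fake_graph = {
    "/item/A": ["/item/B", "/item/C"],
    "/item/B": [],              # dead end: forces a backtrack
    "/item/C": ["/item/A"],
}

def walk(graph, start, steps):
    """Random walk with backtracking, mirroring the his.append/his.pop logic above."""
    his = [start]
    for _ in range(steps):
        sub_urls = graph.get(his[-1], [])
        if sub_urls:
            his.append(random.sample(sub_urls, 1)[0])
        else:
            his.pop()   # back up to the previous page
        if not his:     # guard the original loop lacks: his[-1] would raise IndexError
            break
    return his

print(walk(fake_graph, "/item/A", 5))
```

Wrapping the real `urlopen` call in a `try`/`except` (popping the current page on failure) would make the live loop similarly robust.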

  • Crawl results
0 网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
1 用户标识 url: /item/%E7%94%A8%E6%88%B7%E6%A0%87%E8%AF%86
2 义项 url: /item/%E4%B9%89%E9%A1%B9
3 小提琴协奏曲 url: /item/%E5%B0%8F%E6%8F%90%E7%90%B4%E5%8D%8F%E5%A5%8F%E6%9B%B2
4 彼得·伊里奇·柴可夫斯基 url: /item/%E6%9F%B4%E7%A7%91%E5%A4%AB%E6%96%AF%E5%9F%BA
5 作曲家 url: /item/%E4%BD%9C%E6%9B%B2%E5%AE%B6
6 义项 url: /item/%E4%B9%89%E9%A1%B9
7 李健 url: /item/%E6%9D%8E%E5%81%A5
8 三月的一整月 url: /item/%E4%B8%89%E6%9C%88%E7%9A%84%E4%B8%80%E6%95%B4%E6%9C%88
9 快乐阳光 url: /item/%E5%BF%AB%E4%B9%90%E9%98%B3%E5%85%89