# Use regular expressions to match tags and extract information
import re

# find the page title
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])
# Page title is: Scraping tutorial 1 | 莫烦Python
# flags=re.DOTALL lets "." also match tabs and newlines
# without this flag, a paragraph that spans several lines cannot be matched
res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)  # re.DOTALL if multi line
print("\nPage paragraph is: ", res[0])
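The regex snippets above assume `html` already holds the page source. A minimal sketch of how it could be fetched, plus one more regex that pulls out every link; the double-quoted `href` pattern is an assumption about the page's markup:

from urllib.request import urlopen
import re

# fetch the same demo page used throughout this section
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')

# grab every href value; assumes the attributes are double-quoted
res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)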
from bs4 import BeautifulSoup
from urllib.request import urlopen
## if the page contains Chinese, apply decode()
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')
print(html)
This prints the page's html.
## parse the page with lxml
soup = BeautifulSoup(html, features='lxml')
## select tag h1
print(soup.h1)
## select tag p
print(soup.p)
## collect all <a> links
all_href = soup.find_all('a')
## turn them into a list of href values
all_href = [l['href'] for l in all_href]
print('\n', all_href)
from bs4 import BeautifulSoup
from urllib.request import urlopen
## if the page contains Chinese, apply decode()
html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
## use class to narrow the search
## find every <li> tag whose class contains "month"
month = soup.find_all('li', {"class": "month"})
for m in month:
    ## m is the full html tag
    ## m.get_text() is the text shown on the page
    print(m.get_text())
Output:
一月
二月
三月
四月
五月
## nested search: first find the parent <ul>, then search inside it
jan = soup.find('ul', {"class": 'jan'})
d_jan = jan.find_all('li')  ## use jan as a parent
for d in d_jan:
    print(d.get_text())
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
## if the page contains Chinese, apply decode()
html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
## find <img> tags whose src matches ".jpg"
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])
## find <a> links whose href starts with https://morvan
course_links = soup.find_all('a', {'href': re.compile('https://morvan.*')})
for link in course_links:
    print(link['href'])
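The download loop below relies on an `img_ul` collection of <ul> tags holding the image tags; it is never defined in these notes. A minimal sketch of how it might be built, where both the page URL and the class name "img_list" are placeholders, not values confirmed by the source:

from bs4 import BeautifulSoup
from urllib.request import urlopen

page_url = "https://example.com/gallery"        # placeholder: the listing page you want to crawl
html = urlopen(page_url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

# "img_list" is an assumed class name; adjust it to the real page's markup
img_ul = soup.find_all('ul', {"class": "img_list"})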
import os
import requests

# loop over the image lists and download every image
os.makedirs('./img', exist_ok=True)     # make sure the target folder exists
for ul in img_ul:
    imgs = ul.find_all('img')
    for img in imgs:
        url = img['src']
        r = requests.get(url, stream=True)
        image_name = url.split('/')[-1]
        with open('./img/%s' % image_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % image_name)
Speeding up the crawler
Speeding up with multiprocessing
import multiprocessing as mp
import time
from urllib.request import urlopen, urljoin
from bs4 import BeautifulSoup
import re
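The loop below calls crawl() and parse() and keeps base_url, unseen and seen around, none of which are defined in these notes. A minimal sketch, under the assumption that crawl() simply downloads a page and parse() pulls out the title, the internal links, and the canonical url (the <h1> and og:url lookups are assumptions about the pages being crawled):

base_url = "https://morvanzhou.github.io/"

def crawl(url):
    # download one page and return its html as text
    response = urlopen(url)
    time.sleep(0.1)                     # be polite: small delay between requests
    return response.read().decode()

def parse(html):
    # pull out the title, the internal links and the page's own url
    soup = BeautifulSoup(html, features='lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()                      # assumes every page has an <h1>
    page_urls = set([urljoin(base_url, u['href']) for u in urls])
    url = soup.find('meta', {'property': "og:url"})['content']      # assumes an og:url meta tag
    return title, page_urls, url

unseen = set([base_url, ])              # urls waiting to be crawled
seen = set()                            # urls already crawled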
# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False
while len(unseen) != 0:                 # still get some url to visit
    if restricted_crawl and len(seen) >= 20:
        break
    htmls = [crawl(url) for url in unseen]
    results = [parse(html) for html in htmls]
    seen.update(unseen)                 # mark the crawled urls as seen
    unseen.clear()                      # nothing left unseen
    for title, page_urls, url in results:
        unseen.update(page_urls - seen)     # get new urls to crawl
The distributed (multiprocessing pool) version
pool = mp.Pool(4)                       # a pool of 4 worker processes
while len(unseen) != 0:
    # htmls = [crawl(url) for url in unseen]  --->
    crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
    htmls = [j.get() for j in crawl_jobs]
    # results = [parse(html) for html in htmls]  --->
    parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
    results = [j.get() for j in parse_jobs]
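A practical note on running the pool version as a script, stated as an assumption rather than part of the original notes: with the spawn start method (Windows/macOS) the pool work has to live under a __main__ guard, and the pool should be shut down once the loop ends. A minimal sketch:

if __name__ == '__main__':
    pool = mp.Pool(4)
    # ... the while-loop above goes here ...
    pool.close()                        # stop accepting new tasks
    pool.join()                         # wait for the worker processes to finish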
import scrapy

class MofanSpider(scrapy.Spider):
    name = "mofan"
    start_urls = [
        'https://morvanzhou.github.io/',
    ]
    # unseen = set()
    # seen = set()      # the sets are no longer needed: Scrapy deduplicates urls automatically

    def parse(self, response):
        yield {     # return some results
            'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
            'url': response.url,
        }
        urls = response.css('a::attr(href)').re(r'^/.+?/$')    # find all sub urls
        for url in urls:
            yield response.follow(url, callback=self.parse)    # it will filter duplication automatically
        # the yields are handled asynchronously
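To actually run the spider from a plain script, Scrapy's CrawlerProcess can be used; a minimal sketch, assuming a recent Scrapy version where the FEEDS setting is available and that the class above lives in the same file:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {"results.json": {"format": "json"}},  # dump scraped items to results.json
})
process.crawl(MofanSpider)
process.start()                         # blocks until the crawl is finished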