
莫烦Python: Web Scraping

Content from: https://morvanzhou.github.io/

Introduction to Web Scraping

Understanding the structure of a web page

HTML, CSS, JS

from urllib.request import urlopen

# if the page contains Chinese, apply decode()
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
# read() returns the raw page source
# the page contains Chinese, so decode it as UTF-8
print(html)

The printed content:

<!DOCTYPE html>
<html lang="cn">
<head>
<meta charset="UTF-8">
<title>Scraping tutorial 1 | 莫烦Python</title>
<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
<h1>爬虫测试1</h1>
<p>
这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
</p>

</body>
</html>

# use regular expressions to match tags and filter out the information we need
import re
# find the page title
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])
# Page title is:  Scraping tutorial 1 | 莫烦Python
# flags=re.DOTALL lets "." also match newlines, so the pattern can span multiple lines
# without this flag the paragraph below would not be found
res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)  # re.DOTALL if multi line
print("\nPage paragraph is: ", res[0])
# Page paragraph is:
# 这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
# <a href="https://morvanzhou.github.io/tutorials/scraping">爬虫教程</a> 中的简单测试.

If you print(res) directly:

['\n\t\t这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>\n\t\t<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.\n\t']

A likely explanation: printing a list shows each element's repr(), so escape sequences like \n and \t are displayed literally instead of being rendered.
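A quick way to see this (my own illustration, not from the tutorial): printing a list shows each element's repr(), while printing the string itself renders the escape sequences.

s = "line1\n\tline2"
print([s])    # ['line1\n\tline2']  - list printing uses repr(), escapes stay literal
print(s)      # line1
              #     line2          - printing the string renders \n and \t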
Find all links

res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)
['https://morvanzhou.github.io/static/img/description/tab_icon.png', 'https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']

BeautifulSoup

English documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Parsing Web Pages with BeautifulSoup

Basics

Installation

pip install beautifulsoup4

Basic usage

from bs4 import BeautifulSoup
from urllib.request import urlopen

## if the page contains Chinese, apply decode()
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')
print(html)

This prints the page's HTML.

## parse the page with the lxml parser (the lxml package must be installed)
soup = BeautifulSoup(html, features='lxml')
## select the h1 tag
print(soup.h1)
## select the p tag
print(soup.p)
## find all <a> tags (the links)
all_href = soup.find_all('a')
## build a list of their href attributes
all_href = [l['href'] for l in all_href]
print('\n', all_href)

This prints a list:

['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']
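One caveat worth noting (my addition): l['href'] raises a KeyError if an <a> tag has no href attribute. Tag.get() returns None instead, so a slightly more defensive version could look like this:

## defensive variant: skip <a> tags that have no href attribute
all_href = [l.get('href') for l in soup.find_all('a')]
all_href = [h for h in all_href if h is not None]
print('\n', all_href)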

Parsing CSS

CSS classes


<!DOCTYPE html>
<html lang="cn">
<head>
<meta charset="UTF-8">
<title>爬虫练习 列表 class | 莫烦 Python</title>
<style>
.jan {    /* information for this class */
background-color: yellow;
}
.feb {
font-size: 25px;
}
.month {
color: red;
}
</style>
</head>

<body>

<h1>列表 爬虫练习</h1>

<p>这是一个在 <a href="https://morvanzhou.github.io/" >莫烦 Python</a><a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/" >爬虫教程</a>
里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.</p>

<ul>
<li class="month">一月</li>
<ul class="jan">
<li>一月一号</li>
<li>一月二号</li>
<li>一月三号</li>
</ul>
<li class="feb month">二月</li>
<li class="month">三月</li>
<li class="month">四月</li>
<li class="month">五月</li>
</ul>

</body>
</html>
from bs4 import BeautifulSoup
from urllib.request import urlopen

## if the page contains Chinese, apply decode()
html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

## use class to narrow the search:
## <li> tags whose class attribute contains "month"
month = soup.find_all('li', {"class": "month"})
for m in month:
    ## m is the full tag (its HTML); m.get_text() is the text shown on the page
    print(m.get_text())

Output:

一月
二月
三月
四月
五月
## nested search
jan = soup.find('ul', {"class": 'jan'})
d_jan = jan.find_all('li')    ## use jan as a parent
for d in d_jan:
    print(d.get_text())
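The same nested lookup can also be written with a CSS selector through soup.select; this is standard BeautifulSoup, shown only as an alternative to find/find_all:

## equivalent nested search with a CSS selector: <li> items inside <ul class="jan">
for d in soup.select("ul.jan li"):
    print(d.get_text())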

Regular Expressions

Regular-expression tutorial: https://morvanzhou.github.io/tutorials/python-basic/basic/13-10-regular-expression/

Find all links that point to images

<td>
<img src="https://morvanzhou.github.io/static/img/course_cover/tf.jpg">
</td>

Pick out the .jpg images

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

## if the page contains Chinese, apply decode()
html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

## <img> tags whose src attribute matches .jpg
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])

## find the links that start with https://morvan
course_links = soup.find_all('a', {'href': re.compile('https://morvan.*')})
for link in course_links:
    print(link['href'])

Crawling Baidu Baike

We will crawl this page: https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711

Starting from this page, crawl it and follow links to other entries

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random


## Baidu Baike
base_url = "https://baike.baidu.com"
## the "网络爬虫" (web crawler) entry
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

## index -1 selects the last element of the list
url = base_url + his[-1]

html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
## find the h1 (the entry title)
print(soup.find('h1').get_text(), ' url: ', his[-1])
## target="_blank" means the link opens in a new tab

## find valid urls
sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

if len(sub_urls) != 0:
    his.append(random.sample(sub_urls, 1)[0]['href'])
else:
    ## no valid sub link found
    his.pop()
print(his)

Crawl multiple pages in a loop

his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

for i in range(20):
    url = base_url + his[-1]

    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), ' url: ', his[-1])

    ## find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        ## no valid sub link found
        his.pop()
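For a longer crawl it is worth being gentler with the server and guarding against failed requests. A minimal sketch of such a loop (my addition; the 0.5 s pause is an arbitrary choice):

import time
from urllib.error import HTTPError, URLError

for i in range(20):
    url = base_url + his[-1]
    try:
        html = urlopen(url).read().decode('utf-8')
    except (HTTPError, URLError) as err:
        print('failed to fetch', url, err)
        break
    time.sleep(0.5)    # be polite: pause between requests
    # ... same parsing and link-following logic as above ...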

More Ways to Download / Make Requests

The versatile Requests library

Install the requests module

pip install requests

requests get

import requests
import webbrowser
# ?wd=莫烦Python
param = {"wd": "莫烦Python"}  # the search query
r = requests.get('http://www.baidu.com/s', params=param)
print(r.url)
# http://www.baidu.com/s?wd=%E8%8E%AB%E7%83%A6Python
# open the resulting URL in a browser
webbrowser.open(r.url)
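As a small aside (not in the tutorial), the Response object also exposes the status code and the returned page, which is handy for a quick sanity check:

print(r.status_code)    # 200 means the request succeeded
print(r.text[:200])     # the first 200 characters of the returned HTML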

requests post

http://pythonscraping.com/pages/files/form.html


data = {'firstname': '莫烦', 'lastname': '周'}
r = requests.post('http://pythonscraping.com/files/processing.php', data=data)
print(r.text)

# Hello there, 莫烦 周!

Uploading a photo

http://pythonscraping.com/files/form2.html

file = {'uploadFile': open('./image.png', 'rb')}
r = requests.post('http://pythonscraping.com/files/processing2.php', files=file)
print(r.text)

# The file image.png has been uploaded.

Logging in

payload = {'username': 'Morvan', 'password': 'password'}
r = requests.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
print(r.cookies.get_dict())

# {'username': 'Morvan', 'loggedin': '1'}


r = requests.get('http://pythonscraping.com/pages/cookies/profile.php', cookies=r.cookies)
print(r.text)

# Hey Morvan! Looks like you're still logged into the site!

Logging in with a Session

session = requests.Session()
payload = {'username': 'Morvan', 'password': 'password'}
r = session.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
print(r.cookies.get_dict())

# {'username': 'Morvan', 'loggedin': '1'}


r = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(r.text)

# Hey Morvan! Looks like you're still logged into the site!

Downloading Files

Use the browser's inspector to find the file's URL

https://morvanzhou.github.io/static/img/description/learning_step_flowchart.png

# create a folder for the downloaded images
import os
os.makedirs('./img/', exist_ok=True)

IMAGE_URL = "https://morvanzhou.github.io/static/img/description/learning_step_flowchart.png"

urlretrieve

from urllib.request import urlretrieve
urlretrieve(IMAGE_URL, './img/image1.png')

requests

import requests
r = requests.get(IMAGE_URL)
# write the downloaded bytes to a file
with open('./img/image2.png', 'wb') as f:
    f.write(r.content)

If the file is large, the streaming method below is a better fit:

r = requests.get(IMAGE_URL, stream=True)    # stream loading

with open('./img/image3.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=32):
        f.write(chunk)
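A minor variation, assuming nothing beyond the standard requests API: a larger chunk size reduces per-chunk overhead for big files, and using the response as a context manager releases the connection when the download is done.

with requests.get(IMAGE_URL, stream=True) as r:
    with open('./img/image3.png', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):    # larger chunks for big files
            f.write(chunk)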

Downloading National Geographic images

http://www.ngchina.com.cn/animals/

Find where the images live in the page

å°ç»ƒä¹ : 下载美图
from bs4 import BeautifulSoup
import requests

URL = "http://www.nationalgeographic.com.cn/animals/"

html = requests.get(URL).text
soup = BeautifulSoup(html, 'lxml')
img_ul = soup.find_all('ul', {"class": "img_list"})

# loop over the lists and download every image
for ul in img_ul:
    imgs = ul.find_all('img')
    for img in imgs:
        url = img['src']
        r = requests.get(url, stream=True)
        image_name = url.split('/')[-1]
        with open('./img/%s' % image_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % image_name)
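Two practical notes on this snippet: the img_list class is an assumption about the page as it looked when the tutorial was written, so the live site may have changed its markup, and the ./img/ folder must exist before writing into it:

import os
os.makedirs('./img/', exist_ok=True)    # create the output folder if it is missing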

Speeding Up the Crawler

Speeding up with multiprocessing

import multiprocessing as mp
import time
from urllib.request import urlopen, urljoin
from bs4 import BeautifulSoup
import re

# base_url = "http://127.0.0.1:4000/"    # local mirror
base_url = 'https://morvanzhou.github.io/'


def crawl(url):
    response = urlopen(url)
    # time.sleep(0.1)    # slight delay while downloading
    return response.read().decode()


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])    # deduplicate
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


unseen = set([base_url,])    # urls still to crawl (a set deduplicates)
seen = set()                 # urls already crawled

Plain (sequential) approach

# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False

while len(unseen) != 0:                     # still have urls to visit
    if restricted_crawl and len(seen) >= 20:
        break
    htmls = [crawl(url) for url in unseen]
    results = [parse(html) for html in htmls]

    seen.update(unseen)                     # mark the crawled urls as seen
    unseen.clear()                          # nothing left unseen

    for title, page_urls, url in results:
        unseen.update(page_urls - seen)     # collect new urls to crawl

Multiprocessing (pool) approach

pool = mp.Pool(4)    # process pool with 4 workers
while len(unseen) != 0:
    # htmls = [crawl(url) for url in unseen]
    # --->
    crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
    htmls = [j.get() for j in crawl_jobs]

    # results = [parse(html) for html in htmls]
    # --->
    parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
    results = [j.get() for j in parse_jobs]

    # the seen/unseen bookkeeping stays the same as in the sequential loop above
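One caveat worth adding (my note, not from the tutorial): on platforms that start processes by spawning (Windows, and macOS on recent Python versions), multiprocessing code has to be guarded by an if __name__ == '__main__': block, and the pool should be released when the crawl is done.

if __name__ == '__main__':
    pool = mp.Pool(4)
    # ... run the crawl loop above ...
    pool.close()    # stop accepting new jobs
    pool.join()     # wait for the workers to exit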

Asynchronous Loading with Asyncio

Asyncio runs asynchronous tasks in a single thread: downloading a page and processing it are decoupled, so the time otherwise spent waiting on downloads can be used for other work.

The code below requires Python 3.5+.

Asyncio example

# the non-async version, for comparison
import time


def job(t):
    print('Start job ', t)
    time.sleep(t)    # wait for "t" seconds
    print('Job ', t, ' takes ', t, ' s')


def main():
    [job(t) for t in range(1, 3)]


t1 = time.time()
main()
print("NO async total time : ", time.time() - t1)

"""
Start job 1
Job 1 takes 1 s
Start job 2
Job 2 takes 2 s
NO async total time : 3.008603096008301
"""
import asyncio


async def job(t):    # an async coroutine
    print('Start job ', t)
    await asyncio.sleep(t)    # wait "t" seconds, switching to other tasks in the meantime
    print('Job ', t, ' takes ', t, ' s')


async def main(loop):    # an async coroutine
    tasks = [
        loop.create_task(job(t)) for t in range(1, 3)
    ]    # create the tasks, but do not run them yet
    await asyncio.wait(tasks)    # run the tasks and wait until all of them finish

t1 = time.time()
loop = asyncio.get_event_loop()        # get an event loop
loop.run_until_complete(main(loop))    # run the loop until main() completes
loop.close()                           # close the loop
print("Async total time : ", time.time() - t1)

"""
Start job 1
Start job 2
Job 1 takes 1 s
Job 2 takes 2 s
Async total time : 2.001495838165283
"""

An Asyncio crawler (aiohttp)

import requests
import time

URL = 'https://morvanzhou.github.io/'


def normal():
    for i in range(2):
        r = requests.get(URL)
        url = r.url
        print(url)


t1 = time.time()
normal()
print("Normal total time:", time.time()-t1)

"""
https://morvanzhou.github.io/
https://morvanzhou.github.io/
Normal total time: 0.3869960308074951
"""
import aiohttp
import time
import asyncio

URL = 'https://morvanzhou.github.io/'


async def job(session):
    response = await session.get(URL)    # while waiting for the response, other tasks can run
    return str(response.url)


async def main(loop):
    async with aiohttp.ClientSession() as session:    # the officially recommended way to create a session
        tasks = [loop.create_task(job(session)) for _ in range(2)]    # create the tasks, but do not run them yet
        finished, unfinished = await asyncio.wait(tasks)    # run and wait until all tasks finish
        all_results = [r.result() for r in finished]    # collect all the results
        print(all_results)


t1 = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
loop.close()
print("Async total time:", time.time() - t1)

"""
['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/']
Async total time: 0.11447715759277344
"""

Comparison with multithreading

[Figure from the original tutorial: speeding up crawlers with asynchronous loading (Asyncio)]

Selenium

The Firefox extension Katalon Recorder can record your actions in the browser and generate the corresponding Python code.

For the detailed steps, see https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-01-selenium/#%E5%81%B7%E6%87%92%E7%9A%84%E7%81%AB%E7%8B%90%E6%B5%8F%E8%A7%88%E5%99%A8%E6%8F%92%E4%BB%B6

Install a browser driver and put it in the expected location (see the blog post 《爬取机场数据》 for reference).

from selenium import webdriver

driver = webdriver.Chrome()    # open a Chrome browser

# paste the code generated by Katalon Recorder here
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()

# get the page html; you can also take a screenshot
html = driver.page_source    # get html
driver.get_screenshot_as_file("./img/screenshot1.png")
driver.close()

# run without popping up a browser window (headless mode)
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")    # define headless

driver = webdriver.Chrome(chrome_options=chrome_options)
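A hedged side note, not from the tutorial: Selenium 4 removed the find_element_by_* helpers used above; on newer versions the equivalent calls go through the By locator class, roughly like this:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://morvanzhou.github.io/")
driver.find_element(By.LINK_TEXT, "About").click()    # replaces find_element_by_link_text
html = driver.page_source
driver.close()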

Selenium documentation: https://selenium-python.readthedocs.io/

Scrapy

Official docs: https://docs.scrapy.org/en/latest/

A Scrapy getting-started guide: https://www.jianshu.com/p/a8aad3bf4dc4

Further exploration: https://blog.csdn.net/u012150179/article/details/32343635

import scrapy


class MofanSpider(scrapy.Spider):
    name = "mofan"
    start_urls = [
        'https://morvanzhou.github.io/',
    ]
    # unseen = set()
    # seen = set()    # no sets needed any more: Scrapy deduplicates requests automatically


class MofanSpider(scrapy.Spider):
    ...
    def parse(self, response):
        yield {    # return some results
            'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
            'url': response.url,
        }

        urls = response.css('a::attr(href)').re(r'^/.+?/$')    # find all sub urls
        for url in urls:
            yield response.follow(url, callback=self.parse)    # duplicate requests are filtered automatically
        # the yields make parse() a generator, which Scrapy consumes asynchronously
# yield 异步处理

In the terminal:

scrapy runspider 5-2-scrapy.py -o res.json
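The spider can also be started from a normal Python script instead of the command line. A rough sketch using Scrapy's CrawlerProcess (this assumes the MofanSpider class above is in the same file, and the FEEDS setting assumes Scrapy 2.1+; it is my adaptation rather than part of the tutorial):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {"res.json": {"format": "json"}},    # write the scraped items to res.json
})
process.crawl(MofanSpider)    # the spider class defined above
process.start()               # blocks until the crawl is finished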
Author: Michelle19l
Link: https://gitee.com/michelle19l/michelle19l/2020/07/14/python/爬虫/爬虫简介/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.