北大青鸟: The Python Web Scraping Notes You Need


Sometimes our crawler needs to scrape not just text but images as well. Let's look at how to save the scraped images to the local disk.

Let's first pick a target, the Zhihu question: "How can girls work out to build a good figure?" (chosen purely because it has lots of pictures, don't read too much into it (# _ # ))

Inspect the page's source code and find the format of the image links under the question:

As you can see, the images sit in img tags with class="origin_image zh-lightbox-thumb", and their links end in .jpg, so we can combine Beautiful Soup with a regular expression to extract all of the links, like this:

links = soup.find_all('img', "origin_image zh-lightbox-thumb", src=re.compile(r'.jpg$'))
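
To see how this class-plus-regex filter behaves on its own, here is a minimal, self-contained sketch; the HTML fragment is invented for illustration, and only the class string and the .jpg pattern mirror the real page:

from bs4 import BeautifulSoup
import re

# Hypothetical fragment imitating the structure described above
html = '''
<img class="origin_image zh-lightbox-thumb" src="https://example.com/a.jpg">
<img class="origin_image zh-lightbox-thumb" src="https://example.com/b.png">
<img class="avatar" src="https://example.com/c.jpg">
'''
soup = BeautifulSoup(html, 'html.parser')

# A bare string in the second position is treated as a class filter;
# src=re.compile(...) additionally keeps only links ending in .jpg
links = soup.find_all('img', "origin_image zh-lightbox-thumb", src=re.compile(r'.jpg$'))
for link in links:
    print(link.attrs['src'])  # only https://example.com/a.jpg passes both filters

find_all accepts a plain string in that position as a shortcut for the class_ keyword, which is why no attribute name appears there.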

Once all the links are extracted, use request.urlretrieve from urllib to save each one to a local file.

The Python documentation describes urlretrieve as follows:

Copy a network object denoted by a URL to a local file. If the URL points to a local file, the object will not be copied unless filename is supplied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object). Exceptions are the same as for urlopen().

The full implementation is as follows:

# -*- coding:utf-8 -*-
import time
from urllib import request
from bs4 import BeautifulSoup
import re

url = r'https://www.zhihu.com/question/22918070'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

page = request.Request(url, headers=headers)
page_info = request.urlopen(page).read().decode('utf-8')
soup = BeautifulSoup(page_info, 'html.parser')

# Combine Beautiful Soup with a regular expression to extract all image links
# (img tags with class="origin_image zh-lightbox-thumb" whose src ends in .jpg)
links = soup.find_all('img', "origin_image zh-lightbox-thumb", src=re.compile(r'.jpg$'))

# Set a save path, otherwise the files end up in the program's current directory
local_path = r'E:\Pic'
for link in links:
    print(link.attrs['src'])
    # Save each link to a file; time.time() in the name prevents collisions
    request.urlretrieve(link.attrs['src'], local_path + r'\%s.jpg' % time.time())
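
In practice the loop above can fail if E:\Pic does not exist or if a single link times out. Here is a slightly hardened variant of the download step, offered as a sketch rather than as part of the original article; the save_images helper and its default path are illustrative assumptions:

import os
import time
from urllib import request, error

# Hypothetical helper, not from the original article
def save_images(links, save_dir=r'E:\Pic'):
    # Create the target directory if it does not exist yet
    os.makedirs(save_dir, exist_ok=True)
    for link in links:
        src = link.attrs['src']
        # time.time() keeps generated file names unique
        filename = os.path.join(save_dir, '%s.jpg' % time.time())
        try:
            request.urlretrieve(src, filename)
        except error.URLError as e:
            # Report and skip broken links instead of aborting the whole run
            print('download failed: %s (%s)' % (src, e))

Note that the Python 3 documentation lists urlretrieve among the legacy interfaces that may become deprecated, so for new code a streaming download via request.urlopen (or a third-party library such as requests) is worth considering.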