Life is short, I use Python ----- scraping images
Life is short, and I'm learning Python!
I've been looking around at new opportunities lately, and a lot of the job descriptions ask for some Python and shell scripting, so I've been studying them in my spare time. I'm just getting started and am still a rookie, but I can already write one or two small crawlers, heh heh.
Let me recommend the site I used to teach myself: Liao Xuefeng's blog. The explanations are very simple, and good things are meant to be shared. My first language is Java, and after learning this bit of Python I really do think the saying "Life is short, I use Python!" is spot on.
Most programmers are lazy, and Python makes you even lazier: a lot of things are already packaged up, and importing a single package is enough to use them directly, so easy! In this post I'll share a small image-scraping program I wrote myself. It's pretty rough, and the naming is quite different from Java conventions, so please go easy on me.
# -*- coding: UTF-8 -*-
import requests, os, time, random
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

"""
Demo: scrape images from
http://www.shuaia.net/
"""

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
}
params = {"tagname": "美女"}  # tag to crawl, passed as a query parameter


def get_pageurl(j, target_urls):
    # Fetch page j of the tag listing and collect "title=detail-url" pairs.
    url = r"http://www.shuaia.net/e/tags/index.php?page=%d&line=25&tempid=3" % (j)
    response = requests.get(url=url, headers=headers, params=params)
    if response.status_code != 200:
        return None
    print(response.url)
    response.encoding = "utf-8"
    soup = BeautifulSoup(response.text, "lxml")
    find_all = soup.find_all(class_="item-img")
    for item in find_all:
        target_urls.append(item.img.get("alt") + "=" + item.get("href"))
    return target_urls


if __name__ == "__main__":
    j = 0  # page counter for the tag listing
    while True:
        target_urls = []
        target_urls = get_pageurl(j, target_urls)
        if target_urls is None:
            continue
        print(target_urls)
        j = j + 1
        for item in target_urls:
            detail = item.split("=")
            fileName = detail[0]
            print(fileName)
            file_name = fileName + ".jpg"
            if fileName not in os.listdir():
                os.makedirs(fileName)  # one directory per gallery
            fileUrl = detail[1]
            print("Downloading >>>> " + fileName)
            # First page of the gallery: grab the image inside the content div.
            response_img = requests.get(fileUrl)
            response_img.encoding = "utf-8"
            html = response_img.text
            img_html = BeautifulSoup(html, "lxml")
            html_find = img_html.find_all("div", class_="wr-single-content-list")
            img_bf_2 = BeautifulSoup(str(html_find), "lxml")
            img_url = "http://www.shuaia.net" + img_bf_2.div.img.get("src")
            urlretrieve(url=img_url, filename=fileName + "/" + file_name)
            print(img_url)
            time.sleep(random.randint(0, 5))
            # Follow the remaining gallery pages: xxx_2.html, xxx_3.html, ...
            fileUrl = fileUrl[0:len(fileUrl) - 5]  # strip the trailing ".html"
            i = 1
            while True:
                url_end = "_" + str(i + 1) + ".html"
                crl_file_url = fileUrl + url_end
                crl_response_img = requests.get(crl_file_url)
                if crl_response_img.status_code != 200:
                    break  # no more pages in this gallery
                crl_response_img.encoding = "utf-8"
                crl_html = crl_response_img.text
                crl_img_html = BeautifulSoup(crl_html, "lxml")
                crl_html_find_1 = crl_img_html.find_all("div", class_="wr-single-content-list")
                crl_img_bf_2_1 = BeautifulSoup(str(crl_html_find_1), "lxml")
                crl_img_url = "http://www.shuaia.net" + crl_img_bf_2_1.div.img.get("src")
                urlretrieve(url=crl_img_url, filename=fileName + "/" + fileName + str(i + 1) + ".jpg")
                i = i + 1
                time.sleep(random.randint(0, 5))
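
One thing worth noting: urlretrieve comes from urllib and knows nothing about the headers dict used elsewhere in the script, so it sends the default Python-urllib User-Agent. If the site ever starts rejecting that, the images can be fetched with requests instead. Below is a minimal sketch of that idea; the download_image helper and the example URL are my own illustration and not part of the original script.

# Minimal sketch (not the original script): download one image with requests
# so the same User-Agent header is sent; names and the URL are illustrative.
import os
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
}

def download_image(img_url, save_dir, file_name):
    # Create the target directory if it does not exist yet.
    os.makedirs(save_dir, exist_ok=True)
    # Stream the response so a large image is not held in memory all at once.
    resp = requests.get(img_url, headers=headers, stream=True, timeout=10)
    if resp.status_code != 200:
        return False
    with open(os.path.join(save_dir, file_name), "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return True

if __name__ == "__main__":
    # Hypothetical image URL, only to show how the helper would be called.
    ok = download_image("http://www.shuaia.net/some/path/pic.jpg", "demo", "demo.jpg")
    print("saved" if ok else "failed")

The same helper could replace both urlretrieve calls in the script above without changing the parsing logic.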