Introduction to web crawling
-Fetching data (requests, urllib, urllib2 to simulate sending requests) (selenium module: simulates and drives browser behavior) (see the sketch right after this list)
-Parsing data: BeautifulSoup to parse HTML (CSS selectors, XPath selectors), the re module
-Storing data: files, Excel, MySQL, Redis, MongoDB
-Data analysis
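The first three steps in miniature (a sketch only, assuming beautifulsoup4 is installed; the URL and the link extraction are placeholders, not taken from the examples below):
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.example.com')    #fetch the page
soup = BeautifulSoup(response.text, 'html.parser')    #parse the HTML
links = [a.get('href') for a in soup.find_all('a')]   #select every <a> tag and pull out its href
with open('links.txt', 'w', encoding='utf-8') as f:   #store the results in a file
    f.write('\n'.join(str(link) for link in links))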
Introduction to the requests module
Crawling short videos
Differences between GET and POST requests:
GET requests:
1.GET is HTTP's default request method
2.There is no request body
3.The data must stay within about 1K (in practice the URL-length limit varies by browser)
4.GET request data is exposed in the browser's address bar
Common operations that use GET requests:
1.Entering a URL directly in the browser's address bar is always a GET request
2.Clicking a hyperlink on a page is also always a GET request
3.Submitting a form uses GET by default, but it can be set to POST
POST requests:
1.The data does not appear in the address bar
2.There is no upper limit on the data size
3.There is a request body
4.Chinese characters in the request body are URL-encoded
requests.post() is used exactly like requests.get(); the one special thing is that requests.post() has a data parameter that holds the request body (see the sketch below)
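A sketch of the difference in requests (httpbin.org is a public echo service, used here only for illustration):
import requests

#GET: params are appended to the query string and are visible in the URL
r_get = requests.get('https://httpbin.org/get', params={'wd': 'linux'})
print(r_get.url)    #.../get?wd=linux

#POST: data travels in the request body, not in the URL
r_post = requests.post('https://httpbin.org/post', data={'user': 'test'})
print(r_post.status_code)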
Values commonly passed in headers (a combined sketch follows):
User-Agent:
Host:
Referer: the address of the previous request, e.g.
Referer:https://www.lagou.com/gongsi
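For instance, all three headers can be sent on one request (a sketch; the Host and Referer values here are illustrative assumptions):
import requests

response = requests.get(
    'https://www.lagou.com/gongsi',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
        'Host': 'www.lagou.com',                #normally filled in automatically from the URL
        'Referer': 'https://www.lagou.com/',    #pretend we arrived from the site's front page
    },
)
print(response.status_code)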
#Install requests: pip3 install requests, or install it directly in PyCharm
Overview of the requests module
-GET requests
-passing query parameters: params
-passing headers: headers
-passing cookies: cookies
Covered below:
【1】.Basic request
【2】.Usually you need to send a User-Agent header so the request looks like it came from a browser
【3】.GET request with parameters -> cookies
【4】.Single-threaded crawl of short videos
【5】.Multithreaded download
import requests
#【1】.Basic request
response=requests.get('https://www.baidu.com')
#A successful request returns status code 200
if response.status_code==200:
    #response body as a string
    print(response.text)
    #response body as bytes
    #response.content
    with open('test.html','wb') as f:
        f.write(response.content)
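A quick note on text vs content: response.text decodes the raw bytes using the encoding requests guessed from the response headers; if a Chinese page prints garbled, re-decoding with the encoding requests detects from the body usually helps (a sketch using the standard encoding/apparent_encoding attributes):
response=requests.get('https://www.baidu.com')
#switch .text to the encoding detected from the body instead of the headers
response.encoding=response.apparent_encoding
print(response.text[:200])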
#【2】.Usually you need to send a User-Agent header so the request looks like it came from a browser
search=input('Enter the search term: ')
#Building 's?wd=%s' by hand is not recommended because of encoding problems
# url='https://www.baidu.com/s?wd=%s' %search
url='https://www.baidu.com/s'
print(url)
response=requests.get(
    url=url,
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    },
    #requests appends these parameters to the URL's query string automatically
    params={'wd':search},
    #cookies are special and needed so often that they get their own parameter
    cookies={},
)
#A successful request returns status code 200
if response.status_code==200:
    #response body as a string
    print(response.text)
    #response body as bytes
    #response.content
    with open('search.html','wb') as f:
        f.write(response.content)
# 【3】.GET request with parameters -> cookies
response = requests.get(
    url='https://github.com/settings/emails',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    },
    # session cookie copied from a logged-in browser
    cookies={'user_session': 'qnmaFqNVfNiQWl_vufpxn63he5EMhCzFSJrkIqmUQItDjLU1'}
)
if '200890836@qq.com' in response.text:
    print('Logged in')
else:
    print('Cannot see this page')
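As an aside, instead of copying the cookie into every call, a requests.Session keeps cookies across requests automatically (a sketch of the standard Session API; the login flow itself is not shown, since GitHub also requires a CSRF token):
session = requests.Session()
#seed the session with the same cookie as above; normally a session.post() login would set it
session.cookies.set('user_session', 'qnmaFqNVfNiQWl_vufpxn63he5EMhCzFSJrkIqmUQItDjLU1')
response = session.get('https://github.com/settings/emails')
print(response.status_code)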
# 【4】.Single-threaded crawl of short videos
import re
import time
# https://www.pearvideo.com/category_1
def get_page(url):
    try:
        response = requests.get(
            url=url,
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
            },
        )
        if response.status_code == 200:
            return response.text
    except Exception as e:
        print(e)
def parse_index(index_detail):
    # re.S lets . match newlines; the first .*? skips ahead lazily, the second captures the link
    res_list = re.findall('<li class="categoryem">.*?<a href="(.*?)"', index_detail, re.S)
    for i in res_list:
        yield 'https://www.pearvideo.com/' + i
def parse_detail(detail):
    # print(detail)
    download_url=re.findall('srcUrl="(.*?)"',detail,re.S)[0]
    print(download_url)
    return download_url
def download_movie(url):
    response=requests.get(url)
    #assumes the movie/ directory already exists
    with open('movie/%s.mp4'%str(time.time()),'wb') as f:
        f.write(response.content)
if __name__ == '__main__':
    #loop over 5 pages of videos
    for i in range(5):
        #each listing page returns 12 entries, so start advances by 12
        start_url='https://www.pearvideo.com/category_loading.jsp?reqType=6&categoryId=1&start=%s'%str((i+1)*12)
        index_detail = get_page(start_url)
        urls = parse_index(index_detail)
        for url in urls:
            detail = get_page(url)
            down=parse_detail(detail)
            download_movie(down)
# 【5】.Multithreaded download
import re
import time
# https://www.pearvideo.com/category_1
#download with a thread pool
from concurrent.futures import ThreadPoolExecutor
#create a pool of 50 threads
pool=ThreadPoolExecutor(50)
def get_page(url):
    try:
        response = requests.get(
            url=url,
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
            },
        )
        if response.status_code == 200:
            return response.text
    except Exception as e:
        print(e)
def parse_index(index_detail):
    # re.S lets . match newlines; the first .*? skips ahead lazily, the second captures the link
    res_list = re.findall('<li class="categoryem">.*?<a href="(.*?)"', index_detail, re.S)
    for i in res_list:
        yield 'https://www.pearvideo.com/' + i
def parse_detail(detail):
    # print(detail)
    download_url=re.findall('srcUrl="(.*?)"',detail,re.S)[0]
    print(download_url)
    return download_url
def download_movie(url):
    response=requests.get(url)
    #assumes the movie/ directory already exists
    with open('movie/%s.mp4'%str(time.time()),'wb') as f:
        f.write(response.content)
#add_done_callback passes the finished Future as the argument
def callBack(future):
    print('Download finished')
if __name__ == '__main__':
    #loop over 5 pages of videos
    for i in range(5):
        #each listing page returns 12 entries, so start advances by 12
        start_url='https://www.pearvideo.com/category_loading.jsp?reqType=6&categoryId=1&start=%s'%str((i+1)*12)
        index_detail = get_page(start_url)
        urls = parse_index(index_detail)
        for url in urls:
            detail = get_page(url)
            down=parse_detail(detail)
            #when a worker thread finishes, the callback is invoked with its Future
            pool.submit(download_movie,down).add_done_callback(callBack)
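One thing worth adding after the loop: an explicit pool.shutdown(wait=True) makes the main thread block until every submitted download has finished (standard concurrent.futures API):
    #still inside the __main__ block, after the for-loop
    pool.shutdown(wait=True)
    print('All downloads finished')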