Frank's Learning Journey

Day_24_crawler_summary

 

Introduction to crawlers

        -Scraping data (requests, urllib, urllib2 simulate sending requests) (selenium module: simulates and controls browser behavior)

        -Parsing data: BeautifulSoup for parsing HTML (CSS selectors; XPath is available via lxml), and the re module (see the sketch just below)

        -Storing data: files, Excel, MySQL, Redis, MongoDB

        -Data analysis
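
As a quick sketch of the parsing step, here is a minimal BeautifulSoup example using CSS selectors (the HTML snippet and selector below are made up for illustration):

from bs4 import BeautifulSoup

html = '<div class="news"><a href="/a/1">First</a><a href="/a/2">Second</a></div>'
soup = BeautifulSoup(html, 'html.parser')
# CSS selector: every <a> inside an element with class "news"
for link in soup.select('.news a'):
    print(link.get('href'), link.text)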

Introduction to requests

Scraping short videos

 

Differences between GET and POST requests:

        GET requests:

        1. GET is HTTP's default request method

        2. There is no request body

        3. The data is limited to roughly 1KB (URL length limits)

        4. The data of a GET request is exposed in the browser's address bar

        Common operations that produce GET requests:

        1. Typing a URL directly into the browser's address bar always sends a GET request

        2. Clicking a hyperlink on a page also always sends a GET request

        3. Submitting a form uses GET by default, but the form can be set to use POST

 

        POST requests:

        1. The data does not appear in the address bar

        2. There is no upper limit on the data size

        3. There is a request body

        4. Chinese characters in the request body are URL-encoded

requests.post() is used exactly like requests.get(); the one difference is that requests.post() takes a data parameter, which holds the request-body data.
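
A minimal sketch of a POST request (the URL and form fields below are hypothetical, just to show where data goes):

import requests

# hypothetical login endpoint and form fields, for illustration only
response = requests.post(
    'https://example.com/login',
    data={'username': 'frank', 'password': '123'},  # sent in the request body, not the URL
)
print(response.status_code)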

 

Values that usually need to be passed in headers:

User-Agent: identifies the client browser

Host: the host being requested

Referer: the address of the previous request, e.g.

Referer: https://www.lagou.com/gongsi
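
These headers are passed to requests through the headers parameter; a sketch reusing the Referer above:

import requests

response = requests.get(
    'https://www.lagou.com/gongsi',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',  # pretend to be a browser
        'Referer': 'https://www.lagou.com/',  # claim the request came from the site itself
    },
)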

 

# Install requests: pip3 install requests, or install it directly from PyCharm

 

Introduction to the requests module

        -GET requests

               -carrying parameters: params

               -carrying headers: headers

               -carrying cookies: cookies

Topics covered:

1. Basic request

2. Usually a User-Agent header is needed, to make the request look like it was sent by a browser

3. GET request with parameters -> cookies

4. Single-threaded scraping of short videos

5. Multi-threaded download

import requests

# 1. Basic request
response = requests.get('https://www.baidu.com')
# a status code of 200 means the request succeeded
if response.status_code == 200:
    # response body as a string
    print(response.text)
    # response.content holds the same body as bytes
    with open('test.html', 'wb') as f:
        f.write(response.content)
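
One gotcha with response.text: requests decodes the bytes using a charset guessed from the response headers, which is sometimes wrong and produces garbled text. Overriding response.encoding with response.apparent_encoding (guessed from the body itself) usually fixes it; a minimal sketch:

response = requests.get('https://www.baidu.com')
# re-guess the charset from the body itself before decoding
response.encoding = response.apparent_encoding
print(response.text[:200])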

 

# 2. Usually a User-Agent header is needed, to make the request look like it was sent by a browser
search = input('Enter what you want to search for: ')
# 's?wd=%s' is not recommended, because it runs into encoding problems
# url = 'https://www.baidu.com/s?wd=%s' % search
url = 'https://www.baidu.com/s'
print(url)
response = requests.get(
    url=url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    },
    # requests automatically appends these parameters to the URL, properly encoded
    params={'wd': search},
    # cookies are special and needed so often that they get their own parameter
    cookies={},
)
# a status code of 200 means the request succeeded
if response.status_code == 200:
    # response body as a string
    print(response.text)
    # response.content holds the same body as bytes
    with open('search.html', 'wb') as f:
        f.write(response.content)
 

 

# 3. GET request with parameters -> cookies
response = requests.get(
    url='https://github.com/settings/emails',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    },
    # the user_session cookie, copied from a logged-in browser session
    cookies={'user_session': 'qnmaFqNVfNiQWl_vufpxn63he5EMhCzFSJrkIqmUQItDjLU1'},
)
if '200890836@qq.com' in response.text:
    print('Logged in')
else:
    print('Cannot see this page')
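
Instead of copying cookies by hand like this, requests.Session can keep them automatically across requests. A minimal sketch (the login endpoint and form fields are hypothetical):

import requests

session = requests.Session()
# a successful login response sets cookies on the session object...
session.post('https://example.com/login', data={'user': 'frank', 'pwd': '123'})  # hypothetical endpoint
# ...and the session sends them automatically on every later request
response = session.get('https://example.com/settings')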

 

# 4. Single-threaded scraping of short videos
import re
import time

# index page: https://www.pearvideo.com/category_1
def get_page(url):
    try:
        response = requests.get(
            url=url,
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
            },
        )
        if response.status_code == 200:
            return response.text
    except Exception as e:
        print(e)

def parse_index(index_detail):
    # re.S lets . match newlines too; the first .*? skips ahead, the (.*?) group captures the link
    res_list = re.findall('<li class="categoryem">.*?<a href="(.*?)"', index_detail, re.S)
    for i in res_list:
        yield 'https://www.pearvideo.com/' + i

def parse_detail(detail):
    # print(detail)
    download_url = re.findall('srcUrl="(.*?)"', detail, re.S)[0]
    print(download_url)
    return download_url

def download_movie(url):
    response = requests.get(url)
    print(response.content)
    # note: the movie/ directory must already exist
    with open('movie/%s.mp4' % str(time.time()), 'wb') as f:
        f.write(response.content)

if __name__ == '__main__':
    # loop over 5 index pages of videos
    for i in range(5):
        # each index page lists 12 videos, so advance the start offset by 12 per page
        start_url = 'https://www.pearvideo.com/category_loading.jsp?reqType=6&categoryId=1&start=%s' % str((i + 1) * 12)
        index_detail = get_page(start_url)
        urls = parse_index(index_detail)
        for url in urls:
            detail = get_page(url)
            down = parse_detail(detail)
            download_movie(down)
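
download_movie above loads the whole video into memory via response.content before writing it. For large files, requests can stream the body in chunks with stream=True and iter_content; a sketch of the same function in streaming form:

def download_movie_streamed(url):
    # stream=True defers downloading the body until we iterate over it
    response = requests.get(url, stream=True)
    with open('movie/%s.mp4' % str(time.time()), 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024 * 64):  # 64KB chunks
            f.write(chunk)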

 

# 5. Multi-threaded download
import re
import time
# index page: https://www.pearvideo.com/category_1
# download with a thread pool
from concurrent.futures import ThreadPoolExecutor

# create a pool with 50 threads
pool = ThreadPoolExecutor(50)

def get_page(url):
    try:
        response = requests.get(
            url=url,
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
            },
        )
        if response.status_code == 200:
            return response.text
    except Exception as e:
        print(e)


def parse_index(index_detail):
    # re.S lets . match newlines too; the (.*?) group captures the link
    res_list = re.findall('<li class="categoryem">.*?<a href="(.*?)"', index_detail, re.S)
    for i in res_list:
        yield 'https://www.pearvideo.com/' + i


def parse_detail(detail):
    # print(detail)
    download_url = re.findall('srcUrl="(.*?)"', detail, re.S)[0]
    print(download_url)
    return download_url

def download_movie(url):
    response = requests.get(url)
    print(response.content)
    # note: the movie/ directory must already exist
    with open('movie/%s.mp4' % str(time.time()), 'wb') as f:
        f.write(response.content)

def callBack(future):
    # add_done_callback passes the finished Future object to the callback
    print('Download finished')

if __name__ == '__main__':
    # loop over 5 index pages of videos
    for i in range(5):
        # each index page lists 12 videos, so advance the start offset by 12 per page
        start_url = 'https://www.pearvideo.com/category_loading.jsp?reqType=6&categoryId=1&start=%s' % str((i + 1) * 12)
        index_detail = get_page(start_url)
        urls = parse_index(index_detail)
        for url in urls:
            detail = get_page(url)
            down = parse_detail(detail)
            # when a task finishes, the pool invokes the given callback
            pool.submit(download_movie, down).add_done_callback(callBack)
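
To block explicitly until every submitted download has finished (rather than relying on interpreter shutdown to join the pool's threads), shutdown can be called at the end of the __main__ block:

    # wait for all submitted tasks to complete
    pool.shutdown(wait=True)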

 
