A Simple Web Scraper Example: Weather

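I. Design Task

Goal: design a data-scraping program in Python that meets the following basic requirements:

The scraping task is self-chosen, for example e-commerce transaction data, customer reviews, news, or images.
The scraped data is stored in a data file or in a sqlite database.
The program is suitably commented and comes with complete documentation.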
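
The program in this article writes its results to a plain file. For the sqlite option named in the requirements, a minimal sketch could look like the following. The table name `forecast`, its columns, and the sample rows are assumptions made for illustration and do not come from the original program.

```python
import sqlite3

def save_forecast(conn, rows):
    """Store (date, weather, high, low) tuples in a hypothetical 'forecast' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS forecast ("
        "date TEXT, weather TEXT, high INTEGER, low INTEGER)"
    )
    # Parameterized placeholders let sqlite3 handle quoting and None (NULL) values.
    conn.executemany("INSERT INTO forecast VALUES (?, ?, ?, ?)", rows)
    conn.commit()

# Illustrative sample rows, not real scraped data.
sample_rows = [
    ("Day 1", "Sunny", 25, 12),
    ("Day 2", "Cloudy", None, 10),  # the daily high may be missing
]

conn = sqlite3.connect(":memory:")  # pass a file path such as "weather.db" to persist
save_forecast(conn, sample_rows)
stored = conn.execute(
    "SELECT date, weather, high, low FROM forecast ORDER BY rowid"
).fetchall()
print(stored)
```

Using an in-memory database keeps the example free of side effects; switching to a file path is the only change needed to keep the data between runs.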

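II. Data Source

All data scraped by this program comes from the city home page of China Weather (中国天气网): the 72-hour forecast (date, weather conditions, temperature, and air quality) and the real-time conditions at a given moment. The URL is: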

http://www.weather.com.cn/weather1d/101280101.shtml#dingzhi_first

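Open the URL and search for Gansu - Jiuquan - Jiuquan to bring up the forecast page for Jiuquan.

My plan is to scrape from that page the 72-hour forecast for Jiuquan (date, weather conditions, temperature, and air quality) and the real-time conditions at a given moment.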
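
Both weather.com.cn URLs that appear in this article follow the same pattern: a page type (`weather1d` or `weather`) followed by a numeric city code. A small helper, sketched here under the assumption that city codes are always nine digits as in those two URLs, makes it easy to switch cities. The function name `build_url` is hypothetical.

```python
import re

BASE = "http://www.weather.com.cn/{page}/{code}.shtml"

def build_url(city_code, page="weather"):
    """Build a forecast URL from a city code (assumed to be nine digits)."""
    if not re.fullmatch(r"\d{9}", city_code):
        raise ValueError("city code should be nine digits: %r" % city_code)
    if page not in ("weather", "weather1d"):
        raise ValueError("unknown page type: %r" % page)
    return BASE.format(page=page, code=city_code)

print(build_url("101280101", page="weather1d"))
```

Validating the code before building the URL turns a typo into an immediate error instead of a confusing HTTP failure later.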

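III. Scraping Tools and Environment Setup

Python environment: install Python; this example uses Python 3.9.

Libraries used: urllib.request, csv, and BeautifulSoup.

BeautifulSoup is not part of the standard library and has to be installed separately (for example with pip install beautifulsoup4). It is an HTML parsing library that supports several parsers, of which two are the most common: the html.parser that ships with the Python standard library, and the lxml HTML parser, which must be installed as well. Both are used in the same way:

from bs4 import BeautifulSoup

# Parser from the Python standard library
BeautifulSoup(html, 'html.parser')

# lxml
BeautifulSoup(html, 'lxml')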
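
Because lxml is an optional extra, one possible approach is to check whether it is importable and fall back to the built-in parser otherwise. This is a minimal sketch; the helper name `pick_parser` is hypothetical, and the returned string is meant to be passed as the second argument of BeautifulSoup.

```python
import importlib.util

def pick_parser():
    """Return 'lxml' when the lxml package is installed, else 'html.parser'."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"

parser_name = pick_parser()
print(parser_name)
# Intended use (requires bs4): BeautifulSoup(html, parser_name)
```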

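IV. Analysis

1. View the page source

The head of the page source is shown below. The key step in the analysis is to locate the markup that corresponds to the information we want to scrape.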

<!DOCTYPE html>
	<html>
	<head>
	<link rel="dns-prefetch" href="http://i.tq121.com.cn">
	<meta charset="utf-8" />
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
	<title>酒泉天气预报,酒泉7天天气预报,酒泉15天天气预报,酒泉天气查询 - 中国天气网</title>
	<meta http-equiv="Content-Language" content="zh-cn">
	<meta name="keywords" content="酒泉天气预报,jqtq,酒泉今日天气,酒泉周末天气,酒泉一周天气预报,酒泉15日天气预报,酒泉40日天气预报" />
	<meta name="description" content="酒泉天气预报，及时准确发布中央气象台天气信息，便捷查询北京今日天气，酒泉周末天气，酒泉一周天气预报，酒泉15日天气预报，酒泉40日天气预报，酒泉天气预报还提供酒泉各区县的生活指数、健康指数、交通指数、旅游指数，及时发布酒泉气象预警信号、各类气象资讯。" />
	<!-- 城市对比上线
	<link type="text/css" rel="stylesheet" href="http://c.i8tq.com/cityListCmp/cityListCmp.css?20191230" />
	<link type="text/css" rel="stylesheet" href="http://c.i8tq.com/cityListCmp/weathers.css?20191230" /> -->
	<style>

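Further down in the page source, each forecast entry carries three fields, wea, tem, and win, each written in its own p tag. They are not wrapped in a dedicated parent tag of their own, so each one can be selected directly.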
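
To illustrate the idea without any third-party dependency, the sketch below uses the standard library's html.parser (not BeautifulSoup, which the article itself uses) to pull the text out of p tags whose class is wea, tem, or win. The HTML fragment is invented for the example and is far simpler than the real page; the class name `FieldParser` is hypothetical.

```python
from html.parser import HTMLParser

class FieldParser(HTMLParser):
    """Collect the text inside <p class="wea|tem|win"> elements."""
    FIELDS = ("wea", "tem", "win")

    def __init__(self):
        super().__init__()
        self.current = None   # class of the <p> we are inside, if any
        self.found = {}       # field name -> collected text

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "p" and cls in self.FIELDS:
            self.current = cls
            self.found[cls] = ""

    def handle_endtag(self, tag):
        if tag == "p":
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.found[self.current] += data.strip()

# Invented fragment for illustration only.
sample = (
    '<li><h1>Day 1</h1>'
    '<p class="wea">Sunny</p>'
    '<p class="tem"><span>25</span>/<i>12℃</i></p>'
    '<p class="win"><i>Level 3</i></p></li>'
)

parser = FieldParser()
parser.feed(sample)
parser.close()
fields = parser.found
print(fields)
```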

2. Writing the scraper

(1) Import the required packages

import csv
import urllib.request
from bs4 import BeautifulSoup

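(2) Fetch the page while imitating a browser

# The number in the URL identifies the city; change it to fetch another city
url = "http://www.weather.com.cn/weather/101270101.shtml"
header = ("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36") # browser-like request header
opener = urllib.request.build_opener()  # create an opener
opener.addheaders = [header]            # attach the header to the opener
request = urllib.request.Request(url)   # build the request
response = opener.open(request)         # send it through the opener so the header is actually used
html = response.read()                  # read the response body (bytes)
html = html.decode('utf-8')             # decode the bytes as UTF-8; otherwise the Chinese text is garbled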
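
An alternative to configuring an opener is to pass the headers straight to urllib.request.Request. The sketch below does that and also decodes the body defensively. The function names `make_request`, `decode_body`, and `fetch_html`, the shortened User-Agent string, and the 10-second timeout are assumptions for illustration; `fetch_html` is defined but not called here because it needs network access.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # shortened, illustrative value

def make_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def decode_body(raw, encoding="utf-8"):
    """Decode response bytes, replacing undecodable bytes instead of raising."""
    return raw.decode(encoding, errors="replace")

def fetch_html(url, timeout=10):
    """Download a page and return its text (requires network access)."""
    with urllib.request.urlopen(make_request(url), timeout=timeout) as response:
        return decode_body(response.read())

req = make_request("http://www.weather.com.cn/weather/101270101.shtml")
print(req.get_header("User-agent"))   # urllib stores header names capitalized this way
print(decode_body("天气".encode("utf-8")))
```

Setting a timeout is worth the extra argument: without one, a stalled connection can block the scraper indefinitely.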

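(3) Locate the part to scrape

Find the part of the page that holds the information we need: the date, the weather, and the temperature.

# Code for this step:
final = []  # empty list that will hold the final data
bs = BeautifulSoup(html, "html.parser")  # create the BeautifulSoup object
body = bs.body  # get the body element
data = body.find('div', {'id': '7d'})  # find the div whose id is 7d

Looking further down, all the information we need sits inside a ul tag, so we look for that next.

ul = data.find('ul')  # there is only one ul, so find() is enough; use find_all() when there are several

The information is inside the li tags within that ul, and there is more than one of them, so we use find_all().

li = ul.find_all('li')  # get all li tags; the result behaves like a list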

(4) Extract the data from the located part

All the data is collected in a list first and written to the file at the end.

The date is in the h1 tag inside each li tag.

The weather is in the first p tag inside the li tag.

The temperature is in the second p tag: the high in its span tag and the low in its i tag.

i = 0
for day in li:  # iterate over each li tag
    if i < 7:
        temp = []
        date = day.find('h1').string  # the date
        temp.append(date)
        inf = day.find_all('p')  # all p tags in this li
        temp.append(inf[0].string)  # first p tag: the weather conditions
        if inf[1].find('span') is None:
            temperature_highest = None  # by evening the page may no longer show today's high, so only the low is available
        else:
            temperature_highest = inf[1].find('span').string  # the high
            temperature_highest = temperature_highest.replace('℃', '')  # at night the high may also carry a ℃ suffix
        temperature_lowest = inf[1].find('i').string  # the low
        temperature_lowest = temperature_lowest.replace('℃', '')  # strip the ℃ suffix from the low
        temp.append(temperature_highest)
        temp.append(temperature_lowest)
        final.append(temp)  # append this day's row to the final list
        i = i + 1
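
The rows built by this loop keep temperatures as strings. If numeric values are wanted, for instance for sorting or averaging, a small converter along these lines could be applied to both fields. The helper name `parse_temp` is hypothetical, and the sketch assumes the text is an optional minus sign and digits followed by an optional ℃.

```python
import re

def parse_temp(text):
    """Convert text such as '25℃' or '-3' to an int; return None when absent or unparseable."""
    if text is None:
        return None
    match = re.fullmatch(r"\s*(-?\d+)\s*℃?\s*", text)
    return int(match.group(1)) if match else None

print(parse_temp("25℃"), parse_temp("-3℃"), parse_temp(None))
```

Returning None for anything unexpected keeps the missing-high case and malformed text on the same code path.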

(5) Write to a file

with open('天气.txt', 'a', encoding='utf-8', errors='ignore', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerows(final)
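
Since the rows are written in CSV format, adding a header row makes the file self-describing. The sketch below writes to an in-memory buffer so that it can run anywhere; the column names, the helper name `write_rows`, and the sample rows are assumptions, not output of the real scraper.

```python
import csv
import io

HEADER = ["date", "weather", "high", "low"]  # hypothetical column names

def write_rows(handle, rows):
    """Write a header line followed by the forecast rows in CSV format."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    writer.writerows(rows)  # None values are written as empty fields

sample = [["Day 1", "Sunny", "25", "12"], ["Day 2", "Cloudy", None, "10"]]

buffer = io.StringIO()
write_rows(buffer, sample)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same function can be given a file opened with newline='' and an explicit encoding.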

V. Results

1. Screenshot of the source code

2. Screenshot of the program running

3. Screenshot of the stored data file

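VI. Complete Code

Note that the complete program uses the requests library and the mobile site m.tianqi.com, rather than urllib and weather.com.cn as in the walkthrough in section IV.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import time

def getINFO(city='jiuquan'):
    url = 'https://m.tianqi.com/{}/'.format(city)
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.64 Safari/537.36'}
    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()  # raise an exception on HTTP error status codes
    r.encoding = r.apparent_encoding  # use the encoding detected from the content
    html = r.text
    soup = BeautifulSoup(html,'html.parser')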

    # Current location
    getLocation = soup.find('h2').text
    print(getLocation)
    # Update time
    getUpdatetime = soup.find(id='nowHour').text
    getUpdatetime = 'Updated at ' + getUpdatetime
    print(getUpdatetime)
    # Current temperature
    getWeather_now = soup.find(class_='now').text
    getWeather_now = 'Current temperature ' + getWeather_now
    print(getWeather_now)
    # Today's weather
    getWeather = soup.find('dd', class_='txt').text
    getWeather = "Today's weather " + getWeather
    print(getWeather)
    # Today's air quality
    getAir = soup.find(class_='b1').text
    getAir = 'Air quality ' + getAir
    print(getAir)
    # Current humidity
    getWet = soup.find(class_='b2').text
    print(getWet)
    # Current wind
    getWind = soup.find(class_='b3').text
    print(getWind)
    print('\n' + '-'*10 + '\n')
    # Combine the individual readings into one text
    weather_info = getLocation + '\n' + getUpdatetime + '\n' + getWeather_now + '\n' + getWeather + '\n' + getAir + '\n' + getWet + '\n' + getWind



    Temperature = ''
    # Forecast for the coming days
    getTemperature = soup.find(class_='weather_week')
    # Keep the next 3 days, tidy the format, and merge them into one text
    for i in getTemperature.find_all('dl')[:3]:
        i = i.text
        li = i.split('\n')
        # The fixed indices below depend on the current page layout
        li[7] = 'Air quality ' + li[7]
        li = li[2:8]
        li = ' '.join(li)
        Temperature = Temperature + li + '\n'

    print(Temperature)

    result_info = weather_info + '\n' + '-'*40 + '\n' + Temperature
    return result_info



# Save to a local file
def saveFile(text):
    with open("./天气.txt", "w", encoding='utf-8') as f:
        f.write(text)

if __name__ == "__main__":
    while True:
        city = input("Enter the city name in pinyin (e.g. jiuquan for Jiuquan): ")
        result_info = getINFO(city)
        saveFile(result_info)
        answer = input("Query another city? y/n ")
        if answer == "y" or answer == "Y":
            continue
        else:
            break

# One-hour countdown (runs once, after the query loop above has exited)
for second in range(3600,-1,-1):
    time.sleep(1)
    print('Weather update countdown: ' + "%02d:%02d"%(second // 60,second % 60), end='\r\b')
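
The countdown at the end of the listing only prints the remaining time; it does not fetch the weather again when it reaches zero. One way to get a periodic refresh is sketched below. The names `format_countdown` and `run_periodically` are hypothetical, and the task and sleep functions are passed in as parameters so the loop can be tried without network access or real waiting.

```python
import time

def format_countdown(seconds):
    """Format a number of seconds as MM:SS, e.g. 3600 -> '60:00'."""
    return "%02d:%02d" % (seconds // 60, seconds % 60)

def run_periodically(task, interval, rounds, sleep=time.sleep):
    """Call task() 'rounds' times, waiting 'interval' seconds between calls."""
    results = []
    for n in range(rounds):
        results.append(task())
        if n < rounds - 1:  # no need to wait after the last round
            for remaining in range(interval, 0, -1):
                print("Next update in " + format_countdown(remaining), end="\r")
                sleep(1)
    return results

# Demonstration with a stand-in task and a no-op sleep.
calls = []

def fake_fetch():
    calls.append(len(calls) + 1)
    return "report %d" % len(calls)

demo = run_periodically(fake_fetch, interval=2, rounds=3, sleep=lambda s: None)
print()
print(demo)
```

In the real program the task could be something like lambda: saveFile(getINFO('jiuquan')) with interval=3600, using the functions defined in the listing.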