Scrape historical weather data with requests, BeautifulSoup, pandas and SQLAlchemy

Reading Time: 10 minutes

Hello everyone, it’s been a while. Today, let’s talk about the Earth!

As you may know, many highest-temperature-ever records were broken this summer. Another iceberg is floating off and melting away. And the Amazon is burning red. Climate change is definitely not a goddamn joke. It breathes fire!

I have been thinking seriously about climate change for a while now (yeah, just thinking!). And in order to actually do something to bring about changes, it’s necessary to understand what is happening out there.

This year’s July was ridiculously cold compared to normal, wasn’t it?

I kept asking this question without any clues to help prove that my feeling was right.

Well, there is only one way to find out. We will have to look at the boring numbers!

So, we have walked through the problem. Let’s talk about the HOW.

Methods for getting weather data

Weather data, like data in any other category, can be obtained in many ways. One can think of any of the following:

  • Download CSV files directly from some source

This may be the solution that we want because CSV data is usually cleaned and properly shaped. Unfortunately, that kind of data is rarely available.

  • Use one of the weather data APIs available publicly

Another option is to use some public API. Hmm, that is a great idea except for one fact: historical weather data is not provided for free!

  • Freely roam over the Internet and grab what we need!

The final option is to collect the data on our own, which is usually referred to as “web scraping”. In the world we live in today, data is available all over the Internet, so basically we can get pretty much any piece of information that we need.

Of course, it requires some additional skills and knowledge to get there. Stay calm, it won’t be so hard.

So we are now clear on the methodology. It’s time to get our hands dirty.

Objectives

So, WHAT comes next? We will utilize Python’s power in order to:

  • Grab the weather data of Kanagawa prefecture of Japan
  • Store the data with SQLite
  • Do some analysis to answer that silly question of mine: was this year’s July colder than it should have been?

The mission is also clear. Let’s talk about the WHERE.

The data source

We need to know the place from which we will get the data. Because I’m currently in Japan, I will use the well-known tenki.jp. After a while, I figured out that the URLs for historical weather data for Kanagawa prefecture look like this:

https://tenki.jp/past/<year>/<month>/<day>/amedas/3/17/

You can do the same to find the URLs for your place. Weather websites store their historical data somewhere; we just need to figure out where. It might take a couple of minutes though 😉
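
If you want to sanity-check a URL pattern you have guessed, a quick sketch like the one below (using the requests package we are about to install) does the trick: request a date that should exist and look at the status code. The candidate URL here is just the Kanagawa pattern above; swap in your own.

import requests

# Hypothetical sanity check: a date that exists should return HTTP 200
candidate_url = 'https://tenki.jp/past/2015/01/01/amedas/3/17/'
response = requests.get(candidate_url)
print(response.status_code)  # 200 means the URL pattern is probably right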

Dependency installation

Before we start coding, we need to install some necessary packages. You may already know most of them.

pip install numpy pandas matplotlib seaborn bs4 requests sqlalchemy lxml

Some look new? Don’t worry, we will see why we need them very soon. Okay, let’s code!

Implementation

1. Grab & convert data from a website into pandas DataFrame

Let’s tackle Objective No. 1: get the weather data from the net. Concretely, the steps that we’re gonna take are as follows (with the necessary package for each step):

  • Make an HTTP GET request to the page and grab its raw HTML (requests)
  • Parse the HTML and locate the table that contains the weather data (bs4)
  • Convert that table into a DataFrame (pandas)

The route is clear. Let’s first import all the packages that we have installed above:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
import requests
from datetime import datetime, timedelta

Next, we need to know the page’s address, which is also called URL. Except for the date information, the URLs look exactly the same. Let’s create a template for them:

url_template = 'https://tenki.jp/past/{}/{:02d}/{:02d}/amedas/3/17/'

Let’s open the link and see what the page looks like. For example, here’s the page for January 1st, 2015 and the part of the page that we want to extract information from:

In order to get the information, we need to understand the structure of the page. Simply speaking, we want to look at the HTML source and figure out the tag name, its class, its index, anything that distinguishes our region of interest from the rest of the page.

In my case, the information lies in a table (most of the time it’s a table), which has a class named amedas-point-detail-entries-table.

We are done inspecting the target page. Let’s write some Python code. First, we will use the requests library to make an HTTP GET request; the text attribute of the response object holds the HTML code of the whole page as a string.

url = url_template.format(2015, 1, 1)  # URL to get the data for 2015/1/1
html_doc = requests.get(url).text

The HTML code stored inside html_doc is not very helpful on its own. We need to turn that raw HTML into something that we can interact with, and it’s time for BeautifulSoup to shine. Want to get some?

soup = BeautifulSoup(html_doc, 'lxml')  # use the lxml parser we installed earlier

BeautifulSoup helps us re-format the raw HTML code so that it’s easier to look at. You can try it out by printing out soup.prettify().
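
For example, a quick peek at just the beginning keeps the output manageable:

# Print only the first 1000 characters of the prettified HTML
print(soup.prettify()[:1000])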

More than just prettifying, BeautifulSoup turns the raw HTML into an object we can interact with, whose tags and attributes (like class) we can access. Below is how we’re gonna get the table with the historical weather data we discussed above:

# tag.get('class', []) avoids a KeyError on tags that have no class attribute
table = soup.find(lambda tag: tag.name == 'table' and 'amedas-point-detail-entries-table' in tag.get('class', []))

Now we have the table element we were after. The next step is simple: we convert it to a string and pass it to pandas’ read_html function to obtain a DataFrame (read_html returns a list of DataFrames, so we take the first one)!

df = pd.read_html(str(table), na_values=['---'], header=0)[0]

Seems like we need to get rid of the last row and rename the columns. Finally, let’s add a date column to keep the DateTime information.

df.drop(df.shape[0] - 1, inplace=True)
df.columns = ['city', 'max_temp', 'min_temp', 'rain', 'wind', 'daylight']
df['date'] = pd.to_datetime('{}/{}/{}'.format(2015, 1, 1))

Let’s print out df. We now have the data for January 1st, 2015!

Since we will repeat the whole process to get all the data from January 1st, 2015, we’d better make a function for that. Let’s call it get_weather_for_one_day.

def get_weather_for_one_day(year, month, day):
    url = url_template.format(year, month, day)
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, 'lxml')
    table = soup.find(lambda tag: tag.name == 'table' and 'amedas-point-detail-entries-table' in tag.get('class', []))

    try:
        df = pd.read_html(str(table), na_values=['---'], header=0)[0]
    except Exception as e:
        print(e)
        raise ValueError('No data is available for {}/{:02d}/{:02d}'.format(year, month, day))
    df.drop(df.shape[0] - 1, inplace=True)
    df.columns = ['city', 'max_temp', 'min_temp', 'rain', 'wind', 'daylight']
    df['date'] = pd.to_datetime('{}/{}/{}'.format(year, month, day))
    
    return df

Next, let’s create another function that downloads the weather data from a specific date up to today and concatenates the daily DataFrames into one large DataFrame.

def get_weather_from_date(year, month, day):
    date = datetime(year, month, day)
    today = datetime.utcnow()
    df = []
    while (today - date).days >= 1:
        try:
            df.append(get_weather_for_one_day(date.year, date.month, date.day))
        except ValueError as e:
            print(e)
        finally:
            date += timedelta(days=1)
    return pd.concat(df)

Depending on which date you want to start from, it might take a while for the data to be fully downloaded. Feel free to alter the code above to split the download into smaller chunks.
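
For instance, here is one possible sketch that downloads one calendar year at a time and caches each year to a CSV file, so an interrupted run can resume where it left off. It only relies on the get_weather_for_one_day function above; the cache file names are arbitrary.

def get_weather_for_year(year):
    # Collect the daily DataFrames for a single calendar year (stopping before today)
    date = datetime(year, 1, 1)
    today = datetime.utcnow()
    end = min(datetime(year + 1, 1, 1), datetime(today.year, today.month, today.day))
    frames = []
    while date < end:
        try:
            frames.append(get_weather_for_one_day(date.year, date.month, date.day))
        except ValueError as e:
            print(e)
        date += timedelta(days=1)
    return pd.concat(frames)

dfs = []
for year in range(2015, 2020):
    cache_file = 'weather_{}.csv'.format(year)  # arbitrary cache file name
    if os.path.exists(cache_file):
        dfs.append(pd.read_csv(cache_file, parse_dates=['date']))
    else:
        df_year = get_weather_for_year(year)
        df_year.to_csv(cache_file, index=False)
        dfs.append(df_year)

df = pd.concat(dfs)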

Okay, after a while, the data is now ready. Let’s take a quick look at the DataFrame we got:

>> df = get_weather_from_date(2015, 1, 1)
>> df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18810 entries, 0 to 10
Data columns (total 7 columns):
city        18810 non-null object
max_temp    8550 non-null object
min_temp    8550 non-null object
rain        18810 non-null object
wind        8550 non-null object
daylight    8548 non-null object
date        18810 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(6)
memory usage: 1.1+ MB

Hmm, 18810 entries in total, but only 8550 of them have values for the temperature columns. Recall from the data of January 1st, 2015 that more than half of the cities have NaN values. Luckily, Yokohama, the city where I live, doesn’t!

Let’s make sure that we have the latest data, which is September 6th, 2019 at the time of writing.
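
A quick way to check is to look at the newest value in the date column (a tiny sketch on the DataFrame we just downloaded):

# The most recent date present in the downloaded data
print(df['date'].max())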

That’s great. It’s time to clean up the data. Below are what we’re gonna do:

  • Convert the city column to English, since most of my readers are non-Japanese speakers
  • Strip the parenthesized parts from the max_temp, min_temp and wind columns
  • Convert string numbers to numeric types
  • Fill NaN values with -1e9 as a sentinel for missing data
  • Reset the index column

Here is the function to do all of the above:

def clean_data(df, city_mapping):
    df = df.copy()
    # translate the Japanese city names using the mapping defined below
    df.city = df.city.map(city_mapping)
    # the raw values contain extra parenthesized text; keep only the leading number
    temp = df.max_temp.str.split()
    df.max_temp = pd.to_numeric(temp.str.get(0))
    temp = df.min_temp.str.split()
    df.min_temp = pd.to_numeric(temp.str.get(0))
    temp = df.wind.str.split()
    df.wind = pd.to_numeric(temp.str.get(0))
    df.rain = pd.to_numeric(df.rain)

    # mark missing values with an obviously out-of-range sentinel
    df = df.fillna(-1e9)

    # rebuild a clean 0..n-1 index
    df = df.reset_index(drop=True)

    return df

Let’s define city_mapping, which is a plain Python dict, and call the function:

>> city_mapping = {
    '横浜': 'Yokohama',
    '日吉': 'Hiyoshi',
    '海老名': 'Ebina',
    '三浦': 'Miura',
    '辻堂': 'Tsujido',
    '相模原中央': 'Sagamihara-Chuo',
    '平塚': 'Hiratsuka',
    '相模湖': 'Sagamiko',
    '丹沢湖': 'Tanzawako',
    '小田原': 'Odawara',
    '箱根': 'Hakone'
}
>> cleaned_df = clean_data(df, city_mapping)
>> cleaned_df.head(10)

The result is as follows:

Great. Now the data is ready to be stored.

2. Store the data in the database

First question though, why don’t we just save the data into a CSV file?

Well, CSV files are great during the experimentation phase. Thanks to online courses and especially Kaggle competitions, you are probably more than familiar with pd.read_csv and, of course, DataFrame.to_csv as well.

But in real-life applications, that might be a bad idea. Some reasons why using a database is the better approach:

  • Speed: a database can quickly extract just the part of the data you need through querying
  • Security: a database is more secure thanks to its authentication functionality
  • Data concurrency: multiple reads & writes can be performed at the same time
  • etc

Okay, enough talking. Let’s get down to business. First, let’s export the cleaned data to a CSV file just in case something breaks later.

cleaned_df.to_csv('kanagawa_weather_20150101_20190904.csv', index=False)

Now we are safe. Let’s import the necessary dependencies. Everything we need is provided in a package called sqlalchemy, which we installed at the beginning.

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Float, DateTime
from sqlalchemy.schema import UniqueConstraint
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import IntegrityError

We will use SQLite within this post. SQLite may lack some key functionalities such as user authentication or data concurrency, but it is a good start and what we do with SQLite can be applied to other databases with minimum modification (you will see in the next post).

First, we need to create an engine which will help us communicate with the database.

engine = create_engine('sqlite:///weathers.sqlite')
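
A small aside before we move on: create_engine is also the single place you would touch if you later switch databases, and it can log the SQL it generates. The sketch below is purely illustrative; verbose_engine and the PostgreSQL URL are hypothetical names, not something the rest of the post relies on.

# Optional: echo=True makes SQLAlchemy log every SQL statement it executes,
# which is handy for seeing what happens under the hood (hypothetical variable name).
verbose_engine = create_engine('sqlite:///weathers.sqlite', echo=True)

# Switching to another database later should, in principle, only require a new URL.
# Left commented out because it needs a running server and a driver such as psycopg2.
# engine = create_engine('postgresql://user:password@localhost:5432/weatherdb')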

Next, we will define a schema for the weather data. Basically, a schema defines the structure of the table and the data type of each column. sqlalchemy provides us with a way to do this in an object-oriented manner.

Base = declarative_base()

class Weather(Base):
    __tablename__ = 'weathers'
    
    id = Column(Integer, primary_key=True)
    date = Column(DateTime, nullable=False)
    city = Column(String(50), nullable=False)
    max_temp = Column(Float)
    min_temp = Column(Float)
    wind = Column(Float)
    rain = Column(Float)
    daylight = Column(Float)
    
    # each city gets at most one record per date
    __table_args__ = (UniqueConstraint('date', 'city'),)
    
    def __init__(self, **kwargs):
        super(Weather, self).__init__(**kwargs)
    
    def __repr__(self):
        return f'<Weather(date={self.date}, city={self.city}, max_temp={self.max_temp}, min_temp={self.min_temp}, wind={self.wind}, daylight={self.daylight})>'

So, what happened here? We just defined a table called weathers with the same columns as the actual data. We also stated that the date and city columns cannot be null and that the combination of the two must be unique.

Let’s create the weathers table:

if Weather.__table__.exists(engine):
    Weather.__table__.drop(engine)
Weather.__table__.create(engine)

What else do we need? A session bound to the engine to interact with the database.

Session = sessionmaker(bind=engine)
session = Session()

Typically, below is how to add a new record to the database:

weather = Weather(city=..., date=..., max_temp=..., min_temp=..., wind=..., rain=..., daylight=...)
session.add(weather)
session.commit()

In order to apply that pattern to the DataFrame above, we need to convert it into Python dicts, one per row. Luckily again, pandas is ridiculously good at this. Just remember to pass 'index' when calling the to_dict method: it returns a dict of row dicts keyed by the DataFrame index.

dict_df = cleaned_df.to_dict('index')
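
It is worth peeking at one of the records to see that its keys line up with the column names of the model we defined above (there is no id key, which is fine because id is generated automatically):

# Grab an arbitrary row dict and inspect it
sample_record = next(iter(dict_df.values()))
print(sample_record.keys())  # city, max_temp, min_temp, rain, wind, daylight, date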

Okay, it’s time to put the data into the database:

try:
    session.add_all([Weather(**rec) for rec in dict_df.values()])
    session.commit()
except IntegrityError:
    session.rollback()
    print('Duplicate record!')

Also, it’s always good practice to watch out for IntegrityError, which indicates that something went wrong during the commit, for example a violation of the unique constraint we defined.

Let’s query back some data from the database to ensure that our data was safely added.

>> session.query(Weather).filter_by(city='Yokohama').limit(10).all()

[<Weather(date=2015-01-01 00:00:00, city=Yokohama, max_temp=8.5, min_temp=0.6, wind=8.8, daylight=0.1)>,
 <Weather(date=2015-01-02 00:00:00, city=Yokohama, max_temp=7.7, min_temp=-0.8, wind=4.9, daylight=5.9)>,
 <Weather(date=2015-01-03 00:00:00, city=Yokohama, max_temp=8.1, min_temp=0.4, wind=6.1, daylight=9.6)>,
 <Weather(date=2015-01-04 00:00:00, city=Yokohama, max_temp=9.3, min_temp=1.0, wind=3.0, daylight=6.7)>,
 <Weather(date=2015-01-05 00:00:00, city=Yokohama, max_temp=13.2, min_temp=3.0, wind=6.3, daylight=9.3)>,
 <Weather(date=2015-01-06 00:00:00, city=Yokohama, max_temp=16.6, min_temp=6.5, wind=11.2, daylight=1.4)>,
 <Weather(date=2015-01-07 00:00:00, city=Yokohama, max_temp=9.5, min_temp=3.7, wind=8.7, daylight=4.1)>,
 <Weather(date=2015-01-08 00:00:00, city=Yokohama, max_temp=11.3, min_temp=2.9, wind=7.1, daylight=9.5)>,
 <Weather(date=2015-01-09 00:00:00, city=Yokohama, max_temp=11.1, min_temp=3.5, wind=5.4, daylight=9.2)>,
 <Weather(date=2015-01-10 00:00:00, city=Yokohama, max_temp=11.5, min_temp=2.4, wind=3.9, daylight=9.6)>]

If you check the working directory, you will find a file named weathers.sqlite of around 2.3MB in size, which indicates that the data was successfully put into the database.
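
You can also confirm this from Python itself; a tiny sketch using the os module we imported at the very beginning:

# Size of the SQLite database file in megabytes
size_mb = os.path.getsize('weathers.sqlite') / (1024 * 1024)
print('{:.1f} MB'.format(size_mb))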

Okay, we have walked through objective No. 2. What we have done so far has got us ready for the next step of creating a RESTful API, which will be the topic of the next post.

For now, let’s do some analysis on the weather data of Yokohama to see if my feeling was correct: wasn’t it cold this July?

3. Analyze Yokohama weather data

One great thing about using a database is that we can get the data back very easily. Furthermore, we can apply filters to get just the piece of information we need. For example, here’s how to get the data for Yokohama from the table:

df_yoko = pd.read_sql("SELECT * FROM weathers WHERE city = 'Yokohama'", con=engine, parse_dates='date', index_col='id')
df_yoko = df_yoko.set_index('date')
df_yoko.head()

And here is what the result looks like:

Cool, isn’t it? Now we have a DataFrame which contains Yokohama weather data from 2015/1/1 to 2019/9/6. Let’s take a look at what the weather of a typical year in Japan looks like.

df_2015 = df_yoko.loc[df_yoko.index.year == 2015, :]
df_2015.resample('W').mean().plot(y='max_temp', figsize=(20, 8))

Here is the graph:

In Japan, the summer temperature reaches its peak around the first two weeks of August, which means that in July it’s pretty normal for it to already be damn hot. Let’s zoom in on the temperature of July 2015:

df_2015_7 = df_yoko.loc[(datetime(2015, 7, 1) <= df_yoko.index) & (df_yoko.index < datetime(2015, 8, 1)), :]
df_2015_7.plot(y='max_temp', figsize=(20, 8))

Okay, that’s a little bit of context. Now let’s first see how the minimum temperatures in July from 2015 to 2019 differ. We will use a box plot to see the typical value and how the temperatures vary within July:

df_yoko_july = df_yoko[df_yoko.index.month == 7]
df_yoko_july.pivot_table(values='min_temp', columns=df_yoko_july.index.year, index=df_yoko_july.index.day).plot(kind='box', figsize=(20, 8))

And of course, the max temperatures as well:
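
The graph was presumably produced with the same pivot as before, just swapping min_temp for max_temp:

# Box plot of July maximum temperatures, one box per year
df_yoko_july.pivot_table(values='max_temp', columns=df_yoko_july.index.year, index=df_yoko_july.index.day).plot(kind='box', figsize=(20, 8))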

As we can see, 2019 has the lowest average July temperature. Although its temperature varies more than in other years, 2019’s July was inarguably cooler, especially compared with 2017 and 2018.

By the way, 2018 was goddamn hot! Looking at the graphs, we can see that the 75th percentile of 2019 is even lower than the 25th percentile of 2018.

To be more precise, let’s see which year has the most “hottest days” across all 5 years, i.e. for each day of July (July 1st, July 2nd, and so on), find the year on which that day was the hottest.

>> idx = df_yoko_july.groupby(df_yoko_july.index.day)['max_temp'].idxmax()
>> max_temp_july = df_yoko_july.loc[idx]
>> max_temp_july.index.year.value_counts()

2018    14
2015     7
2017     7
2016     3
Name: date, dtype: int64

Of all 31 days in July, 2018 is the champion with 14 days on which it was the hottest. What a year it was! 2019 didn’t even show up in the list.

How about the coolest days?

>> idx = df_yoko_july.groupby(df_yoko_july.index.day)['min_temp'].idxmin()
>> min_temp_july = df_yoko_july.loc[idx]
>> min_temp_july.index.year.value_counts()

2019    11
2016     9
2015     7
2017     3
2018     1
Name: date, dtype: int64

Well, the numbers don’t lie. 2019 came first with 11/31 days on which it was the coolest. 2018 managed to get into the list with 1 day. Not so bad 😉

Of course, the temperature varies throughout the month, and if you are curious, below is the graph of July’s temperatures for all 5 years.
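
If you want to reproduce it, a sketch along these lines should give a similar picture (one line per year, with the day of the month on the x-axis; I am assuming max_temp here):

# Daily July maximum temperature, one line per year
df_yoko_july.pivot_table(values='max_temp', columns=df_yoko_july.index.year, index=df_yoko_july.index.day).plot(figsize=(20, 8))
plt.xlabel('Day of July')
plt.ylabel('Max temperature')
plt.show()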

I found it pretty interesting that matplotlib used the red color for 2018. Can’t be more accurate 😉

So, that was enough to tell that it wasn’t nonsense to feel that this year’s July was magically cool (I even felt cold!). Of course, you can dig deeper and do more analysis or testing, but that is beyond the scope of this post. Feel free to share your findings 😉

Conclusion

So, we have gone a long way until now. Great job everyone. Hopefully, after this post, you can:

  • Get the content of a web page and extract information from it
  • Preprocess the data and do some analysis
  • Save the data to the database for further use

The jupyter notebook for this blog post can be found at my gist.

In the coming blog post, we will create a RESTful API using the Flask framework. Stay tuned and I will see you there!

Trung Tran is a Deep Learning Engineer working in the car industry. His main daily job is to build deep learning models for autonomous driving projects, ranging from 2D/3D object detection to road scene segmentation. After office hours, he works on his personal projects, which focus on Natural Language Processing and Reinforcement Learning. He loves writing technical blog posts, which help spread his knowledge and experience to those who are struggling. Less pain, more gain.
