Hello 2021!

Fahim Ahmad | 2021-01-01

Today is the first day of 2021. I started this year by learning how to scrape information from web pages, a skill to extract data quickly and efficiently.

I believe extracting a large amount of data in a short time is as unique skill as transforming raw data into useful information. Web scraping is a general and broad topic that I plan to learn in the coming weeks. As a starting point, and to celebrate the new year, I learned scraping HTML tables, a common structure of data storage on the web, into R data frames.

For instance, this page (https://www.dab.gov.af/exchange-rates) contains the AFN exchange rate which is updating on daily basis.

The read_html() function from the rvest package can be used to read the above web page.

# install.packages("rvest")
library(rvest)

data <- read_html("https://www.dab.gov.af/exchange-rates")

Once you read the HTML document, a part of it can be selected with the html_nodes() function. For example, the below function will return all tables from the above-mentioned page and parse them into data frames.

html_nodes(data, "table") %>% 
  html_table()

## [[1]]
## # A tibble: 10 × 5
##    Currency     `Cash (Sell)` `Cash (Buy)` `Transfer (Sell)` `Transfer (Buy)`
##    <chr>                <dbl>        <dbl>             <dbl>            <dbl>
##  1 USD$               70.6         70.4              70.5             70.4   
##  2 EURO€              74.0         73.4              73.8             73.5   
##  3 POUND£             88.0         87.2              87.7             87.4   
##  4 SWISS₣             77.5         77.0              77.4             77.2   
##  5 INDIAN Rs.          0.792        0.782             0.789            0.785 
##  6 PAKISTAN Rs.        0.252        0.244             0.249            0.247 
##  7 IRAN Toman          0.0012       0.0006            0.0011           0.0007
##  8 CNY¥               10.3          9.94             10.2             10.0   
##  9 UAE DIRHAM         19.2         19.1              19.2             19.2   
## 10 SAUDI RIYAL        18.7         18.6              18.7             18.6   
## 
## [[2]]
## # A tibble: 2 × 5
##   `Currency` `Cash (Sell)` `Cash (Buy)` `Transfer (Sell)` `Transfer (Buy)`
##   <chr>              <dbl>        <dbl>             <dbl>            <dbl>
## 1 USD$                70.0         69.8              69.9             69.8
## 2 EURO€               74.0         73.4              73.8             73.5

As the result indicates, the above page has two tables. We can extract a specific table by its index. For example, the function below returns the first table only.

html_nodes(data, "table") %>% 
  html_table() %>% 
  .[[1]]

## # A tibble: 10 × 5
##    Currency     `Cash (Sell)` `Cash (Buy)` `Transfer (Sell)` `Transfer (Buy)`
##    <chr>                <dbl>        <dbl>             <dbl>            <dbl>
##  1 USD$               70.6         70.4              70.5             70.4   
##  2 EURO€              74.0         73.4              73.8             73.5   
##  3 POUND£             88.0         87.2              87.7             87.4   
##  4 SWISS₣             77.5         77.0              77.4             77.2   
##  5 INDIAN Rs.          0.792        0.782             0.789            0.785 
##  6 PAKISTAN Rs.        0.252        0.244             0.249            0.247 
##  7 IRAN Toman          0.0012       0.0006            0.0011           0.0007
##  8 CNY¥               10.3          9.94             10.2             10.0   
##  9 UAE DIRHAM         19.2         19.1              19.2             19.2   
## 10 SAUDI RIYAL        18.7         18.6              18.7             18.6

The XPath selector can also be used with html_nodes() function to extract a specific table. To do so, open the web-page in Google Chrome > right-click on the desired table > click inspect. It will open the developer tool that will allow you to view the HTML codes that generated the web page.

Next, hover over different HTML elements in the developer console until your desired table is highlighted. Then right-click and copy the XPath.

Hint: It will be a line with <table> tag.

Then use the xpath argument with html_nodes() to read your desired table.

html_nodes(data, xpath = "/html/body/div[1]/main/section/div/div/div/div[2]/table") %>% 
  html_table()

## [[1]]
## # A tibble: 10 × 5
##    Currency     `Cash (Sell)` `Cash (Buy)` `Transfer (Sell)` `Transfer (Buy)`
##    <chr>                <dbl>        <dbl>             <dbl>            <dbl>
##  1 USD$               70.6         70.4              70.5             70.4   
##  2 EURO€              74.0         73.4              73.8             73.5   
##  3 POUND£             88.0         87.2              87.7             87.4   
##  4 SWISS₣             77.5         77.0              77.4             77.2   
##  5 INDIAN Rs.          0.792        0.782             0.789            0.785 
##  6 PAKISTAN Rs.        0.252        0.244             0.249            0.247 
##  7 IRAN Toman          0.0012       0.0006            0.0011           0.0007
##  8 CNY¥               10.3          9.94             10.2             10.0   
##  9 UAE DIRHAM         19.2         19.1              19.2             19.2   
## 10 SAUDI RIYAL        18.7         18.6              18.7             18.6

Now the data is ready for further analysis. Using the same method, I extracted the AFN exchange rate data from 2020-01-01 to 2020-12-31 from the following website: https://www.dab.gov.af/exchange-rates

Feel free to edit the codes in my GitHub account.

#  Loading required packages
library(rvest)  
library(tidyverse)

#Set the start and end dates
start <- as.Date("01-01-2020",format="%m-%d-%y")  
end   <- as.Date("12-31-2020",format="%m-%d-%y")  
final_data <- rbind()  

#Scraping the data from "https://www.dab.gov.af/exchange-rates"
while (start <= end){  
  
  print(paste0("Extracting table from https://www.dab.gov.af/exchange-rates?field_date_value=",format(start,"%m/%d/%y")))  
  df <- read_html(paste0("https://www.dab.gov.af/exchange-rates?field_date_value=",format(start,"%m/%d/%y")))  
  logic <- html_nodes(df, xpath = "/html/body/div[1]/main/section/div/div/div/div[2]/table") %>% html_table() %>% unlist()  
    
  if (!is.null(logic)) {  
    df_final <- html_nodes(df, xpath = "/html/body/div[1]/main/section/div/div/div/div[2]/table") %>% html_table() %>% data.frame() %>%  mutate(date = start)  
  }  
    
  final_data <- rbind(final_data, df_final)  
  start <- start + 1                      
}  

#Plotting the data
final_data %>%   
  filter(Currency %in% c("CNY¥", "EURO€", "INDIAN Rs.", "IRAN Toman", "PAKISTAN Rs.", "POUND£", "SAUDI RIYAL", "SWISS₣", "UAE DIRHAM", "USD$")) %>%   
  mutate(  
    Cash_sell = as.numeric(Cash..Sell.),  
    Cash_buy = as.numeric(Cash..Buy.)  
    ) %>%   
  select(Currency, date, contains("Cash_")) %>%   
  pivot_longer(-c("Currency", "date"), names_to = "names", values_to = "values") %>%   
  filter(!is.na(values)) %>%   
  ggplot(aes(x = date, y = values, color = Currency)) +  
  geom_line(show.legend = F) +  
  geom_point(show.legend = F, size = 0.1, color = "black") +  
  facet_grid(Currency~names, scales = "free_y") +  
  theme_bw() +  
  theme(strip.text.x = element_text(face = "bold"),  
        strip.text.y = element_text(size = 8),  
        plot.caption = element_text(color = "blue")  
        ) +  
  labs(x = NULL, y = NULL, caption = "Source: https://www.dab.gov.af/exchange-rates")