Today is the first day of 2021. I started this year by learning how to scrape information from web pages, a skill to extract data quickly and efficiently.
I believe extracting a large amount of data in a short time is as unique skill as transforming raw data into useful information. Web scraping is a general and broad topic that I plan to learn in the coming weeks. As a starting point, and to celebrate the new year, I learned scraping HTML tables, a common structure of data storage on the web, into R data frames.
For instance, this page (https://www.dab.gov.af/exchange-rates) contains the AFN exchange rate which is updating on daily basis.
The read_html()
function from the
rvest package can be used to read the above web
page.
# install.packages("rvest")
library(rvest)
data <- read_html("https://www.dab.gov.af/exchange-rates")
Once you read the HTML document, a part of it can be selected
with the html_nodes()
function. For example, the
below function will return all tables from the above-mentioned
page and parse them into data frames.
## [[1]]
## # A tibble: 10 × 5
## Currency `Cash (Sell)` `Cash (Buy)` `Transfer (Sell)` `Transfer (Buy)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 USD$ 70.6 70.4 70.5 70.4
## 2 EURO€ 74.0 73.4 73.8 73.5
## 3 POUND£ 88.0 87.2 87.7 87.4
## 4 SWISS₣ 77.5 77.0 77.4 77.2
## 5 INDIAN Rs. 0.792 0.782 0.789 0.785
## 6 PAKISTAN Rs. 0.252 0.244 0.249 0.247
## 7 IRAN Toman 0.0012 0.0006 0.0011 0.0007
## 8 CNY¥ 10.3 9.94 10.2 10.0
## 9 UAE DIRHAM 19.2 19.1 19.2 19.2
## 10 SAUDI RIYAL 18.7 18.6 18.7 18.6
##
## [[2]]
## # A tibble: 2 × 5
## `Currency` `Cash (Sell)` `Cash (Buy)` `Transfer (Sell)` `Transfer (Buy)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 USD$ 70.0 69.8 69.9 69.8
## 2 EURO€ 74.0 73.4 73.8 73.5
As the result indicates, the above page has two tables. We can extract a specific table by its index. For example, the function below returns the first table only.
## # A tibble: 10 × 5
## Currency `Cash (Sell)` `Cash (Buy)` `Transfer (Sell)` `Transfer (Buy)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 USD$ 70.6 70.4 70.5 70.4
## 2 EURO€ 74.0 73.4 73.8 73.5
## 3 POUND£ 88.0 87.2 87.7 87.4
## 4 SWISS₣ 77.5 77.0 77.4 77.2
## 5 INDIAN Rs. 0.792 0.782 0.789 0.785
## 6 PAKISTAN Rs. 0.252 0.244 0.249 0.247
## 7 IRAN Toman 0.0012 0.0006 0.0011 0.0007
## 8 CNY¥ 10.3 9.94 10.2 10.0
## 9 UAE DIRHAM 19.2 19.1 19.2 19.2
## 10 SAUDI RIYAL 18.7 18.6 18.7 18.6
The XPath selector can also be used with
html_nodes()
function to extract a specific table.
To do so, open the web-page in Google Chrome > right-click on
the desired table > click inspect. It will
open the developer tool that will allow you to view the HTML
codes that generated the web page.
Next, hover over different HTML elements in the developer console until your desired table is highlighted. Then right-click and copy the XPath.
Then use the xpath argument with
html_nodes()
to read your desired table.
html_nodes(data, xpath = "/html/body/div[1]/main/section/div/div/div/div[2]/table") %>%
html_table()
## [[1]]
## # A tibble: 10 × 5
## Currency `Cash (Sell)` `Cash (Buy)` `Transfer (Sell)` `Transfer (Buy)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 USD$ 70.6 70.4 70.5 70.4
## 2 EURO€ 74.0 73.4 73.8 73.5
## 3 POUND£ 88.0 87.2 87.7 87.4
## 4 SWISS₣ 77.5 77.0 77.4 77.2
## 5 INDIAN Rs. 0.792 0.782 0.789 0.785
## 6 PAKISTAN Rs. 0.252 0.244 0.249 0.247
## 7 IRAN Toman 0.0012 0.0006 0.0011 0.0007
## 8 CNY¥ 10.3 9.94 10.2 10.0
## 9 UAE DIRHAM 19.2 19.1 19.2 19.2
## 10 SAUDI RIYAL 18.7 18.6 18.7 18.6
Now the data is ready for further analysis. Using the same method, I extracted the AFN exchange rate data from 2020-01-01 to 2020-12-31 from the following website: https://www.dab.gov.af/exchange-rates
Feel free to edit the codes in my GitHub account.
# Loading required packages
library(rvest)
library(tidyverse)
#Set the start and end dates
start <- as.Date("01-01-2020",format="%m-%d-%y")
end <- as.Date("12-31-2020",format="%m-%d-%y")
final_data <- rbind()
#Scraping the data from "https://www.dab.gov.af/exchange-rates"
while (start <= end){
print(paste0("Extracting table from https://www.dab.gov.af/exchange-rates?field_date_value=",format(start,"%m/%d/%y")))
df <- read_html(paste0("https://www.dab.gov.af/exchange-rates?field_date_value=",format(start,"%m/%d/%y")))
logic <- html_nodes(df, xpath = "/html/body/div[1]/main/section/div/div/div/div[2]/table") %>% html_table() %>% unlist()
if (!is.null(logic)) {
df_final <- html_nodes(df, xpath = "/html/body/div[1]/main/section/div/div/div/div[2]/table") %>% html_table() %>% data.frame() %>% mutate(date = start)
}
final_data <- rbind(final_data, df_final)
start <- start + 1
}
#Plotting the data
final_data %>%
filter(Currency %in% c("CNY¥", "EURO€", "INDIAN Rs.", "IRAN Toman", "PAKISTAN Rs.", "POUND£", "SAUDI RIYAL", "SWISS₣", "UAE DIRHAM", "USD$")) %>%
mutate(
Cash_sell = as.numeric(Cash..Sell.),
Cash_buy = as.numeric(Cash..Buy.)
) %>%
select(Currency, date, contains("Cash_")) %>%
pivot_longer(-c("Currency", "date"), names_to = "names", values_to = "values") %>%
filter(!is.na(values)) %>%
ggplot(aes(x = date, y = values, color = Currency)) +
geom_line(show.legend = F) +
geom_point(show.legend = F, size = 0.1, color = "black") +
facet_grid(Currency~names, scales = "free_y") +
theme_bw() +
theme(strip.text.x = element_text(face = "bold"),
strip.text.y = element_text(size = 8),
plot.caption = element_text(color = "blue")
) +
labs(x = NULL, y = NULL, caption = "Source: https://www.dab.gov.af/exchange-rates")