Do you spend a lot of time writing the same code over and over again while working on your data science projects? Well, learning to turn your long chunks of code into a simple reusable function may be exactly what you need.
So…What is a Function?
A function is simply a set of statements used to perform a specific task. So, instead of writing out all of your code statements each time you want to perform a task, you could turn that code script into a function, and just call the function in future scenarios.
Let’s look at an easy example. Say I wanted to know the mean of numbers 3, 4 and 5. Writing out the full calculation would look like this:
(3 + 4 + 5) / 3
[1] 4
However, R has a built-in function called mean() which returns the mean of the numbers passed to it. Let’s take a look below, where I store “3, 4 and 5” in a vector (R’s name for a simple list of values) called numbers and find its mean using mean():
numbers <- c(3,4,5)
mean(numbers)
[1] 4
Here we can see that I simply had to wrap mean() around “numbers” to get its mean; this is neater than writing out the full calculation. Now, you might not think that this is a huge improvement when it comes to saving time, but trust me: when your code becomes more complex, you will understand just how handy creating your own functions can be.
It is also important to note that functions can and should be written to work with different values, a.k.a. arguments. In the example above, the argument passed to mean() was my set of numbers, because that is what I wanted the mean of. But what if you wanted to find the mean of a different set of numbers? Well, you can! All you have to do is pass different values when calling mean(). This works because functions are written with general arguments rather than specific values; you only supply the specific values when you call the function.
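As a quick illustration of reusing mean() with different arguments, here is a small sketch. The vector other_numbers is made up for this example and does not appear elsewhere in the post.

```r
# A different set of numbers (invented for illustration)
other_numbers <- c(10, 20, 30, 40)
mean(other_numbers)
# [1] 25

# The original vector still works the same way
numbers <- c(3, 4, 5)
mean(numbers)
# [1] 4
```

The function did not change between the two calls; only the argument did.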
So, what have we learned? That functions allow you to perform a task (any task) with one line of code, and that functions can be reused with different values or arguments. Now that you have a better understanding of what a function is and what it does, let’s look at the syntax of a function in preparation for writing a function.
Syntax of a Function
To create a function in R, you must follow the form below:
function_name <- function(arg1, arg2, ...) {
statements # your code to complete the task indicated by the function
return(result) # return the result of the task
}
function_name is the function name that you create based on what the function is supposed to accomplish. In this case, longer names are encouraged to give a clear indication of what the function is meant to perform. Additionally, make sure your name does not have spaces and does not already exist elsewhere in R.
arg1, arg2, etc. are the arguments: the inputs the function needs in order to accomplish its task.
function body is the code in between the {}, which consists of all the steps needed to perform the function’s task as well as to output the result.
return is where you indicate what result the function should output.
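To make the four parts concrete, here is a minimal sketch that follows the form above. The function name mean_of_three and its argument names are hypothetical, chosen only for this illustration; in practice you would just use mean().

```r
# Hypothetical example function: returns the mean of three numbers
mean_of_three <- function(first, second, third) {
  total <- first + second + third  # statements: the steps that do the work
  result <- total / 3
  return(result)                   # return: the value the function outputs
}

mean_of_three(3, 4, 5)
# [1] 4
```

Here mean_of_three is the function name, first, second and third are the arguments, everything between the {} is the function body, and return(result) says what comes back.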
Now that you know what the form of a function is, we can practice turning a code script into a function with my own example below.
Turning a Script into a Function
Several of my projects in the past have involved scraping text from webpages, cleaning that text and transforming it into a list of words with their word count (the number of times each word appeared in the text). I would then use the resulting list to analyze and visualize that data in different ways. Since I do the scraping and cleaning part a lot, I thought, “why not turn that code script into a function?”
It’s really as simple as assigning a name to my function, copying and pasting my code script into the body of a function and renaming specific values to general arguments. And then I would never have to write all of that code again!
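To see those three moves on a small scale first, here is a rough sketch using only base R. It is not the scraping code from my projects: count_words, text and the example sentences are hypothetical names and values invented for this illustration, and the word splitting is deliberately simplistic.

```r
# Script version: works only for this one sentence
sentence <- "The Earth is round and the Earth is blue"
tokens <- strsplit(tolower(sentence), " ")[[1]]  # lower-case, split on spaces
table(tokens)                                    # count each word

# Function version: give it a name, paste the code into the body,
# and replace the specific sentence with a general argument called text
count_words <- function(text) {
  tokens <- strsplit(tolower(text), " ")[[1]]
  counts <- table(tokens)
  return(counts)
}

count_words("The Earth is round and the Earth is blue")
count_words("Any other sentence works too")
```

The body of the function is the same two lines as the script; the only change is that the hard-coded sentence became an argument.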
Let’s take a look at what my code script looks like so you can understand all of the steps involved in the task of turning text on a webpage into a list of words and their word count:
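## Load Packages
# rvest provides read_html(), html_nodes() and html_text(); corpus provides as_corpus_frame(), term_stats() and stopwords_en
library(rvest)
library(corpus)
## Import Text from Webpage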
#Create the URL variable for desired webpage to be scraped
url <- 'https://science.nasa.gov/science-news/news-articles/keeping-an-eye-on-earth'
#Read the webpage HTML code and specific nodes (information) wanted from the webpage
webpage <- read_html(url)
text <- html_nodes(webpage,'p:nth-child(10) , p:nth-child(9) , p:nth-child(8) , p:nth-child(7) , p:nth-child(6) , p:nth-child(5) , p:nth-child(4) , p:nth-child(1)')
#Convert nodes to text
text <- html_text(text)
## Create Corpus with 'corpus' package
data <- as_corpus_frame(text)
## Clean Corpus
# Remove punctuation and stop words from corpus:
words <- term_stats(data, drop_punct = TRUE, drop = stopwords_en)
# Drop Column 3
words <- words[,1:2]
# Rename column names
colnames(words) <- c("word", "count")
print(words)
You will notice that this example scrapes text from a NASA article webpage. Additionally, the HTML nodes in this code identify the text to be scraped from the page; the HTML nodes for any webpage can be found using the Chrome extension ‘SelectorGadget’.
Looking at my code above – there is a lot going on there, right? This is a lot to write every time that I want to scrape text from a webpage. So, let’s turn this into a function!
Step One: Create Function Name
Since the task of my function is to turn text from a webpage into a list of words and their word count, I am going to call my function “web_words_to_list()” (I know, doesn’t sound so great, so let me know if you think of something else.)
Step Two: Copy Code Script into Function Body
web_words_to_list <- function(Arg1, Arg2) {
url <- 'https://science.nasa.gov/science-news/news-articles/keeping-an-eye-on-earth'
webpage <- read_html(url)
text <- html_nodes(webpage,'p:nth-child(10) , p:nth-child(9) , p:nth-child(8) , p:nth-child(7) , p:nth-child(6) , p:nth-child(5) , p:nth-child(4) , p:nth-child(1)')
text <- html_text(text)
data <- as_corpus_frame(text)
words <- term_stats(data, drop_punct = TRUE, drop = stopwords_en)
words <- words[,1:2]
colnames(words) <- c("word", "count")
print(words)
}
Following the syntax of a function, I have copied and pasted my code script into the function body and assigned it to my function name, web_words_to_list. Here, you will notice that the function is still specific to the NASA article and its nodes, and that I have not yet determined the function’s arguments; this is the next step.
Step Three: Replace Specific Values in the Function Body with General Arguments
web_words_to_list <- function(url, nodes) {
# url and nodes now come from the function's arguments
webpage <- read_html(url)
text <- html_nodes(webpage, nodes)
text <- html_text(text)
data <- as_corpus_frame(text)
words <- term_stats(data, drop_punct = TRUE, drop = stopwords_en)
words <- words[,1:2]
colnames(words) <- c("word", "count")
print(words)
}
Within the function, I have removed the NASA URL and its nodes, replacing them with general “url” and “nodes” arguments. Now that the function has been generalized, we can pass any URL and any nodes to the function’s arguments. This means we can put the function to use!
Step Four: Try out the Function
Let’s try out web_words_to_list() on the NASA webpage:
web_words_to_list(url = 'https://science.nasa.gov/science-news/news-articles/keeping-an-eye-on-earth', nodes = 'p:nth-child(10) , p:nth-child(9) , p:nth-child(8) , p:nth-child(7) , p:nth-child(6) , p:nth-child(5) , p:nth-child(4) , p:nth-child(1)')

And as a result we get a list of the words used in the article, as well as their word counts. Here is a preview of that list:
Using web_words_to_list() was much easier than writing out all of my code – wouldn’t you agree? And now I don’t need to write it ever again because I have this handy function. If you want to try out the web_words_to_list function for yourself, click here for the reference code.
Turning frequently used code into functions not only saves time in the future; it also makes your code reusable and easier to read, reduces errors and, most importantly, allows you to focus on solving bigger problems.
Now that you know what a function is, how to use one and how to write one, I hope that you try making your own to improve and organize your workflow.