How to Scrape/Crawl Research Data Using Selenium WebDriver and Java


As a researcher, there are many times when you will need to assemble a dataset of publicly available information (on websites) for research studies. For example, you may want to analyse thousands of comments on a forum, or download and process hundreds of .csv files from an online databank. You could get some research assistants to manually download this data (poor guys), or you could use a web scraper!

In such situations (and when it works), web scraping feels like the best thing since sliced bread.

Web scraping (web harvesting or web data extraction) is a computer software technique for extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web, either by implementing low-level Hypertext Transfer Protocol (HTTP) requests or by embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. This post details the process of using Java and the Selenium WebDriver to scrape data and assemble a dataset.

Why Java?

To be fair and honest, I have heard that there are really efficient libraries in Perl, Python and Ruby that can easily extract web data, and if you are comfortable with any of these languages, that is the way to go! However, if you are like me, and you are good buddies with Java, then this tutorial is for you. So, why Java? No. Other. Reason. Than. I. Like. It.

Why Selenium WebDriver? Wait ... What Is It?

So, Selenium is an open source software application that is used primarily to test web applications. You can write test commands in Selenese, e.g. open a page, click this button, and on the new page, log in, go to messages, and send a new message. Yes, that's the power of Selenium. The Selenium WebDriver API allows you to programmatically navigate a website, interact with it (clicks, text input, DOM traversal), and read the content of each page. You can read more on Selenium WebDriver here.
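
As a minimal sketch of what that looks like in Java (the URL below is just a placeholder), a WebDriver session boils down to: launch a browser, load a page, read content, quit.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class HelloWebDriver {
    public static void main(String[] args) {
        // Launch a real Firefox browser under programmatic control
        WebDriver driver = new FirefoxDriver();

        // Load a page (placeholder URL)
        driver.get("http://example.com");

        // Read content from the loaded page
        System.out.println("Page title: " + driver.getTitle());

        // Shut the browser down
        driver.quit();
    }
}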

Should You Use Selenium?

You must have figured out that Selenium wasn't really built for web scraping; it was built to automate website tests. There are other libraries that can be used to simply pull web content (e.g. the Jsoup parser is recommended). However, there are situations in which nothing else will work (without you pulling out your hair). Here are two main use cases for Selenium:

– Dynamic Content
When you can't get a direct link to the data you require, e.g. the page contains dynamic content that is updated via JavaScript after some user interaction. A simple HTML page request will only get you the version of the page before the user interaction. Note that sometimes it is possible to view the URL of the underlying AJAX call being made during user interaction and directly crawl that link, but this can be a problem if the calls are authenticated.
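
For illustration, here is a sketch of that scenario; the URL, button id, and result selector are hypothetical, so substitute the ones from the page you are scraping:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class DynamicContentExample {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://example.com/data"); // hypothetical page

        // Trigger the user interaction that fires the AJAX call
        driver.findElement(By.id("load-more")).click(); // hypothetical button id

        // Wait (up to 10 seconds) for the JavaScript-updated content to appear
        WebDriverWait wait = new WebDriverWait(driver, 10);
        WebElement results = wait.until(
                ExpectedConditions.presenceOfElementLocated(By.cssSelector("div.results")));

        // Read the content that only exists after the interaction
        System.out.println(results.getText());
        driver.quit();
    }
}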

– Authentication
A webpage that requires authentication. Imagine you need to scrape data from a forum that requires a login? Or you want to get all your own Facebook posts, or the posts of a consenting party? And each request (e.g. view forum photos, discussions etc.) after the login is also authenticated. Well, it may be possible to decipher the authentication protocol used and send the appropriate authentication data with each call, but that process can be long. Really long. Or you can instruct Selenium to act as a sneaky web browser (manipulated by you) to do the dirty work. The browser (Selenium) manages all the authentication protocol stuff once you have provided valid credentials.

If the two cases above do not apply to you (i.e. you aren't scraping dynamic content, and there is no authentication), then please do not use Selenium. You will be better served (speed-wise) by other web parsing tools.

You can use HTML parsers in your favourite language: Jsoup (Java), Nokogiri (Ruby), etc.
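
For instance, here is a minimal Jsoup sketch (assuming you add the org.jsoup:jsoup dependency to your pom) that pulls post titles with a single HTTP request, no browser needed; it uses the same blog URL and selector as the Selenium example later in this post:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExample {
    public static void main(String[] args) throws IOException {
        // One plain HTTP request; no browser is launched
        Document doc = Jsoup.connect("http://denvycom.com/blog/").get();

        // Select post titles with a CSS selector
        Elements titles = doc.select("h2.page-header");
        for (Element title : titles) {
            System.out.println(title.text());
        }
    }
}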

Steps to Scraping Data Using Selenium and Java

Step 1 – Download and Install Eclipse IDE (+ Maven)

The Eclipse IDE makes Java (and other language) development easier. It also comes with a useful tool called Maven. You may download Eclipse here.

Maven is a "build management tool": it is for defining how your .java files get compiled to .class files, packaged into .jar (or .war or .ear) files, (pre/post)processed with tools, how your CLASSPATH is managed, and all the other sorts of tasks that are required to build your project. It is similar to Apache Ant, Gradle, or Makefiles in C/C++, but it attempts to be completely self-contained, in that you shouldn't need any additional tools or scripts, by incorporating other common tasks like downloading and installing necessary libraries. More here.

If for some reason your Eclipse does not have Maven installed, please follow these tutorials on installing the Maven plugin for Eclipse.

Step 2 – New Maven Project

After a successful install of Maven in your Eclipse IDE, you should be able to create a New Maven Project.

Next, specify the Maven project parameters. These parameters can be obtained from the Selenium website.


You should now have a project named selenium-java in your Eclipse Project Explorer with a generated pom.xml file. To learn more about the pom.xml file, see here.

Modify your pom.xml file to the following content.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>MySel20Proj</groupId>
    <artifactId>MySel20Proj</artifactId>
    <version>1.0</version>
    <dependencies>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>2.44.0</version>
        </dependency>
        <dependency>
            <groupId>com.opera</groupId>
            <artifactId>operadriver</artifactId>
        </dependency>
    </dependencies>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>com.opera</groupId>
                <artifactId>operadriver</artifactId>
                <version>1.5</version>
                <exclusions>
                    <exclusion>
                        <groupId>org.seleniumhq.selenium</groupId>
                        <artifactId>selenium-remote-driver</artifactId>
                    </exclusion>
                </exclusions>
            </dependency>
        </dependencies>
    </dependencyManagement>
</project>


To download all Selenium dependencies:
– Right click on your project > Run As > Maven Install
– Right click on your project > Run As > Maven Build
– Right click on your project > Maven > Update Project > Tick the Offline box > OK


Maven will automatically download and install ALL the dependencies you need to get started. AWESOME!!!

You can now go to your source folder and create a new Java class. In this class, you can write code to launch a browser and start doing some magic.

Let's Scrape Some Data

We will create a quick class to load the denvycom blog home page, perform a search, and print out the titles of all blog posts listed on the page. This code is extendable to other tasks such as logging into a forum or social network and listing content. You should inspect the HTML structure of each page you are about to scrape; this will guide you while writing code to extract the content of interest (titles). To inspect the page HTML, you can right click on the page and, in Chrome, select View Page Source.


We can see that our titles are within h2 tags with a page-header class (and the post dates within span tags with an entry-date class). We use these selectors to extract them.

package com.denvycom.seleniumtest;

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class ScrapeData {

    public static void main(String[] args) {
        // Create a new instance of the Firefox driver.
        // Notice that the remainder of the code relies on the WebDriver
        // interface, not the FirefoxDriver implementation.
        WebDriver driver = new FirefoxDriver();

        // Now use this to visit the Denvycom blog
        driver.get("http://denvycom.com/blog/");
        // Alternatively the same thing can be done like this
        // driver.navigate().to("http://denvycom.com/blog/");

        // Find the Denvycom search input element by its id
        WebElement element = driver.findElement(By.id("s"));

        // Enter something to search for
        element.sendKeys("research");

        // Now submit the form. WebDriver will find the form for us from the element
        element.submit();

        // Check the title of the results page
        System.out.println("Page title is: " + driver.getTitle());
        // Should see: "All Articles on Denvycom related to the Keyword "Research""

        // Get the title and date of all posts
        List<WebElement> titles = driver.findElements(By.cssSelector("h2.page-header"));
        List<WebElement> dates = driver.findElements(By.cssSelector("span.entry-date"));
        System.out.println(" =============== Denvycom Articles on Research ================= ");

        // Guard against the two lists differing in length
        int count = Math.min(titles.size(), dates.size());
        for (int j = 0; j < count; j++) {
            System.out.println(dates.get(j).getText() + "\t - " + titles.get(j).getText());
        }

        // Close the browser
        driver.quit();
    }
}

In this code, first we find the search box element:

// Find the Denvycom search input element by its id
WebElement element = driver.findElement(By.id("s"));

And then we “type” in some text:

// Enter something to search for
element.sendKeys("research");

and hit the submit button:

// Now submit the form. WebDriver will find the form for us from the element
element.submit();

And finally, we read the results:

// Get the title and date of all posts
List<WebElement> titles = driver.findElements(By.cssSelector("h2.page-header"));
List<WebElement> dates = driver.findElements(By.cssSelector("span.entry-date"));

You can easily see how to use the same code to log in to a website, i.e. find the username and password elements, type in valid credentials, hit submit, and read the results!
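
A minimal sketch of that login flow, assuming a hypothetical login page whose form fields have the ids username and password (inspect your target site's actual HTML for the real ids):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class LoginExample {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();

        // Hypothetical login page; replace with your target site
        driver.get("http://example.com/login");

        // Find the credential fields (ids here are assumptions)
        driver.findElement(By.id("username")).sendKeys("myuser");
        driver.findElement(By.id("password")).sendKeys("mypassword");

        // Submit the login form; WebDriver finds the enclosing form for us
        driver.findElement(By.id("password")).submit();

        // The browser now holds the session cookies, so subsequent
        // driver.get(...) calls are authenticated
        System.out.println("Logged-in page title: " + driver.getTitle());
        driver.quit();
    }
}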

Your output should be a list of post dates and titles, one per line.

You can then save it to a database or dataset for later analysis.
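
For example, here is a quick sketch (the file name and CSV quoting scheme are my own choices) that writes the scraped dates and titles to a CSV file for later analysis:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

import org.openqa.selenium.WebElement;

public class CsvSaver {
    // Writes one "date,title" row per post to the given CSV file
    public static void saveToCsv(List<WebElement> dates, List<WebElement> titles,
                                 String fileName) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(fileName))) {
            out.println("date,title"); // header row
            int count = Math.min(dates.size(), titles.size());
            for (int j = 0; j < count; j++) {
                // Quote fields so commas inside titles don't break the CSV
                out.println("\"" + dates.get(j).getText() + "\",\""
                        + titles.get(j).getText() + "\"");
            }
        }
    }
}

You would call saveToCsv(dates, titles, "posts.csv") right after the print loop in the main example.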

Other Tips to Optimize Your Code

Selenium can be slow, but if your code is written carefully, there are ways to improve its speed. Note that one of the biggest speed bottlenecks is the time taken to make each webpage request, so focus your optimization effort there. Some tips/examples:

– Match elements efficiently: Inspect the CSS of elements carefully on each page to improve matching. Where possible, use selectors that resolve to only one element on the page.
– Do not load multiple pages if you can get all your data from one: For example, some forum pages have links to "load the next 20 comments". You could write code to "click" this link until you get to the end of the comment list. Or you could find out if a URL parameter allows you to load the entire dataset, e.g. /forum/forum-phones?load=50&page=1.
In the above, tweaking "load=50" to "load=<size of dataset>" means you make one call rather than multiple calls.
– Disable image loading: To improve page load times, you can disable images in Firefox (see the sketch after this list).
– Carefully construct your request URLs: Each page you scrape is based on a request to load a given URL. The takeaway here is to use your knowledge of the website structure to reduce the number of requests you make. For example, I had assembled links to all the groups on a forum but also wanted to get the members of each group. My code was first loading each group page and then clicking the link to the member page.

Rather than making two URL calls (one to load the group page and one to load the member page), a better way would have been to construct the member page URL directly, in this case grouplink + "/members", and make a single call. I could do this because, from observation, member page links followed that structure.

– Don't be evil: Take care to ensure your scraper robot behaves like a regular user and doesn't mess up the server/API. Generally, I give a few seconds of pause between calls (also shown in the sketch below). Some APIs are badly written and can break under malformed URL requests, extreme crawling, etc.
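
Here is a short sketch combining two of these tips: blocking image loading through a FirefoxProfile preference, and pausing a few seconds between page requests (the URLs and the three-second pause are arbitrary examples):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;

public class PoliteScraper {
    public static void main(String[] args) throws InterruptedException {
        // Disable image loading to speed up page loads
        FirefoxProfile profile = new FirefoxProfile();
        profile.setPreference("permissions.default.image", 2); // 2 = block images
        WebDriver driver = new FirefoxDriver(profile);

        // Hypothetical list of pages to scrape
        String[] urls = {"http://example.com/page1", "http://example.com/page2"};
        for (String url : urls) {
            driver.get(url);
            System.out.println(driver.getTitle());

            // Don't be evil: pause a few seconds between requests
            Thread.sleep(3000);
        }
        driver.quit();
    }
}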

Finally …

Have you tried Selenium WebDriver? Was it useful? How did you manage its lack of speed?
