An automatic bill downloader in Java

In this article I am going to show how to download bills (or any other file ) from a website with HtmlUnit.

I suggest you to read these articles first : Introduction to web scraping with Java and Autologin

Since I am hosting this blog on Digital Ocean (10$ in credit if you sign up via this link), I will show how to write a bot to automatically download every bills you have.

Login

To submit the login form without needing to inspect the dom, we will use the magic method I wrote in the previous article.

Then we have to go to the bill page : https://cloud.digitalocean.com/settings/billing

String baseUrl = "https://cloud.digitalocean.com";  
String login = "email";  
String password = "password" ;

try {  
    WebClient client = Authenticator.autoLogin(baseUrl + "/login", login, password);

    HtmlPage page = client.getPage("https://cloud.digitalocean.com/settings/billing");
    if(page.asText().contains("You need to sign in for access to this page")){
        throw new Exception(String.format("Error during login on %s , check your credentials", baseUrl));
    }
}catch (Exception e) {
    e.printStackTrace();
}

Fetching the bills

Let's create a new Class called Bill or Invoice to represent a bill :

Bill.java

public class Bill {

    private String label ;
    private BigDecimal amount ; 
    private Date date;
    private String url ;
//... getters & setters
}

Now we need to inspect the dom to see how we can extract the description, amount, date and URL of each bill. Open your favorite tool :

We are lucky here, it's a clean DOM, with a nice and well structured table. Since HtmlUnit has many methods to handle HTML tables, we will use these :

  • HtmlTable to store the table and iterate on each rows

  • getCell to select the cells

Then, using the Jackson library we will export the Bill objects to JSON and print it.

HtmlTable billsTable = (HtmlTable) page.getFirstByXPath("//table[@class='listing Billing--history']");  
for(HtmlTableRow row : billsTable.getBodies().get(0).getRows()){

    String label = row.getCell(1).asText();
    // We only want the invoice row, not the payment one
    if(!label.contains("Invoice")){
        continue ;
    }

    Date date = new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH).parse(row.getCell(0).asText());
    BigDecimal amount =new BigDecimal(row.getCell(2).asText().replace("$", ""));
    String url = ((HtmlAnchor) row.getCell(3).getFirstChild()).getHrefAttribute();

    Bill bill = new Bill(label, amount, date, url);
    bills.add(bill);
    ObjectMapper mapper = new ObjectMapper();
    String jsonString = mapper.writeValueAsString(bill) ;

    System.out.println(jsonString);

It's almost finished, the last thing is to download the invoice. It's pretty easy, we will use the Pageobject to store the pdf, and call a getContentAsStreamon it. It's better to check if the file has the right content type when doing this (application/pdf in our case)

Page invoicePdf = client.getPage(baseUrl + url);

if(invoicePdf.getWebResponse().getContentType().equals("application/pdf")){  
    IOUtils.copy(invoicePdf.getWebResponse().getContentAsStream(), new FileOutputStream("DigitalOcean" + label + ".pdf"));
}

That's it, here is the ouput :

{"label":"Invoice for December 2015","amount":0.35,"date":1451602800000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for November 2015","amount":6.00,"date":1448924400000,"url":"/billing/XXXX.pdf"}
{"label":"Invoice for October 2015","amount":3.05,"date":1446332400000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for April 2015","amount":1.87,"date":1430431200000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for March 2015","amount":5.00,"date":1427839200000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for February 2015","amount":5.00,"date":1425164400000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for January 2015","amount":1.30,"date":1422745200000,"url":"/billing/XXXXXX.pdf"}
{"label":"Invoice for October 2014","amount":3.85,"date":1414796400000,"url":"/billing/XXXXXX.pdf"}

`

As usual you can find the full code on this Github Repo

I hope you enjoyed this post, please subscribe to my newsletter :)

Kevin Sahin

You want to work with me for advices / custom scraping work ? Just send me an email at hi@ksah.in :)

Paris