Sharing investing and trading ideas. Helping traders get started.

Sunday, June 28, 2009

Automate Excel Web Queries using a Crawler: Downloading Historical Data

In a previous article, I wrote about clearing the cache for your Excel web queries in VBA and disabling background queries. However, that method is less flexible and requires you to specific tables for Excel to extract the data from. Here, I recommend a method to perform web queries using a so-called "XMLHTTP" object, which is potentially much faster than a typical data connection. It downloads the html source of your data source as text, searches for a stipulated number of characters after a particular date, e.g. 25-June-09 and removes the html tag, leaving the data you want. The following example provides the necessary VBA code to crawl a hypothetical website, for historical financial data on the Dow Jones Industrial Average Index DJIA.

Step 1: Set up workbook

Download the component stocks of the DJIA and set up your spreadsheet like the picture below. Column A shall be your stock symbols, followed by the stock name and the URL where you want to download the data from for each stock. This spreadsheet assumes that the URL of your data source is structured in the following way.

e.g.^DJI, where the last four characters represent the stock symbol of each company.

Column E shows the dates that have the financial data you want. You should into Column E samples of data you want to extract. In this case, I input the respective dates I want to macro to search for. You have to check for yourself that the URL in Column C will have the dates that you want. Input an arbitrary value of 600 into Cell G2.

In a second sheet, fill the first row with the following headers, "Date, Company, Extracted Text," followed by "Open, High, Low, Close, Volume" or whatever order the data from your source comes with.

Step 2: Coding - Download Data

Copy the following VBA code in. My comments are in bold so that they stand out from the VBA coding. They will explain the macros along the way.

Sub getdata()
'Dim your variables
Dim url1 As String
Dim date1 As String
Dim http1 As Object
Dim start1 As Long
Dim length1 As Long

'Store the value in cell G2 earlier as length1
length1 = Sheet1.Range("G2").Value

'Sheet1.Range("C2:C32") is the list of URL from earlier
'rcell is a variable to refer to each URL in the list
For Each rcell In Sheet1.Range("C2:C32")
url1 = rcell.Value
'Let Excel know that http1 is a XMLHTTP object
Set http1 = CreateObject("MSXML2.XMLHTTP")
'The open method initializes a GET request
'from the WWW as specified by url1,
'The option FALSE makes sure that the download
'is completed before the macro continues.
http1.Open "GET", url1, False
'The send Method sends to the URL the request from
'the Open method and receives the response
'In this case, the response is a html file, and the line
'below stores the html code as text in the variable
text1 = http1.responseText
'Format Sheet2
'Pasting the dates and quote symbols over
With Sheet2
lastrow = .Cells(1000000, 1).End(xlUp).Row + 1
Sheet1.Range("E2:E20").Copy .Cells(lastrow, 1)
lastrow1 = .Cells(1000000, 1).End(xlUp).Row
Range(.Cells(lastrow, 2), .Cells(lastrow1, 2)).Value = rcell.Offset _
(0, -2).Value
End With
'Below does this: For each data listed in Range("E2:E2o")
'Find the following string, e.g. ">5-Jun-09"
'Adding ">" in front makes sure that you get 5-Jun-09 data not
'25-Jun-09 data. This assumes that your data source html file
'surrounds your dates with in a table with lots of
For Each bcell In Sheet1.Range("E2:E20")
b = Len(Day(bcell.Value))
If b = 1 Then
date1 = ">" & Format(bcell.Value, "d-mmm-yy")
date1 = ">" & Format(bcell.Value, "dd-mmm-yy")
End If
start1 = InStr(1, text1, date1, vbTextCompare)
'Once the date is found in the html, store the next
'600 characters as specified in the column
'Extracted Text
lastrow = Sheet2.Cells(1000000, 3).End(xlUp).Row + 1
Sheet2.Cells(lastrow, 3).Value = Mid(text1, start1 + 1, length1)
'loop to next date but stay in same url
Next bcell
'loop to next url/company
Next rcell

'Call the macro to do the formating work
Call formating1
End Sub

Sub formating1()
'Assuming that the data you want is bounded by
'html tags e.g.
'The replace method below replaces the html tags
'with /
Sheet2.Cells.Replace What:="<*>", Replacement:="/", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _

'This method removes the broken
'at the last few characters of the 600 character string
Sheet2.Cells.Replace What:="<*", Replacement:="", _
LookAt:=xlPart, SearchOrder:=xlByRows, _
MatchCase:=False, SearchFormat:=False, _

'converts your data to columns, treating consecutive // as
'one delimiter.
Sheet2.Columns(3).TextToColumns , DataType:=xlDelimited, _
TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=False, _
Semicolon:=False, Comma:=False, Space:=False, Other:=True, OtherChar _
End Sub

Your output should look like below. As 600 characters were extracted for each date, some excess data can be found from Column J onwards. These data can be further screened to see if the same date, e.g. 3 Jun 09 for BAC, appears twice, once for the price data and once for the dividend data. You will notice that this macro will run considerably faster than its conventional web query equivalent.

Indeed, this macro works more like a web crawler and can be easily amended to retreive live quotes. It resolves the bugs that arise when looping through multiple web queries. AND it is more flexible. The sample code and file can be downloaded from the links below. Remember to change the URL links. I left them as Click here to view Method A.

[Latest Note: Kind miken in the comments section below alerted me to some issues with this method. I did some checking. Firstly, google has amended its URL for price data. As such, the sample spreadsheet I provided may require some updating on your own. 

Secondly, the sample spreadsheet is saved in an older version of excel which only had 65,000 plus rows. I coded it to read row 1million as the last row as I was using a new version of excel. You will have to change that part of the code as well. 

Therefore, the code below and the sample spreadsheet are only for your reference. You probably will have to amend it to achieve the results you want. 

I am deferring the update of this spreadsheet till later, as I am currently working on other parts of this website]

Note: Spreadsheets provided by are provided without warranty. Users should not be using them for any unlawful activity.


On the Open method

On the Send method

On the ResponseText property

Like what you have just read? Digg it or Tip'd it.
The objective of Finance4Traders is to help traders get started by bringing them unbiased research and ideas. Since late 2005, I have been developing trading strategies on a personal basis. Not all of these models are suitable for me, but other investors or traders might find them useful. After all, people have different investment/trading goals and habits. Thus, Finance4Traders becomes a convenient platform to disseminate my work...(Read more about Finance4Traders)


MIke Nixon said...

With Sheet2
lastrow = .Cells(1000000, 1).End(xlUp).Row + 1

stopped on the lastrow , ahh maybe not 1000000 rows ?!

shinju said...

Thanks. I have been away for some time. Will update this section of the website at a later time. Have updated this post to warn readers of issues.

Post a Comment