Stock price scraping can be a nightmare if the APIs you’re trying to use are not up to date. A few months ago I was looking for free sources of one-minute-level data. Apart from having trouble with the Yahoo Finance API (apparently outdated at the time) and having to tweak some code samples from GitHub to scrape Google Finance, I found a shiny new provider of real-time stock prices: Alpha Vantage.
So I decided to develop a script to download data once a week for backtesting. The Alpha Vantage API is straightforward to use, as seen below. It returns a pandas dataframe that we later save in a gzip file. You only have to request a free API key from the Alpha Vantage website.
""" Retrieve intraday stock data from Alpha Vantage API. """ #Alpha Vantage API to download 15 days of minute data (only if required) from alpha_vantage.timeseries import TimeSeries apikey='YOUR_ALPHA_VANTAGE_API_KEY' # Get pandas object with the intraday data and another with the call's metadata ts = TimeSeries(key=apikey, output_format='pandas') |
For Google Finance, I had seen several code samples on GitHub and other blogs. I literally copied the code below from those, changing only the URL of the requests to make it work.
""" Retrieve intraday stock data from Google Finance. """ import csv import datetime import re import pandas as pd import requests def get_google_finance_intraday(ticker, period=60, days=1, exchange='USD', debug=False): """ Retrieve intraday stock data from Google Finance. Parameters ---------- ticker : str Company ticker symbol. period : int Interval between stock values in seconds. days : int Number of days of data to retrieve. Returns ------- df : pandas.DataFrame DataFrame containing the opening price, high price, low price, closing price, and volume. The index contains the times associated with the retrieved price values. """ uri = 'https://finance.google.com/finance/getprices' \ '?&p={days}d&f=d,o,h,l,c,v&q={ticker}&i={period}?x={exchange}'.format(ticker=ticker, period=period, days=days, exchange=exchange) if(debug): print (uri) page = requests.get(uri) reader = csv.reader(page.content.splitlines()) columns = ['Open', 'High', 'Low', 'Close', 'Volume'] rows = [] times = [] for row in reader: if re.match('^[a\d]', row[0]): if row[0].startswith('a'): start = datetime.datetime.fromtimestamp(int(row[0][1:])) times.append(start) else: times.append(start+datetime.timedelta(seconds=period*int(row[0]))) rows.append(map(float, row[1:])) if len(rows): return pd.DataFrame(rows, index=pd.DatetimeIndex(times, name='Date'), columns=columns) else: return pd.DataFrame(rows, index=pd.DatetimeIndex(times, name='Date')) |
With both functions downloading the data into a dataframe, the only missing bit is the code that saves that dataframe as a CSV or gzip file. I decided, though, to complicate the idea a bit. The four functions below read a CSV of stock symbols and scrape and save each one from the source specified as a parameter (Google Finance or Alpha Vantage).
download_list_of_prices receives a CSV file and a pointer to the last symbol downloaded (I normally use this for debugging) and tries to download each following symbol listed in the CSV up to X times (where X is an input parameter); sometimes the Alpha Vantage service is unavailable and a good couple of tries is required for some symbols. That ‘trying’ is performed by the function try_download.
The function download_single_price_from saves the dataframe for each iterated symbol, first creating a folder for that symbol if needed (using the function check_or_create_path).
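The retry idea can be sketched independently of any API: wrap a flaky call in a loop with a fixed pause between attempts, and give up after a maximum number of tries. The names below (`try_call`, `flaky`) are illustrative, not from the post; an iterative loop is used instead of the post's recursive retry, which avoids deep recursion when the maximum is large:

```python
import time

MAX_TRIES = 10

def try_call(fn, max_tries=MAX_TRIES, wait_seconds=0):
    """Call fn(), retrying up to max_tries times; return True on success."""
    for attempt in range(1, max_tries + 1):
        try:
            fn()
            return True
        except Exception as e:
            print("Try #%d of %d failed: %s" % (attempt, max_tries, e))
            time.sleep(wait_seconds)
    return False

# Demo: a call that fails twice before succeeding
state = {'calls': 0}
def flaky():
    state['calls'] += 1
    if state['calls'] < 3:
        raise IOError("service unavailable")

ok = try_call(flaky, wait_seconds=0)
print(ok)  # True (succeeds on the third try)
```

In the real script the pause is 5 seconds, which is usually enough for Alpha Vantage to recover between attempts.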
""" Download price for a given symbol using either Google Finance or Alpha Vantage """ def download_single_price_from(symbol,period=60,days=20,exchange='USD',site="google",debug=True, \ path="default" ,name="default"): #real max days at 1min level is 15... df = pd.DataFrame({'A' : []}) site_option = "" # Download index price if site=="google_finance": df=get_google_finance_intraday(symbol,period,days,exchange,debug) elif site=="alpha_vantage": df, meta_data = ts.get_intraday(symbol, interval='1min', outputsize='full') # Save index prices output_file=check_or_create_path(path)+"/"+name+"_"+str(datetime.date.today())+".csv.gz" df.to_csv(output_file, sep=';', compression='gzip') """ Trying to download the symbols '$MAX_TRIES' times. After, we assume that the symbol is not available in the given API/Provider. """ def try_download(symbol,period,days,exchange,site,debug,path, name, tries_count): try: download_single_price_from(symbol=symbol,period=period,days=days,exchange=exchange,\ site=site,debug=debug,path=path,name=symbol) except: # catch *all* exceptions e = sys.exc_info()[0] print( "<p>Error: %s</p>" % e ) # Recursive function that tries again from pointer when crashing (awaits 5 seconds to retry) if (tries_count<MAX_TRIES): time.sleep(5) print ("Trying again... try #"+str(tries_count+1)+" of a maximum of "+str(MAX_TRIES)) try_download(symbol,period,days,exchange,site,debug,path, name, tries_count=(tries_count+1)) else: print ("Maximum number of attemps for symbol: "+symbol+" from "+site+ \ ". 
Check in path '"+path+"' required.") """ Download price for all the symbols of a given list """ import os import sys import time def download_list_of_prices(root_path, list_file, symbols_subpath, \ period=60,days=20,exchange='USD',site="google",\ debug=True, from_symbol=''): # 1 Initialize pointer download_symbol=False if from_symbol != '' else True # 2 Load list of stocks symbols=pd.read_csv(root_path+"/"+list_file, sep=';', parse_dates=True, infer_datetime_format=True) # 3 Check output root paths (and create them if needed) symbols_full_path=\ check_or_create_path(str(check_or_create_path(root_path+"/"+site+"/"+symbols_subpath))+"/symbols") counter=0 # 4 Go through list of prices for symbol in symbols['symbol'].tolist(): counter=counter+1 # Looking for start symbol to start downloading # An start symbol should be provided for instance in case of a failed download # This is only to avoid starting from scratch in those cases. if (download_symbol or symbol==from_symbol): download_symbol=True # Print current symbol to be downloaded as feedback. # It also prints the symbol index in the whole list of symbols to download. print ""+str(counter)+"/"+str(len(symbols['symbol'].tolist()))+" = "+symbol # Trying to download the symbols 10 times. try_download(symbol=symbol,period=period,days=days,exchange=exchange,\ site=site,debug=debug,path=symbols_full_path+"/"+symbol+"/",\ name=symbol, tries_count=0) print("Done!") """ Check if the path does exist. If it doesnt, create a folder for the given symbol """ def check_or_create_path(path): directory = os.path.dirname(path) try: os.stat(directory) except: os.mkdir(directory) return path |
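The symbols list is just a semicolon-separated CSV with (at least) a `symbol` column, which is the only column the loader above actually uses. As a hypothetical example of the file it expects (the tickers and the extra `name` column are illustrative; the real SPX_list.csv would hold ~500 rows):

```python
import pandas as pd

# Write a tiny illustrative SPX_list.csv
with open('SPX_list.csv', 'w') as f:
    f.write('symbol;name\n')
    f.write('AAPL;Apple Inc.\n')
    f.write('MSFT;Microsoft Corp.\n')

# Load it the same way download_list_of_prices does
symbols = pd.read_csv('SPX_list.csv', sep=';')
print(symbols['symbol'].tolist())  # ['AAPL', 'MSFT']
```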
Finally, we only need to run those functions. I’d suggest using try_download, as seen below, to download a single symbol with a number of retries, and download_list_of_prices when we have a CSV ready with several symbols we want to fetch.
if __name__ == "__main__": # Download SPX index for site in ["google_finance","alpha_vantage"]: #download_single_price_from("SPX", 60, 20, "USD", site ,True, RAW_DATA_PATH+site+"/"+"S&P500/index","S&P500") try_download("SPX", 60, 20, "USD", site ,True, RAW_DATA_PATH+site+"/"+"S&P500/index","S&P500", 0) # Download all SPX stocks from_symbol="" #"LUK" #DAL" for site in ["google_finance","alpha_vantage"]: download_list_of_prices(root_path=PATH+"/raw",list_file="SPX_list.csv",symbols_subpath="S&P500", \ period=60,days=20,exchange='USD',site=site,debug=False,from_symbol=from_symbol) # Download SPY ETF for site in ["google_finance","alpha_vantage"]: try_download("SPY",60,20,"USD",site,True,RAW_DATA_PATH+site+"/"+"S&P500/spy_eft","SPY",0) # Download EURUSD for site in ["google_finance","alpha_vantage"]: try_download("EURUSD",60,20,"USD",site,True,RAW_DATA_PATH+site+"/"+"EURUSD","EURUSD",0) # Download Bitcoin-USD for site in ["google_finance","alpha_vantage"]: try_download("BTCUSD",60,20,"USD",site,True,RAW_DATA_PATH+site+"/"+"BITCOIN","BTCUSD",0) # Download NASDAQ for site in ["google_finance","alpha_vantage"]: try_download("IXIC",60,20,"USD",site,True,RAW_DATA_PATH+site+"/"+"NASDAQ/index","NASDAQ",0) # NASDAQ EFTs to track volumes try_download("QQQ",60,20,"USD",site,True,RAW_DATA_PATH+site+"/"+"NASDAQ/qqq_eft","QQQ",0) |
In the next post of this series we’ll see how to build a dataset of technical indicators from prices and volumes.