Venturing into Machine Learning, I quickly realized the need for good datasets. Web Scraping can come in very handy when datasets aren’t easily available.
It is important to use these web scraping bots in moderation and in accordance with the terms and conditions of the websites being scraped.
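One quick courtesy check is the site’s robots.txt file, which tells bots which paths they may fetch. Python’s standard library can parse it; the rules below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt content, for illustration only
robots_txt = """User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# check whether a given URL may be fetched by any crawler ("*")
print(rp.can_fetch("*", "http://example.com/news"))       # True
print(rp.can_fetch("*", "http://example.com/private/x"))  # False
```

In practice you would point `RobotFileParser.set_url` at the real site’s `/robots.txt` and call `read()` instead of parsing a hard-coded string.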
Here’s how to get started scraping websites with BeautifulSoup and Python.
Install these packages if needed:
pip install requests
pip install bs4
pip install lxml
Here’s a code snippet that fetches five headlines from BBC News:
import requests
from bs4 import BeautifulSoup

result = requests.get("http://www.bbc.com/news")
soup = BeautifulSoup(result.content, "lxml")
headlines = soup.find_all("h3")[:5]
for headline in headlines:
    print(headline.text)
Here are the docs for BeautifulSoup.
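BeautifulSoup works on any HTML string, so you can experiment with selectors offline before hitting a live site. The markup and class name below are invented for the example:

```python
from bs4 import BeautifulSoup

# made-up markup, standing in for a fetched page
html = """
<div>
  <h3 class="headline">First story</h3>
  <h3 class="headline">Second story</h3>
  <h3>Unrelated heading</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# filter by attributes instead of taking every <h3> on the page
headlines = [h.text for h in soup.find_all("h3", class_="headline")]
print(headlines)  # ['First story', 'Second story']
```

Filtering by class or other attributes like this is usually more robust than slicing the first few matches of a bare tag.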
These days, with many websites fetching data after the initial page load, the above method won’t cut it anymore.
Install selenium if needed:
pip install selenium
Here’s a code snippet that fetches the same five headlines from BBC News with Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# a raw string avoids backslash escapes in the Windows path
service = Service(r"C:\path\to\chromedriver\chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.implicitly_wait(5)  # wait up to 5 seconds when locating elements, so dynamic data can load
driver.get("http://www.bbc.com/news")
headlines = driver.find_elements(By.TAG_NAME, "h3")[:5]
for headline in headlines:
    print(headline.text)
driver.quit()
Here are the unofficial docs for Selenium.