Web Scraping for Open Source Intelligence

Ever have to scroll through multiple pages on Google to get the information you want, or gather information from several different websites? Manually going through each and every page of a site and picking out the information relevant to your needs can be long, tedious, and boring. There has to be a better way! Enter the wonderful art of web scraping.

Web scraping is a way to automate the process of going through a website and picking out the information that you need. This article will detail how to go about scraping the web and what web scraping can be used for.


You can use most scripting languages for web scraping, but this article will focus on Python. You will need to make sure the following libraries are installed before we begin:

* BeautifulSoup (bs4)

* Requests

* lxml


You can install these in Python by using the pip command:

    pip install <library to install>
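
For example, all three can be installed in one go (beautifulsoup4 is the name the bs4 library is published under on PyPI):

    pip install beautifulsoup4 requests lxml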

Once you have all of the needed libraries we can start scraping websites. We will use blog.vulsec.com as our example site. Once you are on the homepage, go ahead and view the source code of the page with Ctrl+U; this view will help us significantly while scraping the site. Now create a blank Python file and import the needed libraries at the top of it as follows:

    from bs4 import BeautifulSoup
    import requests

Now let's say we want our script to be able to grab every article located on the homepage of this blog and retrieve information about each one. We first need to retrieve the HTML content of the blog before we can do anything else. To do that we will use the requests library as follows:

    response = requests.get('http://blog.vulsec.com')
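
If the site is unreachable or actively blocks the request, everything that follows will fail in confusing ways, so a quick sanity check is worth adding first (a minimal sketch; 200 is the standard HTTP success code, and requests also provides raise_for_status() for the same purpose):

    # Stop early if the request did not come back with a success code.
    if response.status_code != 200:
        raise SystemExit('Request failed with status code {}'.format(response.status_code))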

The response variable will hold things like the status code we got back from the request, any cookies that were received from the page, the URL of the page (useful in case a link redirects you), and the HTML content of the page. We can create a BeautifulSoup object from this content using the following line of code:

    soup = BeautifulSoup( response.content, 'lxml' )

We pass in the response's HTML content as that is the data we wish to parse. The 'lxml' argument tells BeautifulSoup how we wish to parse the content. I prefer the lxml parser, but there are others you can choose from. BeautifulSoup's documentation tells us what parsers are available and the advantages of each one:

| Parser | Typical usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python's html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2) | Not very lenient (before Python 2.7.3 or 3.2.2) |
| lxml's HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
| lxml's XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only currently supported XML parser | External C dependency |
| html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |
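
If you are not sure whether lxml will be installed on the machine running the script, you can fall back to the built-in parser, which ships with Python itself (a minimal sketch):

    # Prefer the fast lxml parser when it is available, otherwise use
    # Python's built-in parser, which needs no external dependency.
    try:
        import lxml  # imported only to check availability
        parser = 'lxml'
    except ImportError:
        parser = 'html.parser'

    soup = BeautifulSoup(response.content, parser)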

Moving on, the next step for us is to grab the links for each article that is listed on the homepage of the blog. We can grab every link from an HTML page by using the command:

    links = soup.findAll( 'a' )

This returns a list of all the links on the page. The 'a' argument we passed in is the HTML tag we are trying to find ( <a href="..." ...> ). From this we can loop through each one and view the text of each link.

    for link in links:
        print(link.text)

We can look at our results and see... it's not exactly what we wanted.


Our script has grabbed every single link from the page, not just the ones for the articles. We can see that some of the results are correct, but then we get links for the home button, products button, etc. There has to be a way for us to filter these results.
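
One quick way to see what each result actually points at is to print the link's destination alongside its text (a minimal sketch):

    for link in links:
        # get('href') returns None for anchors that have no href attribute.
        print(link.text, '->', link.get('href'))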

If we go back into the HTML view of the website we can try to find the links to the articles and see how they are formatted. Scrolling down to them we can see that each post is surrounded by a <div> tag with the class name 'post-item'.


From this we can grab the individual pieces we want: the <h2> tag for the title of the article, the link with the class 'author-link' for who wrote it, and the links with the class 'topic-link' for the article's tags. Let's start with the title. Go ahead and replace our link-grabbing code with code that grabs those divs instead.

    divs = soup.findAll( 'div', {'class' : 'post-item'} )

Note the inclusion of the class data in the tag. We can filter on any extra attribute we want by passing it as a dictionary in the second argument to the findAll function. Now we can loop through the divs and pick out the headers to get the title of each article.

    for div in divs:
        header = div.find('h2')
        print( header.text )

Each div object works exactly like the soup object: we can use methods like find and findAll to sort through it and get the tags or information we want. If you go ahead and run the script you can see that the article titles are printed out.
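
As an aside, the attribute dictionary is not the only way to filter: BeautifulSoup also understands CSS selectors through its select() method, so an equivalent one-liner (a minimal sketch) would be:

    # CSS selector equivalent of soup.findAll( 'div', {'class' : 'post-item'} )
    divs = soup.select('div.post-item')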


Now we can start grabbing other information and toss all of this data into a Python dictionary for easier use. I'm going to define an empty list near the beginning of the script:

    posts = []

Now we're going to edit our loop to create a new dictionary that will hold our post data, grab all of the information we want (in this case the title, author, and tags), toss it into that dictionary, and append the dictionary to our posts list. Here is what our new loop looks like:

    for div in divs:
        post = {}
        # The title comes from the <h2> tag inside the post's div.
        header = div.find('h2')
        post['title'] = header.text
        # The author is the link with the 'author-link' class.
        author = div.find('a', {'class':'author-link'})
        post['author'] = author.text
        # The tags are the links with the 'topic-link' class.
        tags = div.findAll('a', {'class':'topic-link'})
        post_tags = []
        for tag in tags:
            post_tags.append(tag.text)
        post['tags'] = post_tags
        posts.append(post)
        print(post)

The output of this prints each post's information as a Python dictionary, which looks a lot like JSON.

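If you want actual JSON rather than printed dictionaries, Python's built-in json module can serialize the posts list we have been building; a minimal sketch (the output file name is just an example):

    import json

    # Write every collected post to disk for later use.
    with open('posts.json', 'w') as f:
        json.dump(posts, f, indent=4)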

This format makes our data very easy to use. If we wanted the title of the post then we just make a call to post['title'].

We need the author?

    post['author']

How about the tags?

    post['tags']

You get the picture. These are the very basics of web scraping: we are just grabbing easy-to-find data from the front page of this blog. The best way to learn web scraping is to experiment with it, see what BeautifulSoup makes available to you, and figure out what you can use each feature for. You can also spice this up by following each link with another requests call and scraping the post itself for whatever information you want.
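
Here is a rough sketch of that idea; it assumes the article link sits inside each post's <h2> tag, which you should confirm against the page source:

    from urllib.parse import urljoin

    for div in divs:
        # Assumption: the post's title in the <h2> tag links to the full article.
        link = div.find('h2').find('a')
        if link is None:
            continue
        # Handle relative links by joining them with the site's base URL.
        url = urljoin('http://blog.vulsec.com', link.get('href'))
        article = requests.get(url)
        article_soup = BeautifulSoup(article.content, 'lxml')
        # The <title> tag is just a stand-in; grab whatever the article page exposes.
        print(article_soup.title.text)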

You can use web scraping to automate tedious web searches for Open Source Intelligence. Open Source Intelligence (OSINT) is data that is collected from publicly available sources. This can include social media accounts, public job listings, publicly available information on a company's site, etc. Web scraping makes it easy to collect all of this information, as we don't have to manually go through multiple websites ourselves.

Hopefully this information will prove useful to readers and will convince them to get into web scraping themselves as it is a great skill to have. Our product Halogen utilizes the power of web scraping in order to gather OSINT data for our clients. I advise you to visit the Vulsec site to learn more.

Thanks for reading!