Scrapy!

Web scraping, or crawling, is a technique which every qualified programmer should master. Its applications cover a wide field, from gathering stock prices to collecting all the information on a website. Search engines like Google and Baidu are basically giant crawlers.

Today I will show you how to build a simple scraper in Python, with a demo that scrapes a real website.

Tech stack for web scraping

You are expected to have some basic knowledge of Python, the web, and computer systems. (A toy example combining the pieces below appears right after the list.)

  • Python libraries
    • Requests
    • BeautifulSoup
    • lxml
    • Scrapy
  • Regular expressions
    • To locate the element containing the content you want
  • HTTP requests
    • To forge a request that looks like it comes from a browser
  • AJAX
    • Most modern websites are rendered by JavaScript
  • Multithreading
    • To download pages concurrently
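
To make the list concrete, here is a toy sketch that combines Requests, BeautifulSoup (with lxml as the parser), and a regular expression. The URL and User-Agent string are placeholders, not part of the original post.

import re

import requests
from bs4 import BeautifulSoup

# Pretend to be a normal browser (placeholder User-Agent string).
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper)"}

# Download a page and parse it with lxml as the backend parser.
resp = requests.get("https://example.com", headers=HEADERS, timeout=10)
soup = BeautifulSoup(resp.text, "lxml")

# Locate the element we want, then refine the text with a regular expression.
title = soup.find("title").get_text()
numbers = re.findall(r"\d+", resp.text)  # e.g. pull every number out of the page
print(title, numbers[:5])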

Demo

Downloading the HTML directly

A friend of mine works for FOSUN, and her group wanted to analyze certain data from http://125.35.6.80:8181/ftban/fw.jsp. The website looks like this.

First let’s take a look at the HTML of the website, using the browser’s Inspect (developer tools) panel.

It took less than a second to locate the element we need. It looked like we could simply download the HTML, parse it, and be done.

I downloaded the page and parsed it with BeautifulSoup (a minimal sketch of this follows the snippet below). What I got back was the following garbage. Obviously, the website is rendered by JavaScript, so the HTML shown by Inspect and the HTML downloaded by code are different.

<ul class="gzlist" id="gzlist">
<li style="margin-left: 468px;background-image: none;"><img height="28" src="http://125.35.6.80:8181/ftban/images/ajax.gif" style="vertical-align: middle;" width="28"/></li>
</ul>
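
For reference, a minimal sketch of that first attempt. The post does not show the original code, so the library choices here are my assumptions:

import requests
from bs4 import BeautifulSoup

url = "http://125.35.6.80:8181/ftban/fw.jsp"
resp = requests.get(url, timeout=10)
soup = BeautifulSoup(resp.text, "lxml")

# The list that Inspect showed full of data is empty in the raw HTML:
# only the loading-spinner <li> is there, because the real rows are
# filled in later by JavaScript.
print(soup.find("ul", id="gzlist"))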

AJAX reverse engineering

I clicked on the second page, and the console presented more information.

The HTTP request body looked right, so I verified it with Postman.

Bingo! We made it! The website builder is not a clever man.
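
Replaying that request from Python looks roughly like the sketch below. The endpoint path and form fields are placeholders; the real ones are whatever the console showed for the second-page request.

import requests

# Placeholder endpoint and form fields -- copy the real ones from the
# request that fires when you click page 2.
url = "http://125.35.6.80:8181/ftban/AJAX_ENDPOINT_FROM_CONSOLE"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) "
                  "AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
}
data = {"page": 2, "pageSize": 15}  # placeholder field names

resp = requests.post(url, headers=headers, data=data, timeout=10)
print(resp.json())  # the same data Postman returned, now available in code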

Some more details

  1. At first I didn’t add a User-Agent to the HTTP request headers, so the request failed. After adding one (something like ‘Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50’), it worked.
  2. If you scrape too fast, some requests will be declined. I checked whether the response was valid JSON; if not, I let the process sleep for 3 seconds and then retried the page.
  3. The output is expected to be stored in Excel files. An Excel sheet holds at most about 1.04 million rows, but the whole dataset is about 1.76 million rows, so I had to split the data across more than one file.
  4. By my estimate, a single process would take about 5,000 minutes, which is far too long, so I ran 3 threads together. The idea is naive: each thread is responsible for one third of the data. (These details are combined in the sketch after this list.)
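
Putting those details together, a rough sketch of the whole crawler follows. The endpoint, the JSON layout, the column names, and the total page count are placeholders, and I picked openpyxl for the Excel output, which the post does not specify.

import time
import threading

import requests
from openpyxl import Workbook

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) "
                  "AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
}
URL = "http://125.35.6.80:8181/ftban/AJAX_ENDPOINT_FROM_CONSOLE"  # placeholder

def fetch_page(page):
    """Fetch one page; if the response is not JSON, sleep 3 s and retry."""
    while True:
        resp = requests.post(URL, headers=HEADERS,
                             data={"page": page, "pageSize": 15},  # placeholder fields
                             timeout=10)
        try:
            return resp.json()
        except ValueError:  # the request was declined, response is not JSON
            time.sleep(3)

def crawl_range(first_page, last_page, out_path):
    """Crawl a range of pages and save it to its own Excel file, so that
    no single file exceeds Excel's ~1.04 million row limit."""
    wb = Workbook()
    ws = wb.active
    for page in range(first_page, last_page):
        for row in fetch_page(page).get("list", []):  # placeholder JSON layout
            ws.append([row.get("name"), row.get("licence")])  # placeholder columns
    wb.save(out_path)

# Naive three-way split: each thread owns one third of the pages.
TOTAL_PAGES = 120000  # placeholder
chunk = TOTAL_PAGES // 3
threads = [threading.Thread(target=crawl_range,
                            args=(i * chunk, (i + 1) * chunk, f"part_{i}.xlsx"))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()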