Scrapy!

Web scraping, or crawling, is a technique which every qualified programmer should master. Its applications cover a wide field, from gathering stock prices to collecting all the information on a website. Search engines like Google and Baidu are basically giant crawlers.

Today I will show you how to build a simple scraper in Python, with a demo that scrapes a real website.

Tech stack for web scraping

You are expected to have some basic knowledge of Python, the web, and computer systems. (A toy example combining the pieces below appears right after the list.)

  • Python libraries
    • Requests
    • BeautifulSoup
    • lxml
    • Scrapy
  • Regular expressions
    • To locate the element containing the content you want
  • HTTP requests
    • To forge a request that looks like it comes from a browser
  • AJAX
    • Most modern websites are rendered by JavaScript
  • Multithreading
    • To download pages concurrently
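
To make the list concrete, here is a toy sketch that combines Requests, BeautifulSoup (with lxml as the parser), and a regular expression. The URL and User-Agent string are placeholders, not part of the original post.

import re

import requests
from bs4 import BeautifulSoup

# Pretend to be a normal browser (placeholder User-Agent string).
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper)"}

# Download a page and parse it with lxml as the backend parser.
resp = requests.get("https://example.com", headers=HEADERS, timeout=10)
soup = BeautifulSoup(resp.text, "lxml")

# Locate the element we want, then refine the text with a regular expression.
title = soup.find("title").get_text()
numbers = re.findall(r"\d+", resp.text)  # e.g. pull every number out of the page
print(title, numbers[:5])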

Demo

Downloading the HTML directly

A friend of mine works for FOSUN, and her group wanted to analyze certain data from http://125.35.6.80:8181/ftban/fw.jsp. The website looks like this.

First let’s take a look at the HTML of the website, using the browser’s Inspect (developer tools) panel.

It took less than a second to locate the element we need. It looked like we could simply download the HTML, parse it, and be done.

I downloaded the page and parsed it with BeautifulSoup (a minimal sketch of this follows the snippet below). What I got back was the following garbage. Obviously, the website is rendered by JavaScript, so the HTML shown by Inspect and the HTML downloaded by code are different.

<ul class="gzlist" id="gzlist">
<li style="margin-left: 468px;background-image: none;"><img height="28" src="http://125.35.6.80:8181/ftban/images/ajax.gif" style="vertical-align: middle;" width="28"/></li>
</ul>
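
For reference, a minimal sketch of that first attempt. The post does not show the original code, so the library choices here are my assumptions:

import requests
from bs4 import BeautifulSoup

url = "http://125.35.6.80:8181/ftban/fw.jsp"
resp = requests.get(url, timeout=10)
soup = BeautifulSoup(resp.text, "lxml")

# The list that Inspect showed full of data is empty in the raw HTML:
# only the loading-spinner <li> is there, because the real rows are
# filled in later by JavaScript.
print(soup.find("ul", id="gzlist"))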

AJAX reverse engineering

I clicked on the second page, and the console presented more information.

The HTTP request body looked right, so I verified it with Postman.

Bingo! We made it! The website builder is not a clever man.
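
Replaying that request from Python looks roughly like the sketch below. The endpoint path and form fields are placeholders; the real ones are whatever the console showed for the second-page request.

import requests

# Placeholder endpoint and form fields -- copy the real ones from the
# request that fires when you click page 2.
url = "http://125.35.6.80:8181/ftban/AJAX_ENDPOINT_FROM_CONSOLE"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) "
                  "AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
}
data = {"page": 2, "pageSize": 15}  # placeholder field names

resp = requests.post(url, headers=headers, data=data, timeout=10)
print(resp.json())  # the same data Postman returned, now available in code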

Some more details

  1. At first I didn’t add a User-Agent to the HTTP request headers, so the request failed. After adding one (something like ‘Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50’), it worked.
  2. If you scrape too fast, some requests will be declined. I checked whether the response was valid JSON; if not, I let the process sleep for 3 seconds and then retried the page.
  3. The output is expected to be stored in Excel files. An Excel sheet holds at most about 1.04 million rows, but the whole dataset is about 1.76 million rows, so I had to split the data across more than one file.
  4. By my estimate, a single process would take about 5,000 minutes, which is far too long, so I ran 3 threads together. The idea is naive: each thread is responsible for one third of the data. (These details are combined in the sketch after this list.)
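
Putting those details together, a rough sketch of the whole crawler follows. The endpoint, the JSON layout, the column names, and the total page count are placeholders, and I picked openpyxl for the Excel output, which the post does not specify.

import time
import threading

import requests
from openpyxl import Workbook

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) "
                  "AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
}
URL = "http://125.35.6.80:8181/ftban/AJAX_ENDPOINT_FROM_CONSOLE"  # placeholder

def fetch_page(page):
    """Fetch one page; if the response is not JSON, sleep 3 s and retry."""
    while True:
        resp = requests.post(URL, headers=HEADERS,
                             data={"page": page, "pageSize": 15},  # placeholder fields
                             timeout=10)
        try:
            return resp.json()
        except ValueError:  # the request was declined, response is not JSON
            time.sleep(3)

def crawl_range(first_page, last_page, out_path):
    """Crawl a range of pages and save it to its own Excel file, so that
    no single file exceeds Excel's ~1.04 million row limit."""
    wb = Workbook()
    ws = wb.active
    for page in range(first_page, last_page):
        for row in fetch_page(page).get("list", []):  # placeholder JSON layout
            ws.append([row.get("name"), row.get("licence")])  # placeholder columns
    wb.save(out_path)

# Naive three-way split: each thread owns one third of the pages.
TOTAL_PAGES = 120000  # placeholder
chunk = TOTAL_PAGES // 3
threads = [threading.Thread(target=crawl_range,
                            args=(i * chunk, (i + 1) * chunk, f"part_{i}.xlsx"))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()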