Web Scraping!
Web scraping, or crawling, is a technique every qualified programmer should master. Its applications cover a wide field, from gathering stock prices to collecting all the information on a website. At heart, the search engines of Google and Baidu are giant crawlers.
Today I will show you how to build a simple scraper in Python, with a demo that scrapes a real website.
Tech stack for web scraping
You are expected to have some basic knowledge of Python, the web, and computer systems.
- Python libraries
  - Requests
  - BeautifulSoup
  - lxml
  - Scrapy
  - …
- Regular expressions
  - To locate the elements containing the content you want
- HTTP requests
  - To fake a browser request to the website
- Ajax
  - Most modern websites are rendered by JavaScript
- Multi-threading
  - To download pages concurrently
Demo
Download the HTML directly
A friend of mine works for FOSUN, and her group wants to analyze certain data from http://125.35.6.80:8181/ftban/fw.jsp. The website looks like this.
First, let's take a look at the HTML of the website, using the browser's Inspect tool.
It took less than a second to locate the element we need. It seems that simply downloading and parsing the HTML will do the trick.
I downloaded the page using BeautifulSoup. What I got is the following garbage. Obviously, the website is rendered by JS, so the HTML shown by Inspect and the HTML downloaded by code are different.
```html
<ul class="gzlist" id="gzlist">
  <li style="margin-left: 468px;background-image: none;"><img height="28" src="http://125.35.6.80:8181/ftban/images/ajax.gif" style="vertical-align: middle;" width="28"/></li>
</ul>
```
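For reference, here is a minimal sketch of that download step. The URL is the one above; everything else is my reconstruction, since the original code isn't shown in the post:

```python
import requests
from bs4 import BeautifulSoup

URL = "http://125.35.6.80:8181/ftban/fw.jsp"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # see "Some more details" below

resp = requests.get(URL, headers=HEADERS, timeout=10)
soup = BeautifulSoup(resp.text, "lxml")

# Prints only the Ajax loading spinner -- the data itself is filled in
# later by JavaScript, so it never appears in the raw HTML.
print(soup.find("ul", id="gzlist"))
```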
Ajax reverse engineering
I clicked to the second page, and the browser console showed the request behind it.
The HTTP request body looked straightforward, so I verified it with Postman.
Bingo! We made it! The website builder is not a clever man.
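Replaying that request from Python is then straightforward with Requests. A hedged sketch, assuming a POST endpoint with form-encoded paging fields; the real endpoint URL and field names must be copied from the request shown in the console, and the ones below are placeholders:

```python
import requests

# Placeholders -- copy the real endpoint and form fields from the console/Postman.
API_URL = "http://125.35.6.80:8181/ftban/<xhr-endpoint>"
payload = {"page": 2, "pageSize": 15}  # hypothetical field names

resp = requests.post(API_URL, data=payload,
                     headers={"User-Agent": "Mozilla/5.0"})
print(resp.json())  # the data the page renders, straight from the API
```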
Some more details
- At first I didn't add a User-Agent to the HTTP request header, so it failed. After adding one (something like 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'), it worked.
- If you scrape too fast, some requests will be declined. I checked whether the response was valid JSON; if not, I let the process sleep for 3 seconds and crawled the page again (see the first sketch after this list).
- The output is expected to be stored in Excel files. An Excel worksheet holds at most 1,048,576 rows, but the whole dataset is about 1.76 million rows, so I had to split the data across more than one file.
- A rough estimate showed that a single process would take about 5,000 minutes, which is far too long. So I ran 3 threads together (see the second sketch after this list). The idea is naive: each thread is responsible for one third of the data.
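A sketch of the retry logic from the second bullet, using the User-Agent from the first. The function name and the retry cap are mine; the 3-second sleep is from the post:

```python
import time
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) "
                  "AppleWebKit/534.50 (KHTML, like Gecko) "
                  "Version/5.1 Safari/534.50"
}

def fetch_page(url, payload, max_retries=10):
    """POST one page of the Ajax API, retrying whenever the response
    isn't valid JSON (which is how the server declines fast scrapers)."""
    for _ in range(max_retries):
        resp = requests.post(url, data=payload, headers=HEADERS)
        try:
            return resp.json()        # only succeeds if the body is JSON
        except ValueError:            # declined: back off, then retry
            time.sleep(3)
    raise RuntimeError(f"page kept failing after {max_retries} retries")
```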
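And a sketch of the naive three-way split from the last bullet, reusing `API_URL` and `fetch_page` from the sketches above. The page count, output paths, and JSON field name are placeholders; note that splitting ~1.76M rows three ways also keeps each output file under Excel's 1,048,576-row sheet limit:

```python
import threading

TOTAL_PAGES = 300  # placeholder -- the real count comes from the API

def crawl_range(first, last, out_path):
    """Crawl pages [first, last) and collect their rows for out_path."""
    rows = []
    for page in range(first, last):
        data = fetch_page(API_URL, {"page": page})  # from the sketches above
        rows.extend(data["list"])  # hypothetical JSON field name
    # ... write `rows` to out_path with e.g. openpyxl or pandas

chunk = TOTAL_PAGES // 3
threads = [
    threading.Thread(
        target=crawl_range,
        args=(i * chunk, TOTAL_PAGES if i == 2 else (i + 1) * chunk,
              f"part{i}.xlsx"),
    )
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```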