Richard Liu

Hack Submission Results with Python

Foreword

As of this writing, Luogu has accumulated more than 30 million submissions. This article uses Python to crawl a sample of submission records for later data analysis.

Crawling the page

The URL of Luogu's submission record with ID id is https://www.luogu.com.cn/record/{id}. However, sending a GET request directly with the requests library returns an error, because the page requires a logged-in user. Therefore, the user-agent and cookie are copied from Chrome, so the request looks like it comes from a logged-in Chrome browser.

Fetching a submission record

# Request a single submission record page, pretending to be Chrome
url = f'https://www.luogu.com.cn/record/{id}'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    'cookie': '**********************'  # session cookie copied from a logged-in browser
}
print(url)
r = rq.get(url, headers=headers, timeout=5)
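
Whether the cookie is actually accepted is worth verifying before parsing; if it is missing or expired, the server may not return the expected page. A minimal sanity check (not part of the original code) could look like this:

# Hypothetical early-exit check: make sure the request itself succeeded
if r.status_code != 200:
    raise RuntimeError(f'Unexpected status {r.status_code} for {url}')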

Parsing the returned data

Luogu develops its website with a strict separation of front end and back end, which is very helpful for our crawler: the page data is embedded in the HTML as a URL-encoded JSON string inside a script tag.

Next, we only need to extract the data from that large piece of JavaScript. BeautifulSoup is used here, and the code is as follows:

soup = BeautifulSoup(r.text, 'html.parser')
# The first <script> tag holds the page data as a URL-encoded JSON string
res = soup.script.get_text()
res = unquote(res.split('"')[1])

data = json.loads(res)
data = data['currentData']['record']

So res is the string between the double quotes inside the script tag, which is exactly the data we want to crawl. It is URL-encoded JSON, so after decoding it with unquote(), json.loads() turns it into a dict containing the required information.
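
The exact fields of the record depend on Luogu's front end and are not spelled out here, so before deciding what to store it can help to simply dump the parsed dict and look at it. A small exploratory sketch:

# Exploratory only: list the top-level keys and preview the record as readable JSON
print(list(data.keys()))
print(json.dumps(data, ensure_ascii=False, indent=2)[:500])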

Storing the data

This time we use MongoDB. The initialization is as follows:

# Connect to MongoDB and select the database/collection for the records
myclient = pymongo.MongoClient("mongodb://********:27017/")
mydb = myclient['luogu']
mycol = mydb['records']

The MongoDB server is deployed with Docker, which is very simple: only a single port needs to be exposed.
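
Since the crawler below decides whether to insert a record by looking up its id, an optional extra step (not in the original setup) is to put a unique index on that field, which keeps the lookup fast and lets MongoDB reject accidental duplicates:

# Optional: unique index on 'id' for fast duplicate checks
mycol.create_index('id', unique=True)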

Storing data is also very simple:

# Insert the record only if this id has not been stored yet
if list(mycol.find({'id': id})) == []:
    mycol.insert_one(data)
else:
    print("Already")

To read the data back, just run:

print(list(mycol.find()))
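
Printing the whole collection gets unwieldy once many records are stored; while the crawler is running it may be more convenient to count the documents or peek at a handful of them. This is an optional sketch rather than part of the original script:

# How many records have been stored so far?
print(mycol.count_documents({}))

# Look at the first few documents instead of dumping everything
for doc in mycol.find().limit(5):
    print(doc)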

Multi-threaded crawling

The threading library is used to crawl the pages in multiple threads.

Each worker is responsible for crawling 100 pages, and 100 workers take part in total, covering 10,000 record IDs.

def worker(x):
    print(f'worker {x} start')
    for i in range(x, x+100):
        try:
            getRecord(i)
        except Exception:
            # Skip records that fail to download or parse
            pass
    print(f'worker {x} stop')

if __name__ == '__main__':
    # 100 threads, each handling a block of 100 consecutive record IDs
    pool = [threading.Thread(target=worker, args=[31000000 + x*100]) for x in range(100)]

    for x in pool:
        x.start()
    for x in pool:
        x.join()
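
For comparison, the same fan-out could also be written with concurrent.futures, which manages the threads for you; this is an alternative sketch, not the approach used in the article:

from concurrent.futures import ThreadPoolExecutor

# Alternative: a thread pool schedules the same 100 blocks of record IDs
with ThreadPoolExecutor(max_workers=100) as executor:
    executor.map(worker, [31000000 + x * 100 for x in range(100)])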

At this point, the crawler is complete:

import requests as rq
from urllib.parse import unquote
from bs4 import BeautifulSoup
from re import findall
import json
import pymongo
import threading

myclient = pymongo.MongoClient("mongodb://*******:27017/")
mydb = myclient['luogu']
mycol = mydb['records']

def getRecord(id):
    # Download one submission record and store it in MongoDB
    url = f'https://www.luogu.com.cn/record/{id}'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        'cookie': '**********'
    }
    print(url)
    r = rq.get(url, headers=headers, timeout=5)
    soup = BeautifulSoup(r.text, 'html.parser')
    res = soup.script.get_text()
    res = unquote(res.split('"')[1])

    data = json.loads(res)
    data = data['currentData']['record']
    # print(data)

    if list(mycol.find({'id': id})) == []:
        mycol.insert_one(data)
    else:
        print("Already")

def worker(x):
    print(f'worker {x} start')
    for i in range(x, x+100):
        try:
            getRecord(i)
        except Exception:
            pass
    print(f'worker {x} stop')

if __name__ == '__main__':
    pool = [threading.Thread(target=worker, args=[31000000 + x*100]) for x in range(100)]

    for x in pool:
        x.start()
    for x in pool:
        x.join()
