High-anonymity crawler system with multi-threaded intelligent scheduling

A crawler system with intelligent scheduling, applied in the field of computer networks. It addresses problems such as blocked requests, blocked accounts, and website access restrictions, and achieves fast, efficient aggregation of web page information and improved crawling efficiency.

Inactive Publication Date: 2019-03-22
NANJING UNIV OF POSTS & TELECOMM
AI Technical Summary

Problems solved by technology

Traditional web crawling methods are often blocked when the target website applies an "anti-crawling" strategy. This is especially true for websites that require login, such as GitHub and Weibo: some pages and interfaces can be accessed without logging in, but crawling directly without logging in has drawbacks.

Examples

Embodiment Construction

[0036] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0037] As shown in Figure 1, the present invention discloses a high-anonymity crawler system with multi-threaded intelligent scheduling. It is used for efficient crawling when the target website applies an "anti-crawling" strategy, improving crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently into a large retrieval database.

[0038] The high-anonymity crawler system with multi-threaded intelligent scheduling mainly includes the following six modules: a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multi-thread crawler module, a task queue generation module, and a background management module.
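
One way the modules listed in [0038] might fit together is sketched below. This is an illustration under stated assumptions, not the patented implementation: the class and function names (`Identity`, `ResourceScheduler`, `crawl`, `fake_fetch`) are hypothetical, the round-robin allocation policy is an assumption (the text does not state how resources are scheduled), the background management module is not modeled, and `fake_fetch` stands in for a real HTTP request so that the sketch runs without network access.

```python
import itertools
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """A proxy/cookie pair handed to one request (hypothetical type)."""
    proxy: str    # entry from the proxy IP pool
    cookies: str  # entry from the Cookies pool


class ResourceScheduler:
    """Resource scheduling module: hands out pool entries round-robin.

    Round-robin is an assumed policy; the source does not specify one.
    """

    def __init__(self, proxies, cookie_jars):
        self._lock = threading.Lock()
        self._proxies = itertools.cycle(proxies)
        self._cookies = itertools.cycle(cookie_jars)

    def acquire(self):
        # The lock keeps the two cycles in step when many threads call this.
        with self._lock:
            return Identity(next(self._proxies), next(self._cookies))


def crawl(urls, scheduler, fetch, num_threads=4):
    """Task queue generation plus multi-thread crawler, in miniature."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            identity = scheduler.acquire()
            page = fetch(url, identity)
            with results_lock:
                results[url] = page

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url, identity):
    """Stand-in for a real HTTP request; returns a description string."""
    return f"{url} via {identity.proxy} as {identity.cookies}"


if __name__ == "__main__":
    scheduler = ResourceScheduler(
        ["10.0.0.1:8080", "10.0.0.2:8080"],       # placeholder proxies
        ["session=a", "session=b", "session=c"],  # placeholder cookies
    )
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    pages = crawl(urls, scheduler, fake_fetch)
    print(len(pages), "pages fetched")
```

Passing `fetch` as a parameter keeps the scheduling logic separate from the network layer, so a real request function could be substituted without changing the scheduler or the worker loop.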

[0

Abstract

The invention provides a high-anonymity crawler system with multi-threaded intelligent scheduling. It mainly consists of six modules: a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multi-thread crawler module, a task queue generation module, and a background management module. These modules are connected and cooperate with each other to improve crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently into a large retrieval database.
