Method and system for eliminating repeated pages from favorite webpages

A technique for eliminating duplicate web pages, applied in the field of Internet information. It addresses the increased time complexity, the unstable punctuation feature strings, and the low judgment accuracy of existing approaches, so as to improve the accuracy and efficiency of extraction, reduce time complexity, and improve the efficiency of the algorithm.

Inactive Publication Date: 2018-11-27
佛山市灏金赢科技有限公司
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0009] For example, the web page deduplication algorithm based on feature words requires relatively complicated feature selection that must take many factors into account. The comparison algorithm for feature words also has a high time complexity: when the number of web pages reaches hundreds of thousands, the pairwise comparison of feature sentences in the set causes the time complexity to rise sharply.
[0010] For example, the web page deduplication algorithm based on punctuation applies only when the text of the web page contains punctuation marks and the content does not change. If the body text changes (for example, the order of the sentences changes), the extracted punctuation feature string changes as well, which leads to misjudgment. Comparing feature strings also has a high time complexity.
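The sensitivity to sentence order described in [0010] can be illustrated with a minimal sketch. The helper `punctuation_feature` and the sample sentences are hypothetical; the source does not specify how the punctuation feature string is built, so this assumes it is simply the punctuation marks of the text in order of appearance.

```python
import string

# Assumption: the feature string is the sequence of punctuation marks in the text.
# ASCII punctuation plus a few common Chinese marks are treated as punctuation.
PUNCTUATION = set(string.punctuation) | set("，。！？；：、")


def punctuation_feature(text: str) -> str:
    """Return the punctuation marks of `text` in order of appearance."""
    return "".join(ch for ch in text if ch in PUNCTUATION)


# Same sentences, different order.
original = "Hello, world. Is it ok? Yes!"
reordered = "Is it ok? Yes! Hello, world."

print(punctuation_feature(original))   # ,.?!
print(punctuation_feature(reordered))  # ?!,.
```

Both texts contain exactly the same sentences, yet their feature strings differ, so a direct string comparison would treat them as different pages.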
[0011] It can be seen that existing schemes compare the body text of the web page. If the body text is not extracted accurately and noise remains in the page, the accuracy of the judgment will be low.
The method based on feature sentences must compare the feature sentences of the web page under judgment against the feature sentences of the whole web page collection. When the collection is large, the time complexity is very high.
The punctuation-based deduplication algorithm has a limited scope of application. When the order of the sentences in the body of a web page changes, the punctuation feature string changes considerably, which lowers accuracy. The pairwise comparison of feature strings also has a high time complexity.


Examples

Example Embodiment

[0046] Figure 1 shows a schematic flowchart of a method for eliminating duplicate webpages from favorite webpages according to an embodiment of the present invention. As shown in the figure, the present invention provides a method for eliminating duplicate webpages from favorite webpages, including the following steps:

[0047] S100: Obtain the favorites folder in which favorite webpages are stored, and obtain the source code of each favorite webpage from the favorites folder;

[0048] S200: Extract at least part of the body content of the webpage according to the source code;

[0049] S300: Calculate the similarity between the at least part of the body content and the corresponding content of the previously favorited webpages;

[0050] S400: When the similarity is greater than or equal to a preset similarity, delete the webpage corresponding to the at least part of the body content.
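Steps S100 to S400 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented method: the favorites folder is modeled as a hypothetical dict mapping URL to HTML source, body extraction is a crude "visible text minus script and style" pass, the similarity measure is Python's `difflib.SequenceMatcher` ratio (the source does not name a measure), and the threshold of 0.9 is an arbitrary stand-in for the preset similarity.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> (crude stand-in for S200)."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_body_text(source: str) -> str:
    """S200: extract (part of) the body content from the page source."""
    parser = BodyTextExtractor()
    parser.feed(source)
    parser.close()
    return " ".join(parser.parts)


def deduplicate(favorites: dict, threshold: float = 0.9) -> dict:
    """Return the favorites that remain after duplicates are dropped."""
    kept_texts = {}
    for url, source in favorites.items():          # S100: page source per favorite
        text = extract_body_text(source)           # S200
        is_duplicate = any(                        # S300: compare with earlier pages
            SequenceMatcher(None, text, previous).ratio() >= threshold
            for previous in kept_texts.values()
        )
        if not is_duplicate:                       # S400: duplicates are not kept
            kept_texts[url] = text
    return {url: favorites[url] for url in kept_texts}


# Hypothetical sample data; the URLs and page contents are made up.
favorites = {
    "https://example.com/a": "<html><body><p>Deduplicating web pages saves storage.</p></body></html>",
    "https://example.com/a?ref=1": "<html><body><script>track()</script><p>Deduplicating web pages saves storage.</p></body></html>",
    "https://example.com/b": "<html><body><p>A completely different article about gardening.</p></body></html>",
}

print(list(deduplicate(favorites)))
```

With this sample data the second entry has the same body text as the first once the script is ignored, so only the first and third URLs remain. Note that this sketch still compares each page against every page kept so far; it illustrates the flow of the steps, not the efficiency gains the patent claims.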

[0051] Extracting at least part of the body content of the webpage according to the source code includes the following steps:

[0


Abstract

The invention provides a method for eliminating duplicate pages from favorite webpages. The method comprises the following steps: a favorites folder used for storing webpages is obtained, and the source code of each favorite webpage is obtained from the favorites folder; at least part of the body content of the webpage is extracted according to the source code; a similarity calculation is performed on the at least part of the body content and the corresponding content of the pre-stored webpages; when the similarity is greater than or equal to a preset similarity, the webpage corresponding to the at least part of the body content is deleted. The advantage of the method is that its webpage body content extraction improves the accuracy and efficiency of extraction, which makes the extraction of body content features more convenient and rapid, improves the efficiency of the algorithm, and reduces the time complexity of the pairwise comparison of feature strings.


Application Information

Owner 佛山市灏金赢科技有限公司