Method and device for training word vector model

A word vector model training technology, applied in the field of deep learning, which solves the problem of low word-vector accuracy and achieves the effect of improved accuracy

Inactive Publication Date: 2019-12-10
BEIJING SAMSUNG TELECOM R&D CENT +1

AI Technical Summary

Benefits of technology

In this patented technology, two different types of co-occurrence data are combined so that training is more accurate than using either type alone. Combining them helps reduce the overfitting that occurs when a model is trained on only one kind of target/context data.

Problems solved by technology

Existing approaches train word vectors from first-order co-occurrence statistics alone. Because a large number of text information pairs are never observed together, the co-occurrence matrix is extremely sparse, statistical information is missing for those pairs, and the accuracy of the resulting word vectors is low.


Examples


Embodiment 1

[0052] The embodiment of the present application provides a method for training a word vector model, as shown in Figure 1, including:

[0053] Step S110: Obtain statistical information of the text information in the text library, where the text information includes target text and context text. Specifically, the text information in the text library may be words, phrases, or n-grams in natural language, which is not limited in this application. The statistical information is first-order co-occurrence information, and may specifically be a co-occurrence matrix.

[0054] Further, the number of words in the dictionary can be 50,000, 100,000, 200,000, etc. When the dictionary contains 100,000 words, these 100,000 words form a 100,000 × 100,000 two-dimensional matrix, where the value at row i and column j represents the co-occurrence statistic of vocabulary i and vocabulary j, that is, the number of times vocabulary i and vocabulary j co-occur in the text library.
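As an illustration only (the patent text gives no code), the first-order co-occurrence statistics of step S110 and paragraph [0054] could be collected as in the following sketch; the window size, the toy corpus, and the function name are all assumptions:

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """Count how often each ordered pair of words appears within
    `window` positions of each other (first-order co-occurrence)."""
    counts = defaultdict(int)
    for sentence in corpus:
        for i, target in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    counts[(target, sentence[j])] += 1
    return counts

corpus = [["deep", "learning", "trains", "word", "vectors"],
          ["word", "vectors", "encode", "word", "meaning"]]
counts = cooccurrence_counts(corpus)
print(counts[("word", "vectors")])  # → 3
```

In a full implementation, these counts would be written into the dictionary-sized two-dimensional matrix described above, with the count for vocabulary pair (i, j) stored at row i, column j.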

Embodiment 2

[0063] The embodiment of the present application provides another possible implementation. On the basis of the first embodiment, the second embodiment further includes the following, wherein,

[0064] Step S130 may include step S1301 (step not marked in the figure): using the statistical information and the pointwise mutual information distribution overlap information as training data to train the preset word vector determination model based on the target loss function.

[0065] At this time, joint training is performed: according to the first-order co-occurrence information, a first loss amount is obtained based on the first loss function; according to the second-order co-occurrence information, a second loss amount is obtained based on the second loss function; and the word vector model is trained according to the first loss amount and the second loss amount.
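A minimal sketch of such joint training is given below, assuming a GloVe-style weighted least-squares form for the first loss and a simple squared-error form for the second loss. The patent excerpt does not specify either loss function, so the function names, the weighting function, and the mixing weight `lam` are all assumptions:

```python
import numpy as np

def first_loss(W, C, b, bc, X):
    """Assumed first loss: GloVe-style weighted least squares on the
    first-order co-occurrence matrix X (targets W, contexts C, biases b, bc)."""
    mask = X > 0
    f = np.minimum((X / 100.0) ** 0.75, 1.0)  # co-occurrence weighting
    err = W @ C.T + b[:, None] + bc[None, :] - np.log(np.where(mask, X, 1.0))
    return np.sum(f * mask * err ** 2)

def second_loss(W, S):
    """Assumed second loss: word-vector similarities should match the
    second-order (target-target) association scores S."""
    sim = W @ W.T
    return np.sum((sim - S) ** 2)

def joint_loss(W, C, b, bc, X, S, lam=0.5):
    """Joint training objective: weighted sum of the two loss amounts."""
    return first_loss(W, C, b, bc, X) + lam * second_loss(W, S)
```

In practice the parameters would be updated by gradient descent on `joint_loss`; the sketch only shows how the two loss amounts combine into one objective.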

[0066] Step S111 (not marked in the figure) is also included before step S130: determining the target loss function.

Embodiment 3

[0094] The embodiment of the present application provides another possible implementation. On the basis of the second embodiment, the third embodiment further includes the following, wherein,

[0095] Step S120 may include step S1201, step S1202, step S1203, and step S1204 (none marked in the figure), wherein,

[0096] Step S1201: Determine the context text sets corresponding to each target text;

[0097] Step S1202: According to each context text set, determine the intersection between any two context text sets;

[0098] Step S1203: According to the determined intersection between any two context text sets, determine the pointwise mutual information between each target text and the context texts in the intersection;

[0099] Step S1204: According to the determined pointwise mutual information between each target text and the intersection, determine the distribution overlap information, and obtain the second-order co-occurrence information between the target texts.
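Steps S1201 to S1204 can be sketched as follows, assuming each target text is represented by a PMI profile over its context texts. The aggregation rule (summing the smaller PMI value on each shared context) is an assumption; the patent excerpt does not state how the overlap is computed:

```python
import math

def pmi_profile(ctx_counts, target_total, context_totals, grand_total):
    """S1201/S1203: pointwise mutual information of one target text with
    each of its context texts, from raw co-occurrence counts."""
    return {c: math.log(n * grand_total / (target_total * context_totals[c]))
            for c, n in ctx_counts.items() if n > 0}

def distribution_overlap(profile_a, profile_b):
    """S1202/S1204: overlap of two targets' PMI distributions over the
    intersection of their context text sets (assumed aggregation: sum of
    the smaller PMI on each shared context)."""
    shared = profile_a.keys() & profile_b.keys()  # S1202: intersection
    return sum(min(profile_a[c], profile_b[c]) for c in shared)
```

Targets that share many strongly associated contexts thus receive a high overlap score, which serves as the second-order association signal used in the joint training of the second embodiment.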


Abstract

The invention relates to the field of deep learning, and discloses a method and device for training a word vector model, and the method for training the word vector model comprises the steps: obtaining first information which is used for reflecting the correlation degree between a target text and a context text; obtaining second information, wherein the second information is used for reflecting the association degree between the target texts; and training a word vector model according to the first information and the second information to obtain a word vector of the target text. According to the method provided by the embodiment of the invention, missing statistical information of a large number of unobserved text information pairs is made up, the problem that a co-occurrence matrix is extremely sparse is alleviated, and the accuracy of the word vector determined by the word vector model is effectively improved.


Application Information

Owner BEIJING SAMSUNG TELECOM R&D CENT