
Why is Google Natural Language returning an incorrect beginOffset for the analyzed string?

I am using the google-cloud/language API to make an #annotate call and analyze entities and sentiment in a CSV of comments that I have collected from various online sources.

To begin with, the string I am trying to analyze includes comment IDs, so I reformat this:

youtubez22htrtb1ymtdlka404t1aokg2kirffb53u3pya0,i just bot a Nostromo... ( ._.)
youtubez22oet0bruejcdf0gacdp431wxg3vb2zxoiov1da,Good Job Baby! MSI Propeller Blade Technology!
youtubez22ri11akra4tfku3acdp432h1qyzap3yy4ziifc,"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
youtubez23ttpsyolztc1ep004t1aokg5zuyqxfqykgyjqs,"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's. Nice Alienware thing logo thing, btw"
youtubez12zjp5rupbcttvmy220ghf4ctqnerqwa04,"You know, If you actually made this. People would actually buy it."

so that it doesn't include any comment IDs:

i just bot a Nostromo... ( ._.)
Good Job Baby! MSI Propeller Blade Technology!
"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's.   Nice Alienware thing logo thing, btw"
"You know, If you actually made this. People would actually buy it."

After sending a request to google-cloud/language to #annotate the text, I receive a response that includes various substrings with their sentiments and magnitudes. Each substring is also given a beginOffset value, which is meant to be the substring's index in the original string (the string in the request).

{ content: 'i just bot a Nostromo... ( ._.)\nGood Job Baby!',
  beginOffset: 0 }
{ content: 'MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"\n"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's.',
  beginOffset: 50 }
{ content: 'Nice Alienware thing logo thing, btw"\n"You know, If you actually made this.',
  beginOffset: 462 }

My aim is then to locate each substring's original comment in the original string, which should be simple enough: something like originalString[beginOffset].

But the beginOffset value is incorrect! It does not point at the start of the substring in the original string.

I am assuming that the offsets do not count certain characters, but I have tried a multitude of regexes and none of them works reliably. Does anyone have any idea what might be causing the issue?

LastMan0nEarth asked Oct 18 '22 17:10



1 Answer

I know this is an old question, but the problem seems to persist even today. I recently encountered the same issue and resolved it by interpreting Google's offsets as byte offsets in the chosen encoding rather than as character offsets into the string. Works great. I hope it helps someone.
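
To see why the two kinds of offset diverge, here is a small Node.js sketch. The sample string is made up for illustration and is not taken from the question, and the sketch assumes the offsets are counted in UTF-8 bytes, as the C# code in this answer does. Any character outside ASCII then takes more than one byte, so every later byte offset runs ahead of the string index.

```javascript
// Made-up sample: "\u00e9" (e with acute accent) is 1 UTF-16 code unit
// in a JavaScript string but 2 bytes in UTF-8.
const sample = "caf\u00e9!\nNext sentence.";

// Position of "Next" as JavaScript sees it (UTF-16 code units).
const stringIndex = sample.indexOf("Next");

// The same position counted in UTF-8 bytes.
const byteIndex = Buffer.byteLength(sample.slice(0, stringIndex), "utf8");

console.log(stringIndex, byteIndex); // 6 7
```

Using the byte position directly as a string index would land one character too far in this example, and the gap grows with every additional non-ASCII character.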

The following is some C# code, but anybody should be able to interpret it and recode it in their own favorite language. Assuming that text is the text being analyzed, the following code transforms Google's offsets into correct string offsets.

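using System.Text;

// Converts a UTF-8 byte offset reported by Google into an index into the C# string.
int TransformOffset(string text, int offset)
{
   // Decode only the first `offset` bytes; the length of the decoded prefix is the string index.
   return Encoding.UTF8.GetString(
             Encoding.UTF8.GetBytes(text),
             0,
             offset)
          .Length;
}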
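
For Node.js users (the question uses the google-cloud/language package), a rough port of the same idea might look like the sketch below. The function name byteOffsetToStringIndex and the sample string are hypothetical, and the sketch assumes the offsets are UTF-8 byte offsets that fall on character boundaries.

```javascript
// Hypothetical helper: convert a UTF-8 byte offset into a JavaScript
// string index (counted in UTF-16 code units).
function byteOffsetToStringIndex(text, byteOffset) {
  // Encode the whole text as UTF-8, keep the first `byteOffset` bytes,
  // decode them back, and measure the length of the decoded prefix.
  const bytes = Buffer.from(text, "utf8");
  return bytes.subarray(0, byteOffset).toString("utf8").length;
}

// Made-up sample: "\u00e9" takes 2 bytes and "\u2603" takes 3 bytes in UTF-8.
const text = "caf\u00e9 \u2603 snow";
const byteOffset = 10; // UTF-8 byte position where "snow" starts
const index = byteOffsetToStringIndex(text, byteOffset);
console.log(index, text.slice(index)); // 7 snow
```

If an offset falls in the middle of a multi-byte character, the decoded prefix ends in a replacement character, so the result should be treated with care in that case.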
wpfwannabe answered Nov 10 '22 20:11
