Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive select rows where string in column A contains string from column B

I have a dataset that contains some rows with invalid entries in one column.

I'm trying to select all distinct rows from column B and C, but column B has invalid entries. The string in column A contains the correct name for column B, so if I can figure out how to select all rows where str(b) is in str(a), then I should end up with only correct data.

Pseudo code might look something like this:

SELECT DISTINCT b,c FROM some.log WHERE date = 'today' AND str(b) IN str(a)

I know how to use Python to fix the issue, but column A is unique on every line. So if I select distinct rows and include column A, I'm essentially just selecting all rows, which results in a 60gb file.

The python code would look like this:

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

Which would result in something like this:

df

a                         b        c
/bobs/burgers/1234        bobs     idx 
/bobs/burgers/2234        fred     idx
/cindys/sandwhiches/3234  cindy    idx

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

a                         b        c
/bobs/burgers/1234        bobs     idx
/cindys/sandwhiches/3234  cindy    idx

Is it possible to do this filtering with Hive, so that there is no need to download the large files and process with python?

like image 829
r1ty Avatar asked Nov 16 '25 18:11

r1ty


2 Answers

In addition to like and rlike, you can use these functions:

Using instr:

select distinct b,c from some.log where date = current_date and a instr(a,b) > 0;

Using locate:

select distinct b,c from some.log where date = current_date and locate(b, a) > 0;

See this for reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

like image 159
leftjoin Avatar answered Nov 19 '25 09:11

leftjoin


LIKE or RLIKE

select distinct b,c from some.log where date = 'today' and a like concat('%',b,'%');

Or

select distinct b,c from some.log where date = 'today' and a rlike concat('*',b,'*');
like image 31
nobody Avatar answered Nov 19 '25 08:11

nobody



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!