Count the number of times a word appears in a BigQuery column

I have a column with some long strings and need to find the most frequently used words in it.

I need something that works like this: https://towardsdatascience.com/very-simple-python-script-for-extracting-most-common-words-from-a-story-1e3570d0b9d0. The word counting part at least...

It is also very important that I have the option to blacklist some words so they don't count.

google-bigquery
2021-11-23 18:33:36

Try the simple approach below:

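-- data and col stand for your own table and column
-- blacklist: words to exclude from the count, in lower case
with blacklist as (
  select 'with' word union all
  select 'that' union all
  select 'add more as you see needed'
)
select lower(word) word, count(*) frequency
-- split each string into runs of word characters
from data, unnest(regexp_extract_all(col, r'\w+')) word
where length(word) > 3  -- skip words of three characters or fewer
and lower(word) not in (select word from blacklist)
group by word
order by frequency desc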

 
2021-11-23 22:40:30

It didn't work... The phrases are in Portuguese; could this be the problem? Or maybe I didn't make the right substitution in your code, I don't know.
Murilo

I tried this: ), blacklist as ( select 'with' word union all select 'that' union all select 'add more as you see needed' ) select lower(word) word, count(*) frequency from T0, unnest(regexp_extract_all(T0.column, r'[\w]*')) word where length(word) > 3 and word not in (select word from blacklist) group by word order by frequency desc
Murilo

Please be more specific: what do you mean by "it didn't work"? Provide an example of the input data, etc.
Mikhail Berlyant

My bad, I receive this message: "This query returned no results".
Murilo

Never mind, I had a mistake in my original query. It works perfectly now, thank you so much.
Murilo

Thank you for confirming. Glad it works for you. Consider also voting up the answer if it helped :o)
Mikhail Berlyant

By the way, I'm looking at the results and the code is cutting words that contain some Brazilian Portuguese letters like "Ç", "ã", "õ". Is there a way to make it handle those? A word like "informação" is counted as "informa".
Murilo

Sure, doable, will check shortly. But in the meantime, check in my other answers how to treat accents, etc. There should be at least a few answers related to that :o)
Mikhail Berlyant
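
A possible fix for the accented letters, offered as a sketch rather than a solution confirmed in this thread: BigQuery regular expressions use the RE2 library, where \w matches only ASCII letters, digits and underscore, which would explain "informação" being cut to "informa". Swapping \w for the Unicode classes \p{L} (letters) and \p{N} (digits) should keep such words whole. The inline data rows, the Portuguese blacklist entries and the token alias below are made up for illustration; replace data and col with your own table and column.

```sql
with data as (
  -- hypothetical sample rows standing in for the real table
  select 'A informação sobre a informação' as col union all
  select 'Mais informação e ação'
), blacklist as (
  -- hypothetical blacklist entries, kept in lower case
  select 'sobre' as word union all
  select 'mais'
)
select lower(token) as word, count(*) as frequency
-- \p{L} matches any Unicode letter, \p{N} any Unicode digit
from data, unnest(regexp_extract_all(col, r'[\p{L}\p{N}_]+')) as token
where length(token) > 3  -- length counts characters for STRING values
  and lower(token) not in (select word from blacklist)
group by word
order by frequency desc
```

The unnested value is named token so that it does not clash with the word alias in the select list, and the blacklist comparison is done on the lower-cased token so that "Mais" and "mais" are both excluded.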
