How to use Postgres jsonb_path_query instead of select union

0

db:Postgresql-14. This will be an infrequent transformation, and I'm looking for recommendations / improvements that can be made so I may learn/hone my postgres/json skills (and speed/optimize this very slow query).

We receive variable size/structure json objects from an external api.

Each json object is a survey response. Each nested "question/answer" object can have a quite different structure. In total there are about ~5 known structures.

Response objects are stored in a jsonb column that has a jsonb_ops gin index.

Table has about 500,000 rows. Each row's jsonb column object has about 200 nested values.

Our goal is to extract all the nested question/answer responses into another table of id,question,answer. On the destination table we'll be doing extensive querying with FTS and trigram, and are aiming for schema simplicity. That is why I'm extracting to a simple table instead of doing anything more exotic with jsonb querying. There is also a lot of metadata cruft in those objects that I don't need. So I'm also hoping to save some space by archiving the origin table (it's 5GB + indexes).

Specifically I'd love to learn a more elegant way of traversing and extracting the json to the destination table.

And I've been unable to figure out a way to cast the results to actual sql text instead of quoted jsontext (normally I'd use ->>, ::text, or the _text version of the jsonb function)

This is a very simplified version of the json object to ease just running this.

Thank you in advance!

create table test_survey_processing(
    id integer generated always as identity constraint test_survey_processing_pkey primary key,
    json_data jsonb
);
insert into test_survey_processing (json_data)
values ('{"survey_data": {"2": {"answer": "Option 1", "question": "radiobuttonquesiton"}, "3": {"options": {"10003": {"answer": "Option 1"}, "10004": {"answer": "Option 2"}}, "question": "checkboxquestion"}, "5": {"answer": "Column 2", "question": "Row 1"}, "6": {"answer": "Column 2", "question": "Row 2"}, "7": {"question": "checkboxGRIDquesiton", "subquestions": {"8": {"10007": {"answer": "Column 1", "question": "Row 1 : Column 1"}, "10008": {"answer": "Column 2", "question": "Row 1 : Column 2"}}, "9": {"10007": {"answer": "Column 1", "question": "Row 2 : Column 1"}, "10008": {"answer": "Column 2", "question": "Row 2 : Column 2"}}}}, "11": {"answer": "Option 1", "question": "Row 1"}, "12": {"answer": "Option 2", "question": "Row 2"}, "13": {"options": {"10011": {"answer": "Et molestias est opt", "option": "Option 1"}, "10012": {"answer": "Similique magnam min", "option": "Option 2"}}, "question": "textboxlist"}, "14": {"question": "textboxgridquesiton", "subquestions": {"15": {"10013": {"answer": "Qui error magna omni", "question": "Row 1 : Column 1"}, "10014": {"answer": "Est qui dolore dele", "question": "Row 1 : Column 2"}}, "16": {"10013": {"answer": "vident mol", "question": "Row 2 : Column 1"}, "10014": {"answer": "Consectetur dolor co", "question": "Row 2 : Column 2"}}}}, "17": {"question": "contactformquestion", "subquestions": {"18": {"answer": "Rafael", "question": "First Name"}, "19": {"answer": "Adams", "question": "Last Name"}}}, "33": {"question": "customgroupquestion", "subquestions": {"34": {"answer": "Sed magnam enim non", "question": "customgroupTEXTbox"}, "36": {"answer": "Option 2", "question": "customgroupradiobutton"}, "37": {"options": {"10021": {"answer": "Option 1", "option": "customgroupCHEC KBOX question : Option 1"}, "10022": {"answer": "Option 2", "option": "customgroupCHEC KBOX question : Option 2"}}, "question": "customgroupCHEC KBOX question"}}}, "38": {"question": "customTABLEquestion", "subquestions": {"10001": {"answer": "Option 1", "question": "customTABLEquestioncolumnRADIO"}, "10002": {"answer": "Option 2", "question": "customTABLEquestioncolumnRADIO"}, "10003": {"options": {"10029": {"answer": "OPTION1"}, "10030": {"answer": "OPTION2"}}, "question": "customTABLEquestioncolumnCHECKBOX"}, "10004": {"options": {"10029": {"answer": "OPTION1"}, "10030": {"answer": "OPTION2"}}, "question": "customTABLEquestioncolumnCHECKBOX"}, "10005": {"answer": "Aperiam itaque dolor", "question": "customTABLEquestioncolumnTEXTBOX"}, "10006": {"answer": "Hic qui numquam inci", "question": "customTABLEquestioncolumnTEXTBOX"}}}}}');
create index test_survey_processing_gin_index on test_survey_processing using gin (json_data);

-- the query I'm using (it works, but it is unmanageably slow)

-- EXPLAIN (ANALYZE, VERBOSE, BUFFERS, FORMAT JSON)
select level1.value['question'] question, level1.value['answer'] as answer ,tgsr.json_data['survey_data']
from test_survey_processing tgsr,
     jsonb_each(tgsr.json_data['survey_data']::jsonb) level1
-- where survey_id = 6633968 and id = 4
union
select level1.value['question'] question, jsonb_path_query(level1.value, '$.answer')::jsonb as answer ,tgsr.json_data['survey_data']
from test_survey_processing tgsr,
     jsonb_each(tgsr.json_data['survey_data']::jsonb) level1
-- where survey_id = 6633968 and id = 4
union
select level1.value['question'] question, jsonb_path_query(level1.value, '$.options.*.answer')::jsonb as answer ,tgsr.json_data['survey_data']
from test_survey_processing tgsr,
     jsonb_each(tgsr.json_data['survey_data']::jsonb) level1
-- where survey_id = 6633968 and id = 4
union
select level1.value['question'] question, jsonb_path_query(level1.value, '$.subquestions.*.*.answer')::jsonb as answer ,tgsr.json_data['survey_data']
from test_survey_processing tgsr,
     jsonb_each(tgsr.json_data['survey_data']::jsonb) level1
-- where survey_id = 6633968 and id = 4

FOLLOWUP EDIT AFTER REFINING AND GETTING THE RESULT I NEEDED

This is the query I ended up running. It took 11min to process and insert 34million records. Which is fine as it is one time operation.

A few comments about the changes I made

-I used -> and ->> instead of [subscripting] since I read that even in pg14, subscripting doesn't use indexes (not sure if that matters in the FROM)
-the "to_json(...) #>> '{}'" is how I converted the json string to an unquoted string based on this: stack overflow answer

create table respondent_questions_answers as
select tgsr.id,tgsr.survey_id,level1.value ->> 'question' question, '' as sub_question,
       to_json(jsonb_path_query(level1.value, '$.answer')) #>> '{}' as answer 
from test_survey_processing tgsr, jsonb_each(tgsr.json -> 'survey_data') level1
union
select tgsr.id,tgsr.survey_id,level1.value ->> 'question' question,
       to_json(jsonb_path_query(level1.value, '$.options.*.option')) #>> '{}' as sub_question,
       to_json(jsonb_path_query(level1.value, '$.options.*.answer')) #>> '{}' as answer
from test_survey_processing tgsr, jsonb_each(tgsr.json -> 'survey_data') level1 
union
select tgsr.id,tgsr.survey_id,level1.value ->> 'question' question,
       to_json(jsonb_path_query(level1.value, '$.subquestions.*.*.question')) #>> '{}' as sub_question,
       to_json(jsonb_path_query(level1.value, '$.subquestions.*.*.answer')) #>> '{}' as answer
from test_survey_processing tgsr, jsonb_each(tgsr.json -> 'survey_data') level1
union
select tgsr.id,tgsr.survey_id,level1.value ->> 'question' question,
       to_json(jsonb_path_query(level1.value, '$.subquestions.*.question')) #>> '{}' as sub_question,
       to_json(jsonb_path_query(level1.value, '$.subquestions.*.answer')) #>> '{}' as answer
from test_survey_processing tgsr, jsonb_each(tgsr.json -> 'survey_data') level1;

Final edit after accepting below answer as solution

Thank's to @Edouard H. answer and with a better understanding of how to correctly use jsonb_path_query, I was able to eliminate all the UNION SELECT, discover some values that had been missing, and remove the need for the to_json hack. Even though the CROSS JOIN LATERAL is implicit with json functions, it is better form to include JOIN instead of commas as they are more tightly bound, and easier to read. Below is the final query I used.

SELECT concat_ws(' ',
    qu.value::jsonb->>'question'
,   an.answer::jsonb->>'question'
,   an.answer::jsonb->>'option') AS question
,   an.answer::jsonb->>'answer' AS answer
--      , tgsr.json_data->>'survey_data'
FROM test_survey_processing tgsr
         CROSS JOIN LATERAL jsonb_each(tgsr.json_data->'survey_data') AS qu
         CROSS JOIN LATERAL jsonb_path_query(qu.value::jsonb, '$.** ? (exists(@.answer))') AS an(answer)
json jsonb jsonpath postgresql
2021-11-22 19:30:04
1

0

First idea : remplace the 4 queries with UNION by 1 unique query.

Second idea : the statement level1.value['answer'] as answer in the first query sounds like the statement jsonb_path_query(level1.value, '$.answer')::jsonb as answer in the second query. I think both queries return the same set of rows, and the duplicates are removed by the UNION between both queries.

Third idea : use the jsonb_path_query function in the FROM clause instead of the SELECT clause, using CROSS JOIN LATERAL in order to break down the jsonb data step by step :

SELECT qu.question->>'question' AS question
     , an.answer->>'answer' AS answer
     , tgsr.json_data->>'survey_data'
  FROM test_survey_processing tgsr
 CROSS JOIN LATERAL jsonb_each(tgsr.json_data->'survey_data') AS qu(question)
 CROSS JOIN LATERAL jsonb_path_query(qu.question, '$.** ? (exists(@.answer))') AS an(answer)

-- where survey_id = 6633968 and id = 4

2021-11-24 19:50:54

Thanks for the feedback. - As far as I can tell, I need the union because I'm iterating through all the values of the 4 different structured json objects. - Good catch, I missed that I had somehow duplicated that. - json functions included in the FROM are implicitly "lateral", so it isn't necessary to write it out (AFAIK) - for #3, I couldn't get that to work. [42883] ERROR: function jsonb_path_query(record, unknown) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts.
David

For #3 I've updated the query, and hope this will works this time with no error. Regarding the UNION, I still don't understand why you need it and what do you mean by "4 different structured json objects" ? Are they different columns of the same table, or from different tables ?
Edouard

I had to make a few edits to what you wrote to make it work, but most importantly you led me down the path to a much better solution. You are correct, my lack of understanding about jsonb_path_query meant I was cobbling unions together. To answer your question, I needed values from a few different keys to be concat'd together to one column. As a bonus, I found a few cases where values weren't being captured in my original query. I've edited the original posting with the final solution I used. Thanks again.
David

In other languages

This page is in other languages

Русский
..................................................................................................................
Italiano
..................................................................................................................
Polski
..................................................................................................................
Română
..................................................................................................................
한국어
..................................................................................................................
हिन्दी
..................................................................................................................
Français
..................................................................................................................
Türk
..................................................................................................................
Česk
..................................................................................................................
Português
..................................................................................................................
ไทย
..................................................................................................................
中文
..................................................................................................................
Español
..................................................................................................................
Slovenský
..................................................................................................................