db:Postgresql-14. This will be an infrequent transformation, and I'm looking for recommendations / improvements that can be made so I may learn/hone my postgres/json skills (and speed/optimize this very slow query).
We receive variable size/structure json objects from an external api.
Each json object is a survey response. Each nested "question/answer" object can have a quite different structure. In total there are about ~5 known structures.
Response objects are stored in a jsonb column that has a jsonb_ops gin index.
Table has about 500,000 rows. Each row's jsonb column object has about 200 nested values.
Our goal is to extract all the nested question/answer responses into another table of id,question,answer. On the destination table we'll be doing extensive querying with FTS and trigram, and are aiming for schema simplicity. That is why I'm extracting to a simple table instead of doing anything more exotic with jsonb querying. There is also a lot of metadata cruft in those objects that I don't need. So I'm also hoping to save some space by archiving the origin table (it's 5GB + indexes).
Specifically I'd love to learn a more elegant way of traversing and extracting the json to the destination table.
And I've been unable to figure out a way to cast the results to actual sql text instead of quoted jsontext (normally I'd use ->>, ::text, or the _text version of the jsonb function)
This is a very simplified version of the json object to ease just running this.
Thank you in advance!
create table test_survey_processing(
id integer generated always as identity constraint test_survey_processing_pkey primary key,
json_data jsonb
);
insert into test_survey_processing (json_data)
values ('{"survey_data": {"2": {"answer": "Option 1", "question": "radiobuttonquesiton"}, "3": {"options": {"10003": {"answer": "Option 1"}, "10004": {"answer": "Option 2"}}, "question": "checkboxquestion"}, "5": {"answer": "Column 2", "question": "Row 1"}, "6": {"answer": "Column 2", "question": "Row 2"}, "7": {"question": "checkboxGRIDquesiton", "subquestions": {"8": {"10007": {"answer": "Column 1", "question": "Row 1 : Column 1"}, "10008": {"answer": "Column 2", "question": "Row 1 : Column 2"}}, "9": {"10007": {"answer": "Column 1", "question": "Row 2 : Column 1"}, "10008": {"answer": "Column 2", "question": "Row 2 : Column 2"}}}}, "11": {"answer": "Option 1", "question": "Row 1"}, "12": {"answer": "Option 2", "question": "Row 2"}, "13": {"options": {"10011": {"answer": "Et molestias est opt", "option": "Option 1"}, "10012": {"answer": "Similique magnam min", "option": "Option 2"}}, "question": "textboxlist"}, "14": {"question": "textboxgridquesiton", "subquestions": {"15": {"10013": {"answer": "Qui error magna omni", "question": "Row 1 : Column 1"}, "10014": {"answer": "Est qui dolore dele", "question": "Row 1 : Column 2"}}, "16": {"10013": {"answer": "vident mol", "question": "Row 2 : Column 1"}, "10014": {"answer": "Consectetur dolor co", "question": "Row 2 : Column 2"}}}}, "17": {"question": "contactformquestion", "subquestions": {"18": {"answer": "Rafael", "question": "First Name"}, "19": {"answer": "Adams", "question": "Last Name"}}}, "33": {"question": "customgroupquestion", "subquestions": {"34": {"answer": "Sed magnam enim non", "question": "customgroupTEXTbox"}, "36": {"answer": "Option 2", "question": "customgroupradiobutton"}, "37": {"options": {"10021": {"answer": "Option 1", "option": "customgroupCHEC KBOX question : Option 1"}, "10022": {"answer": "Option 2", "option": "customgroupCHEC KBOX question : Option 2"}}, "question": "customgroupCHEC KBOX question"}}}, "38": {"question": "customTABLEquestion", "subquestions": {"10001": {"answer": "Option 1", "question": "customTABLEquestioncolumnRADIO"}, "10002": {"answer": "Option 2", "question": "customTABLEquestioncolumnRADIO"}, "10003": {"options": {"10029": {"answer": "OPTION1"}, "10030": {"answer": "OPTION2"}}, "question": "customTABLEquestioncolumnCHECKBOX"}, "10004": {"options": {"10029": {"answer": "OPTION1"}, "10030": {"answer": "OPTION2"}}, "question": "customTABLEquestioncolumnCHECKBOX"}, "10005": {"answer": "Aperiam itaque dolor", "question": "customTABLEquestioncolumnTEXTBOX"}, "10006": {"answer": "Hic qui numquam inci", "question": "customTABLEquestioncolumnTEXTBOX"}}}}}');
create index test_survey_processing_gin_index on test_survey_processing using gin (json_data);
-- the query I'm using (it works, but it is unmanageably slow)
-- EXPLAIN (ANALYZE, VERBOSE, BUFFERS, FORMAT JSON)
select level1.value['question'] question, level1.value['answer'] as answer ,tgsr.json_data['survey_data']
from test_survey_processing tgsr,
jsonb_each(tgsr.json_data['survey_data']::jsonb) level1
-- where survey_id = 6633968 and id = 4
union
select level1.value['question'] question, jsonb_path_query(level1.value, '$.answer')::jsonb as answer ,tgsr.json_data['survey_data']
from test_survey_processing tgsr,
jsonb_each(tgsr.json_data['survey_data']::jsonb) level1
-- where survey_id = 6633968 and id = 4
union
select level1.value['question'] question, jsonb_path_query(level1.value, '$.options.*.answer')::jsonb as answer ,tgsr.json_data['survey_data']
from test_survey_processing tgsr,
jsonb_each(tgsr.json_data['survey_data']::jsonb) level1
-- where survey_id = 6633968 and id = 4
union
select level1.value['question'] question, jsonb_path_query(level1.value, '$.subquestions.*.*.answer')::jsonb as answer ,tgsr.json_data['survey_data']
from test_survey_processing tgsr,
jsonb_each(tgsr.json_data['survey_data']::jsonb) level1
-- where survey_id = 6633968 and id = 4
FOLLOWUP EDIT AFTER REFINING AND GETTING THE RESULT I NEEDED
This is the query I ended up running. It took 11min to process and insert 34million records. Which is fine as it is one time operation.
A few comments about the changes I made
-I used -> and ->> instead of [subscripting] since I read that even in pg14, subscripting doesn't use indexes (not sure if that matters in the FROM)
-the "to_json(...) #>> '{}'"
is how I converted the json string to an unquoted string based on this: stack overflow answer
create table respondent_questions_answers as
select tgsr.id,tgsr.survey_id,level1.value ->> 'question' question, '' as sub_question,
to_json(jsonb_path_query(level1.value, '$.answer')) #>> '{}' as answer
from test_survey_processing tgsr, jsonb_each(tgsr.json -> 'survey_data') level1
union
select tgsr.id,tgsr.survey_id,level1.value ->> 'question' question,
to_json(jsonb_path_query(level1.value, '$.options.*.option')) #>> '{}' as sub_question,
to_json(jsonb_path_query(level1.value, '$.options.*.answer')) #>> '{}' as answer
from test_survey_processing tgsr, jsonb_each(tgsr.json -> 'survey_data') level1
union
select tgsr.id,tgsr.survey_id,level1.value ->> 'question' question,
to_json(jsonb_path_query(level1.value, '$.subquestions.*.*.question')) #>> '{}' as sub_question,
to_json(jsonb_path_query(level1.value, '$.subquestions.*.*.answer')) #>> '{}' as answer
from test_survey_processing tgsr, jsonb_each(tgsr.json -> 'survey_data') level1
union
select tgsr.id,tgsr.survey_id,level1.value ->> 'question' question,
to_json(jsonb_path_query(level1.value, '$.subquestions.*.question')) #>> '{}' as sub_question,
to_json(jsonb_path_query(level1.value, '$.subquestions.*.answer')) #>> '{}' as answer
from test_survey_processing tgsr, jsonb_each(tgsr.json -> 'survey_data') level1;
Final edit after accepting below answer as solution
Thank's to @Edouard H. answer and with a better understanding of how to correctly use jsonb_path_query, I was able to eliminate all the UNION SELECT
, discover some values that had been missing, and remove the need for the to_json hack. Even though the CROSS JOIN LATERAL
is implicit with json functions, it is better form to include JOIN
instead of commas as they are more tightly bound, and easier to read. Below is the final query I used.
SELECT concat_ws(' ',
qu.value::jsonb->>'question'
, an.answer::jsonb->>'question'
, an.answer::jsonb->>'option') AS question
, an.answer::jsonb->>'answer' AS answer
-- , tgsr.json_data->>'survey_data'
FROM test_survey_processing tgsr
CROSS JOIN LATERAL jsonb_each(tgsr.json_data->'survey_data') AS qu
CROSS JOIN LATERAL jsonb_path_query(qu.value::jsonb, '$.** ? (exists(@.answer))') AS an(answer)