## Step 1:
I used this Python script https://github.com/x4nth055/pythoncode-tutorials/tree/master/web-scraping/html-table-extractor to extract all of the tables from a Red Hat documentation URL.

```
# make some data directories
mkdir data
mkdir -p data/redhat8/security_api_results
mkdir -p data/redhat7/security_api_results
mkdir -p data/redhat6/security_api_results

# run the program to scrape the page and convert each table to CSV
python html_table_extractor.py "https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/package_manifest/index"
[+] Found a total of 9 tables.
[+] Saving table-1
[+] Saving table-2
[+] Saving table-3
[+] Saving table-4
[+] Saving table-5
[+] Saving table-6
[+] Saving table-7
[+] Saving table-8
[+] Saving table-9
```

This creates one CSV file per table found in the html-single page for the given distro.

## Step 2:
To process and de-duplicate the packages further, I created one master CSV file in each distro's directory by running the following filter on the command line against that distro's table CSV files.

```
# take the second column (package name) of every table, then sort and de-duplicate
cat table-* | cut -f 2 -d , | sort -u > all_redhat7_rpm_package_manifest.csv
```

This step was repeated for Red Hat 8, 7, and 6.

## Step 3:
After creating the list of base package names for each distro, we can feed these packages into queries against the Red Hat security API using the following example loop:

```
cd data/redhat8
# query the security API once per package name and save one JSON file per package
while IFS= read -r pkg; do
  [ -n "$pkg" ] || continue  # skip blank lines
  curl "https://access.redhat.com/hydra/rest/securitydata/cve.json?package=$pkg" > "./security_api_results/${pkg}_security_api_results.json"
done < all_redhat8_rpm_package_manifest.csv
```

This sends one request per package to the security API, asking for the CVEs associated with that package name in JSON format.
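
The loop in Step 3 places each package name directly into the query string, so a name containing `+` (such as `gcc-c++`) may be decoded by the server as a space. As a rough sketch, not part of the original workflow, the Python below shows one way to build a URL-encoded query and then count the entries in each saved result file. It assumes each response is a JSON array with one object per CVE; `build_query_url` and `count_cves` are hypothetical helper names introduced here for illustration.

```python
import json
from pathlib import Path
from urllib.parse import urlencode

# Endpoint taken from the loop in Step 3.
API_URL = "https://access.redhat.com/hydra/rest/securitydata/cve.json"
# File name suffix used by the loop in Step 3.
SUFFIX = "_security_api_results.json"


def build_query_url(pkg):
    """Return the query URL for one package, with the name URL-encoded."""
    return f"{API_URL}?{urlencode({'package': pkg})}"


def count_cves(results_dir):
    """Map each package name to the number of entries in its result file.

    Assumption: a valid response is a JSON array with one object per CVE.
    Files that are empty, not valid JSON, or not an array map to None.
    """
    counts = {}
    for path in sorted(Path(results_dir).glob(f"*{SUFFIX}")):
        pkg = path.name[: -len(SUFFIX)]
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            counts[pkg] = None
            continue
        counts[pkg] = len(data) if isinstance(data, list) else None
    return counts


if __name__ == "__main__":
    print(build_query_url("gcc-c++"))
    for pkg, n in count_cves("data/redhat8/security_api_results").items():
        print(pkg, n)
```

A `None` count flags a file worth re-fetching, for example when a request failed and left an empty or non-JSON response on disk.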