security_tools/tools/redhat_package_manifest_scraper

Step 1:

I used this Python script, https://github.com/x4nth055/pythoncode-tutorials/tree/master/web-scraping/html-table-extractor, to extract all of the tables from a Red Hat documentation URL.

# create the data directories
mkdir data
mkdir -p data/redhat8/security_api_results
mkdir -p data/redhat7/security_api_results
mkdir -p data/redhat6/security_api_results

# run the script to scrape the page and convert each table to CSV
python html_table_extractor.py "https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/package_manifest/index"
[+] Found a total of 9 tables.
[+] Saving table-1
[+] Saving table-2
[+] Saving table-3
[+] Saving table-4
[+] Saving table-5
[+] Saving table-6
[+] Saving table-7
[+] Saving table-8
[+] Saving table-9

This creates one CSV file per table found on the html-single page for the given distro.
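To make the table-to-CSV idea concrete, here is a minimal sketch using only the Python standard library. It is not the linked script, which has its own implementation; the names `TableParser`, `extract_tables`, and `save_tables` are hypothetical, nested tables are not handled, and the column positions in its output may not match those produced by the script above.

```python
import csv
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows
        self._rows = None  # rows of the table currently open, if any
        self._row = None   # cells of the row currently open, if any
        self._cell = None  # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = self._row = self._cell = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def save_tables(tables, prefix="table"):
    """Write each table to <prefix>-<n>.csv, numbered from 1."""
    for i, rows in enumerate(tables, start=1):
        with open(f"{prefix}-{i}.csv", "w", newline="") as f:
            csv.writer(f).writerows(rows)
```

Given the page source as a string `html`, `save_tables(extract_tables(html))` would write `table-1.csv`, `table-2.csv`, and so on into the current directory.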

Step 2:

To de-duplicate the packages, I created one master CSV file per distro, in that distro's directory, by running the following filter on the command line against the per-table CSV files. It takes the second comma-separated field of every line, then sorts and de-duplicates the result.

cat table-* | cut -f 2 -d , | sort -u > all_redhat7_rpm_package_manifest.csv

This step was repeated for Red Hat 8, 7, and 6, changing the output filename each time.
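The same filtering can be sketched in Python, which has the advantage that the `csv` module handles quoted fields containing commas, something `cut -d ,` does not. This is only an illustration: the helper names `unique_column` and `write_manifest` are hypothetical, and `column=1` assumes the package name sits in the second field, as in the `cut -f 2` command above. Unlike the shell pipeline, it skips empty values, and its sort order may differ from locale-aware `sort`.

```python
import csv


def unique_column(paths, column=1):
    """Return the sorted, de-duplicated values of one column across CSV files."""
    values = set()
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                # Skip short rows and empty cells.
                if len(row) > column and row[column].strip():
                    values.add(row[column].strip())
    return sorted(values)


def write_manifest(names, out_path):
    """Write one name per line."""
    with open(out_path, "w") as f:
        for name in names:
            f.write(name + "\n")
```

For example, `write_manifest(unique_column(["table-1.csv", "table-2.csv"]), "all_redhat7_rpm_package_manifest.csv")`, where the input filenames are assumed. As with the shell version, any header cell in the selected column ends up in the manifest unless it is filtered out separately.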

Step 3:

With a list of the base package names for each distro, we can query the Red Hat security API for every package using the following example loop:

cd data/redhat8

# read one package name per line; quote the paths so unusual names do not break the loop
while IFS= read -r pkg; do
  curl "https://access.redhat.com/hydra/rest/securitydata/cve.json?package=$pkg" > "./security_api_results/${pkg}_security_api_results.json"
done < all_redhat8_rpm_package_manifest.csv

This sends one request per package to the security API, asking for that package's CVEs in JSON format, and saves each response to its own file.
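A Python version of the same loop might look like the sketch below. It is an illustration rather than part of this repo. The function names and the `delay` pause between requests are hypothetical additions, and the endpoint is the one used in the curl command above. The one behavioural difference is that the package name is URL-encoded, which matters for names such as `gcc-c++`, because a literal `+` in a query string is normally decoded as a space.

```python
import time
import urllib.parse
import urllib.request
from pathlib import Path

# Endpoint taken from the curl example above.
API_URL = "https://access.redhat.com/hydra/rest/securitydata/cve.json"


def build_query_url(pkg):
    """Build the CVE query URL for one package, URL-encoding its name."""
    return API_URL + "?" + urllib.parse.urlencode({"package": pkg})


def result_path(pkg, out_dir="security_api_results"):
    """Mirror the output filename used by the shell loop."""
    return Path(out_dir) / f"{pkg}_security_api_results.json"


def fetch_all(manifest, out_dir="security_api_results", delay=1.0):
    """Query the API once per non-empty line of the manifest file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for line in Path(manifest).read_text().splitlines():
        pkg = line.strip()
        if not pkg:
            continue
        with urllib.request.urlopen(build_query_url(pkg), timeout=30) as resp:
            result_path(pkg, out_dir).write_bytes(resp.read())
        time.sleep(delay)  # assumed courtesy pause, not an API requirement
```

Run from `data/redhat8`, `fetch_all("all_redhat8_rpm_package_manifest.csv")` would write one JSON file per package into `security_api_results`.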