security_tools/tools/redhat_package_manifest_scraper/README.md

## Step 1:
I used the Python script at https://github.com/x4nth055/pythoncode-tutorials/tree/master/web-scraping/html-table-extractor
to extract all of the tables from a Red Hat documentation URL.

```
# create the data directories (mkdir -p also creates data/ itself)
mkdir -p data/redhat8/security_api_results
mkdir -p data/redhat7/security_api_results
mkdir -p data/redhat6/security_api_results

# run the program to scrape and convert the data to csv
python html_table_extractor.py "https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/package_manifest/index"
[+] Found a total of 9 tables.
[+] Saving table-1
[+] Saving table-2
[+] Saving table-3
[+] Saving table-4
[+] Saving table-5
[+] Saving table-6
[+] Saving table-7
[+] Saving table-8
[+] Saving table-9
```

This creates one CSV file for each table found in the html-single page of the given distro.
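For readers who want to see what "one CSV per table" amounts to, here is a minimal standard-library sketch. It is not the upstream script: `TableCollector`, `extract_tables`, and `save_tables` are hypothetical names, nested tables are not handled, and the column positions in its output may not match the upstream script's.

```python
import csv
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # one entry per <table> seen so far
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # collapse runs of whitespace inside the cell
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables


def save_tables(tables, prefix="table"):
    """Write table-1.csv, table-2.csv, ... into the current directory."""
    for i, rows in enumerate(tables, start=1):
        with open(f"{prefix}-{i}.csv", "w", newline="") as fh:
            csv.writer(fh).writerows(rows)
```

Given the page source as a string, `save_tables(extract_tables(html))` would write the per-table files.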

## Step 2:
To process and de-duplicate the packages further, I created one master CSV file per distro, in that distro's directory, by running the following filter on the command line against the per-table CSV files.

```
cat table-* | cut -d , -f 2 | sort -u > all_redhat7_rpm_package_manifest.csv
```

This step was repeated for Red Hat 8, 7, and 6.
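The same filter can be sketched in Python, which may be useful if any cell contains a quoted comma that `cut` would split on. This is an illustration rather than the method used above: `collect_package_names` and `write_manifest` are hypothetical names, and `column=1` assumes the package name sits in the second field, as in the `cut` command. Like the shell pipeline, it keeps any header cell found in that column.

```python
import csv


def collect_package_names(csv_paths, column=1):
    """Return the sorted, de-duplicated values of one column across CSV files."""
    names = set()
    for path in csv_paths:
        with open(path, newline="") as fh:
            for row in csv.reader(fh):
                # skip rows that are too short or whose cell is blank
                if len(row) > column and row[column].strip():
                    names.add(row[column].strip())
    return sorted(names)


def write_manifest(csv_paths, out_path, column=1):
    """Write one package name per line to out_path."""
    with open(out_path, "w") as fh:
        for name in collect_package_names(csv_paths, column):
            fh.write(name + "\n")
```

For example, `write_manifest(glob.glob("table-*"), "all_redhat7_rpm_package_manifest.csv")` (after `import glob`) would mirror the command above. Note that Python sorts by code point, so the ordering can differ from a locale-aware `sort`.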

## Step 3:
After building the list of base package names for each distro, we can feed those names into queries against the Red Hat Security Data API using the following example loop:

```
cd data/redhat8

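# read one package name per line; -G with --data-urlencode escapes characters such as "+"
while IFS= read -r pkg; do
  curl -sS -G --data-urlencode "package=$pkg" "https://access.redhat.com/hydra/rest/securitydata/cve.json" > "./security_api_results/${pkg}_security_api_results.json"
done < all_redhat8_rpm_package_manifest.csv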
```

This sends one call per package to the security API, which returns the CVEs for that package name in JSON format.
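A rough Python equivalent of the loop is sketched below, using only the standard library. `build_cve_url` and `fetch_cves` are hypothetical names, and the one-second `delay` between requests and the 30-second timeout are assumptions added for politeness, not documented API requirements. The URL builder percent-encodes the package name, which matters for names containing `+`.

```python
import time
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

API_URL = "https://access.redhat.com/hydra/rest/securitydata/cve.json"


def build_cve_url(pkg):
    """Build the query URL for one package, percent-encoding the name."""
    return API_URL + "?" + urllib.parse.urlencode({"package": pkg})


def fetch_cves(manifest_path, out_dir, delay=1.0):
    """Save one JSON response per package name listed in manifest_path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for line in Path(manifest_path).read_text().splitlines():
        pkg = line.strip()
        if not pkg:
            continue
        try:
            with urllib.request.urlopen(build_cve_url(pkg), timeout=30) as resp:
                body = resp.read()
        except urllib.error.URLError as exc:
            # report the failure and move on to the next package
            print(f"{pkg}: request failed: {exc}")
            continue
        (out / f"{pkg}_security_api_results.json").write_bytes(body)
        time.sleep(delay)  # assumed pause between requests
```

Run from `data/redhat8`, a call such as `fetch_cves("all_redhat8_rpm_package_manifest.csv", "security_api_results")` would produce the same file layout as the shell loop.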