Mailx, a Java 8 Email Web Scraper Using Streams, XPath, and HtmlUnit

Mailx is a Java 8 program that crawls a website and reports email-like strings. It uses HtmlUnit, XPath, and regular expressions, and takes advantage of Java 8 features such as streams and lambdas. After processing the command line, Mailx sets up HtmlUnit as a headless browser client and begins crawling the site specified on the command line (either a URI or a page URL).
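
To make the stream-plus-regex idea concrete, here is a minimal sketch that uses only the JDK. It is not Mailx's actual code: the class name EmailSketch, the helper names, and the deliberately loose pattern are all assumptions made for illustration, and the page text is passed in as plain strings rather than fetched with HtmlUnit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration, not taken from the Mailx sources.
final class EmailSketch {

    // Assumed pattern: loose on purpose, it finds "email-like" strings
    // and makes no attempt to validate addresses against RFC 5322.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Collects every match in one piece of text (Java 8 has no Matcher.results()).
    static List<String> matches(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Flattens the matches from all pages into one sorted, de-duplicated list.
    static List<String> emailsIn(Stream<String> pageTexts) {
        return pageTexts
                .flatMap(text -> matches(text).stream())
                .map(String::toLowerCase)
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<String> pages = Stream.of(
                "Contact: Alice@example.com or bob@example.org.",
                "<!-- webmaster: alice@example.com -->");
        emailsIn(pages).forEach(System.out::println);
    }
}
```

Lowercasing before distinct() treats Alice@example.com and alice@example.com as the same hit. That is a design choice of this sketch; a real crawler might prefer to report addresses exactly as they appear on the page.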

Mailx follows not only static links but also dynamic links that carry the AngularJS ng-click attribute, by simulating the click action of a real browser. Because the web server picks the route in a Model-View-Controller architecture, the destination of a dynamic link cannot be known in advance; if the click lands on a page that has already been visited, or on a page outside the website, Mailx simulates a press of the back button. Mailx can also search for emails in HTML comments. To see how this works, run Mailx on http://zedbit.com/wp/techical-corner or, if you do want to see hundreds and hundreds of hits, try http://web.mit.edu/
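
The decision taken after a simulated click can be sketched roughly as follows. This is an illustration under stated assumptions, not Mailx's implementation: the VisitTracker class and its shouldGoBack method are hypothetical names, "same site" is reduced to a case-insensitive host comparison, and the full normalized URI (fragment included, since AngularJS routes often live there) serves as the visited key. The browser itself is left out, so the sketch runs on the JDK alone.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not taken from the Mailx sources.
final class VisitTracker {

    private final String siteHost;
    private final Set<String> visited = new HashSet<>();

    // The start URI is assumed to be absolute, e.g. http://example.com/
    VisitTracker(URI start) {
        this.siteHost = start.getHost();
    }

    // Returns true when the crawler should step back: the page it landed on
    // is outside the site, or it has been seen before. A page that is new
    // and on-site is recorded as visited and false is returned.
    boolean shouldGoBack(URI landed) {
        String host = landed.getHost();
        if (host == null || !host.equalsIgnoreCase(siteHost)) {
            return true; // off-site, or not a hierarchical URI at all
        }
        // Set.add returns false if the key was already present.
        return !visited.add(landed.normalize().toString());
    }

    public static void main(String[] args) {
        VisitTracker tracker = new VisitTracker(URI.create("http://example.com/"));
        System.out.println(tracker.shouldGoBack(URI.create("http://example.com/#/about")));
        System.out.println(tracker.shouldGoBack(URI.create("http://example.com/#/about")));
        System.out.println(tracker.shouldGoBack(URI.create("http://other.example.org/")));
    }
}
```

The check runs only after the click because, as noted above, the target of an ng-click link is not visible in the markup; the crawler has to look at where it actually landed before deciding whether to scan the page or go back.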


Mailx is fully documented, and its code can be found on GitHub, at this link.
