How to Extract Emails that are displayed by clicking the "Reply" button (Example: Craigslist)
Friday, April 29, 2016 4:46 AM
What if you need to extract data that is shown only after you click a button on a web page?
When the button provides no data, the place of the button is taken by some text. In that case, we move on to the next page/URL and continue extracting other data.
This document shows you how to configure a rule that extracts all the emails that are displayed only after clicking a button on the web page.
Let’s take an example. Assume that we are job seekers and want to extract the email addresses of all the hiring managers who post jobs on Craigslist.
The example link, which lists many job URLs: http://flagstaff.craigslist.org/search/jjj.
First of all, enter the task name and save your task to a category. Then click “Next” to go to the second step.
Open the example link in the built-in browser. To extract emails from all the web pages, we configure a pagination action. Wait until the page has loaded, then scroll down to the bottom of the page. Click on the “Next” page link and select “Loop click the element”.
We need to create a list of URLs to get to the detail pages.
Click on the first URL > Create a list of items > Select Add current item to the list > Select Continue to edit the list.
Click on the second one > Add current item to the list > Finish creating list > Loop.
Now we’re on the detail page. Next, extract the job title and the reply email.
Some web pages don’t have the reply button, and the location of the button is replaced with the text “reply below”.
In this case, we need to create a branch judgment with one branch for the web page with a reply button and the other for the web page without a reply button.
Drag the “If-else” action to the workflow designer. There are two branches inside the “Branch Judgment”.
Click the Branch Judgment box. We will enter the XPath of the button in the “wait until the element appears” bar.
We need the path expression (XPath) of the button so that Octoparse can locate it. Open one of the URLs that has the button in a web browser and use the “inspect element” function to find the XPath of the button. Copy it.
Back in Octoparse, click the Branch Judgment box, then paste the XPath of the button in the “wait until the element appears” bar. Click save.
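To make the idea behind the branch condition concrete, here is a minimal Python sketch of checking whether an XPath matches anything on a page. It is not how Octoparse works internally. The two page snippets, the class name reply_button, and the XPath itself are assumptions made up for illustration; the real Craigslist markup and the XPath you copy from “inspect element” will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified detail pages; real Craigslist markup will differ.
PAGE_WITH_BUTTON = """
<html><body>
  <h2 class="postingtitle">Barista wanted</h2>
  <button class="reply_button">reply</button>
</body></html>
"""

PAGE_WITHOUT_BUTTON = """
<html><body>
  <h2 class="postingtitle">Line cook</h2>
  <span class="replytext">reply below</span>
</body></html>
"""

# Assumed XPath for the reply button.
# Note: ElementTree supports only a limited subset of XPath.
REPLY_BUTTON_XPATH = ".//button[@class='reply_button']"


def has_reply_button(page_source: str) -> bool:
    """Return True when the XPath matches at least one element on the page."""
    root = ET.fromstring(page_source.strip())
    return root.find(REPLY_BUTTON_XPATH) is not None


def choose_branch(page_source: str) -> str:
    """Mimic the two-branch decision: click the button, or skip the page."""
    if has_reply_button(page_source):
        return "click reply button"
    return "skip to next URL"


if __name__ == "__main__":
    print(choose_branch(PAGE_WITH_BUTTON))
    print(choose_branch(PAGE_WITHOUT_BUTTON))
```

The same test, “does this XPath match an element on the current page”, is what decides which branch a detail page falls into.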
Click on the reply button, then choose “Click an item”.
In the workflow designer, drag the “Click an item” action inside the left branch.
Click the email in the built-in browser, then choose “extract text”.
Click the job title, then choose “extract text”. Define these two fields. Click the tick button.
Click the left branch box. In the popup window, click OK. Choose the last option “The current page contains the elements” and enter the XPath of the button.
AJAX allows the example website to update its content without reloading the whole web page.
In essence, an AJAX request updates part of the data on a web page without refreshing the entire page. We set the timeout parameter, that is, the amount of time to wait for the AJAX requests to finish, so that the next step can be executed.
So in each “Click Item” step in the workflow designer, we need to tick the option “Load page with AJAX” and select “15 seconds” from the Ajax Timeout list in the advanced options. Click save.
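The role of the timeout can be illustrated with a small polling loop. This is only a sketch of the general “wait until the data arrives or give up” idea, not Octoparse’s implementation. The names wait_for and FakeAjaxEmail are hypothetical, and the fake element simply pretends that an email address appears after a few checks.

```python
import time
from typing import Callable, Optional


def wait_for(condition: Callable[[], Optional[str]],
             timeout: float = 15.0,
             interval: float = 0.5) -> Optional[str]:
    """Poll `condition` until it returns a value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result is not None:
            return result          # the AJAX content has arrived
        if time.monotonic() >= deadline:
            return None            # timed out; continue with the next step
        time.sleep(interval)


class FakeAjaxEmail:
    """Stand-in for a page element that an AJAX response fills in later."""

    def __init__(self, polls_until_ready: int, email: str) -> None:
        self.remaining = polls_until_ready
        self.email = email

    def __call__(self) -> Optional[str]:
        if self.remaining > 0:
            self.remaining -= 1
            return None            # response has not arrived yet
        return self.email


if __name__ == "__main__":
    element = FakeAjaxEmail(polls_until_ready=3, email="jobs@example.com")
    print(wait_for(element, timeout=15.0, interval=0.1))
```

A timeout that is too short risks moving on before the email has appeared, while a long one only costs time on pages where nothing ever loads.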
Drag the whole “Branch Judgment” into the “Loop Item”, following the “Click Item” action. Then drag the whole “Loop Item” into the “Cycle Pages”, before the “Click to paginate” action. Click save.
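The nesting described above (pages, then items, then the branch) can be pictured as two nested loops with a condition inside. The sketch below uses a hypothetical in-memory SITE structure in place of real web pages, where an email of None stands for a detail page without a reply button; it only illustrates the order of the steps.

```python
from typing import Dict, List, Optional

Item = Dict[str, Optional[str]]

# Hypothetical stand-in for the site: a list of result pages,
# each holding the detail pages it links to.
SITE: List[List[Item]] = [
    [   # result page 1
        {"title": "Barista", "email": "a@example.com"},
        {"title": "Line cook", "email": None},   # no reply button
    ],
    [   # result page 2
        {"title": "Driver", "email": "b@example.com"},
    ],
]


def run_workflow(pages: List[List[Item]]) -> List[Dict[str, str]]:
    """Walk pages and items in the same order as the configured workflow."""
    rows: List[Dict[str, str]] = []
    for page in pages:                    # "Cycle Pages"
        for item in page:                 # "Loop Item" + "Click Item"
            email = item["email"]
            if email is not None:         # "Branch Judgment": button present
                rows.append({"title": str(item["title"]), "email": email})
            # otherwise: empty branch, continue with the next URL
        # "Click to paginate" happens here, after all items on the page
    return rows


if __name__ == "__main__":
    for row in run_workflow(SITE):
        print(row)
```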
We’ve finished configuring the rule. Click “Next” and then “Next” again in the top right corner. Then click Local Extraction to run the task.
The extracted data will be shown in this pane, and we can also see the configured rule of the task. Check the built-in browser to see if the task runs as expected.
Export the results to Excel or another format and save the file to the computer.
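If you later want to post-process the results in a script, a two-column table like this one is easy to write as CSV. The sketch below is independent of Octoparse’s own export feature; the field names title and email and the sample row are assumptions for illustration.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize extracted rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "email"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"title": "Barista", "email": "a@example.com"}]
    print(rows_to_csv(sample), end="")
```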
Download Octoparse today to extract data from target web page! It's FREE.
If this video tutorial is not available for you, you can click here to see the corresponding graphic tutorial.