Scraping Dynamic Websites (Example: Bloomberg)
Wednesday, January 11, 2017 6:33 AM
Octoparse enables you to scrape dynamic websites, that is, pages that use AJAX to load content without refreshing the whole web page.
In this web scraping tutorial we will show you how to scrape dynamic content from a website such as bloomberg.com. We will scrape the latest technology news articles, including each article's title, body text, published date, and author, with Octoparse. Getting this real-time dynamic data involves two parts: making a scraping task, and scheduling the task on Octoparse's cloud platform.
The website URL we will use is https://www.bloomberg.com/technology.
The data fields include the article title, the body text of the article, the published date, and the article author.
You can download the extraction task for this tutorial (the .otd file) HERE to begin collecting the data right away, or follow the steps below to build a scraping task that scrapes the latest tech news articles from bloomberg.com.
Part 1. Make a scraping task in Octoparse
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜Complete basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser. ➜ Click the "Go" icon to open the webpage.
(URL of the example: https://www.bloomberg.com/technology)
Step 3. Right click the first article under the subtitle "Americas" of the Global News section. ➜ Create a list of articles with a similar layout. Click "Create a list of items" (articles with similar layout). ➜ "Add current item to the list".
The first article is now added to the list. ➜ Click "Continue to edit the list".
Right click the second article ➜ Click "Add current item to the list" again (now we have all the articles with a similar layout) ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the content.
Note that when we add the second article to the list, Octoparse automatically adds all of the remaining articles, including those under Europe and Asia, to the "Loop Item" box, as we can see in the item list.
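Conceptually, the "Loop Item" step iterates over every article node that shares the same layout. A minimal standard-library Python sketch of that idea is below; the page fragment and its class names are made up for illustration, not Bloomberg's real markup (which is loaded via AJAX, so a plain HTTP fetch may not even contain it):

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real Bloomberg markup differs.
SAMPLE = """
<div class="story"><a href="/news/tech-1">First article</a></div>
<div class="story"><a href="/news/tech-2">Second article</a></div>
<div class="story"><a href="/news/tech-3">Third article</a></div>
"""

class ArticleLinkCollector(HTMLParser):
    """Collects every link found inside a div with class "story"."""

    def __init__(self):
        super().__init__()
        self.in_story = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "story":
            self.in_story = True
        elif tag == "a" and self.in_story:
            self.links.append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_story = False

collector = ArticleLinkCollector()
collector.feed(SAMPLE)
print(collector.links)  # every "item" the loop would visit
```

Each collected link corresponds to one pass of the "Loop Item" box: Octoparse visits the matched item and runs the extraction actions on it.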
Step 4. Extract the content of the article.
Right click the title of the article ➜ Select "Extract text". The other fields can be extracted in the same way.
All the selected content will appear under Data Fields. ➜ Click a "Field Name" to modify it, then click "Save".
Note: Right click the content to avoid triggering its hyperlink, if necessary.
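In code terms, the "Extract text" actions amount to pulling named fields out of each article page. A sketch using the standard library's xml.etree on a hypothetical, well-formed article snippet (the tag and class names are illustrative only, not Bloomberg's actual markup):

```python
import xml.etree.ElementTree as ET

# Hypothetical article snippet, shaped like the fields we chose:
# title, body text, published date, and author.
ARTICLE = """
<article>
  <h1>Example headline</h1>
  <time datetime="2017-01-11">January 11, 2017</time>
  <span class="author">Jane Doe</span>
  <div class="body">First paragraph. Second paragraph.</div>
</article>
"""

root = ET.fromstring(ARTICLE)
record = {
    "title": root.findtext("h1"),
    "published": root.find("time").get("datetime"),
    "author": root.findtext("span[@class='author']"),
    "body": root.findtext("div[@class='body']"),
}
print(record)
```

One such record per looped article is what ends up as one row in the "Data Extracted" pane.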
Step 5. Check the workflow.
Now we need to check the workflow by clicking through the actions from the beginning of the workflow, making sure that we can scrape the AJAX content from the pages.
Go to Web Page ➜ The Loop Item box ➜ Click Item ➜ Extract Data.
Note: If the URL keeps loading even though the content of the website has fully loaded, you can click the (×) sign to stop it from loading.
Step 6. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 7. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format and save the file to your computer.
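Octoparse handles the export through its UI; for reference, the equivalent in code is writing the extracted records to CSV, which Excel opens directly. A minimal stdlib sketch with a couple of hypothetical rows shaped like the fields from Step 4:

```python
import csv
import io

# Hypothetical extracted rows (title, published date, author).
rows = [
    {"title": "Example headline", "published": "2017-01-11", "author": "Jane Doe"},
    {"title": "Another story", "published": "2017-01-12", "author": "John Roe"},
]

# io.StringIO keeps the example self-contained; swap in
# open("articles.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "published", "author"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```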
Part 2. Schedule a task and run it on Octoparse's cloud platform.
Once you have built the scraping task by following the steps above, you can schedule it to run on Octoparse's cloud platform.
Step 1. Find out the task you've just made ➜ double click the task to open it ➜ keep clicking "Next" until you are in the "Done" step ➜ Select the option “Schedule Cloud Extraction Settings” to begin the scheduling process.
Step 2. Set the parameters.
In the “Schedule Cloud Extraction Settings” dialog box, you can select the Periods of Availability for your task's extraction and the Run Mode, which determines how often the periodic task runs to collect data.
· Periods of Availability - the data extraction window, set with a Start date and an End date.
· Run Mode - Once, Weekly, Monthly, Real Time
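To make the Run Mode options concrete, each one boils down to "when is the next run after this one?". The sketch below is a simplified model of that logic, not Octoparse's actual scheduler; the Real Time interval and the 30-day month are assumptions for illustration:

```python
from datetime import datetime, timedelta
from typing import Optional

def next_run(last_run: datetime, mode: str) -> Optional[datetime]:
    """Next scheduled start under a simplified model of the run modes.

    "Once" has no follow-up run. The Real Time interval (1 minute)
    and the fixed 30-day month are assumptions, not Octoparse's
    actual cloud scheduling behavior.
    """
    if mode == "Once":
        return None
    if mode == "Weekly":
        return last_run + timedelta(weeks=1)
    if mode == "Monthly":
        return last_run + timedelta(days=30)
    if mode == "Real Time":
        return last_run + timedelta(minutes=1)
    raise ValueError(f"unknown run mode: {mode}")

start = datetime(2017, 1, 11, 6, 33)
print(next_run(start, "Weekly"))  # 2017-01-18 06:33:00
```

The Periods of Availability then simply bound these runs: once the computed next run falls after the End date, the task stops being queued.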
After you click "OK" in the "Schedule Cloud Extraction Settings" dialog box, the task will be added to the waiting queue, and you can check its status there.
Author: The Octoparse Team
For more information about Octoparse, please click here.
Sign up today!