How to Scrape WordPress Posts
Sunday, January 15, 2017 9:02 PM
In this web scraping tutorial, we will use Octoparse to scrape some posts from a WordPress website: the post titles, published dates, content, and authors. Getting real-time dynamic data in Octoparse takes two parts: making a scraping task, and scheduling the task on Octoparse's cloud platform.
The website URL we will use is https://dailypost.wordpress.com/posts/.
The data fields include post title, published date, post author, post content, and number of comments.
You can download the task (the OTD file) directly to begin collecting the data, or you can follow the steps below to make a scraping task that scrapes the posts from the WordPress website. (Download my extraction task for this tutorial HERE in case you need it.)
Part 1. Make a scraping task in Octoparse
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next"
Step 2. In this tutorial, we will use the "List of URLs" input box to handle pagination.
We need to drag the "Loop Item" box into our workflow ➜ Click the "List of URLs" input box in the Advanced Options and paste the URLs of the target webpages that we want to scrape ➜ Click the "OK" button ➜ Click the "Save" button.
(URL of the example: https://dailypost.wordpress.com/posts/)
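If we want to paste several listing pages into the "List of URLs" box, we can generate them with a short script. The sketch below is only an illustration: it assumes the site paginates with the common WordPress pattern of appending page/N/ to the listing URL, which we have not verified for this site, and build_page_urls is a hypothetical helper name.

```python
from __future__ import annotations


def build_page_urls(base_url: str, pages: int) -> list[str]:
    """Return one URL per listing page; page 1 is the base URL itself.

    Assumption: later pages live at <base>/page/N/ (a common WordPress
    permalink pattern, not confirmed for the target site).
    """
    base = base_url.rstrip("/") + "/"
    urls = [base]
    for n in range(2, pages + 1):
        urls.append(f"{base}page/{n}/")
    return urls


if __name__ == "__main__":
    # Print one URL per line, ready to paste into the "List of URLs" box.
    for url in build_page_urls("https://dailypost.wordpress.com/posts/", 3):
        print(url)
```

Each printed line can be pasted as one entry in the input box. If the site uses a different pagination scheme, only the f-string needs to change.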
Step 3. Create a list of target areas with a similar layout. Right click the first post ➜ Click "Create a list of items" (posts with a similar layout) ➜ Click "Add current item to the list".
Right click the last post ➜ Click "Add current item to the list" again (now we have all the posts with a similar layout) ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the content of the posts.
Normally, Octoparse adds all of the remaining posts to the "Loop Item" box for us, which we can confirm by looking at the item list.
Step 4. Extract the content of the posts.
Right click the title of the post ➜ Select "Extract text". The other fields can be extracted in the same way.
All the extracted content will appear in Data Fields ➜ Click the "Field Name" to modify it. Then click "Save".
Note: Right click the content, if necessary, to avoid triggering its hyperlink.
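For readers who want to see what "Extract text" amounts to outside the Octoparse interface, here is a rough sketch using only Python's standard library. The sample HTML and its class names (entry-title, entry-date, author, entry-content) are assumptions made up for illustration; they are not taken from the target site, and this is not how Octoparse itself is implemented.

```python
from html.parser import HTMLParser

# Hypothetical markup: these class names are illustrative assumptions only.
SAMPLE_HTML = """
<article>
  <h2 class="entry-title"><a href="/post-1/">First post</a></h2>
  <time class="entry-date">January 15, 2017</time>
  <span class="author">Jane Doe</span>
  <div class="entry-content"><p>Hello, world.</p></div>
</article>
"""

# Map each assumed CSS class to the field name we want in the output.
FIELDS = {
    "entry-title": "title",
    "entry-date": "published_date",
    "author": "author",
    "entry-content": "content",
}


class PostFieldParser(HTMLParser):
    """Collect the text inside elements whose class matches FIELDS.

    A sketch for well-nested markup; void tags such as <br> inside a
    captured element are not handled.
    """

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # field currently being captured, if any
        self._depth = 0     # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in FIELDS:
                self._field = FIELDS[cls]
                self._depth = 1
                self.record[self._field] = ""
                break

    def handle_endtag(self, tag):
        if self._field is not None:
            self._depth -= 1
            if self._depth == 0:
                # The captured element is closed: tidy the text and stop.
                self.record[self._field] = self.record[self._field].strip()
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.record[self._field] += data


def extract_post(html):
    parser = PostFieldParser()
    parser.feed(html)
    return parser.record


if __name__ == "__main__":
    print(extract_post(SAMPLE_HTML))
```

Extracting the text of the whole element, rather than following its link, mirrors the note above about not triggering the hyperlink.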
Step 5. Check the workflow.
Now we need to check the workflow by clicking through the actions from the beginning of the workflow, to make sure that we can scrape the content from the pages.
The Loop Item box ➜ Go to Web Page ➜ The Loop Item box ➜ Click Item ➜ Extract Data
Step 6. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the selected data.
Step 7. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
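To give an idea of what an exported file might contain, the sketch below writes the five data fields to CSV with Python's standard csv module. The column names and the sample row are invented for illustration, and Octoparse's real export layout may differ.

```python
import csv
import io

# Assumed column names, based on the data fields listed in this tutorial.
FIELDNAMES = ["title", "published_date", "author", "content", "comments"]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample record, not real scraped data.
    sample = [{
        "title": "First post",
        "published_date": "January 15, 2017",
        "author": "Jane Doe",
        "content": "Hello, world.",
        "comments": 3,
    }]
    print(rows_to_csv(sample))
```

Values that contain commas, such as the date, are quoted automatically by the csv module, so the file opens cleanly in Excel.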
Part 2. Schedule a task and run it on Octoparse's cloud platform.
Once you have made the scraping task by following the steps above, you can schedule it to run on Octoparse's cloud platform.
Step 1. Find the task you've just made ➜ double click the task to open it ➜ keep clicking "Next" until you are in the "Done" step ➜ Select the option “Schedule Cloud Extraction Settings” to begin the scheduling process.
Step 2. Set the parameters.
In the “Schedule Cloud Extraction Settings” dialog box, you can select the Periods of Availability for your task's extraction and the Run Mode, which runs your periodic task to collect data at varying intervals.
· Periods of Availability - The data extraction period, set by the Start date and End date.
· Run Mode - Once, Weekly, Monthly, Real Time
We can set a suitable time interval to collect the posts and click "Start" to schedule the task. After you click "OK" in the Cloud Extraction Scheduled window, the task will be added to the waiting queue and you can check the status of the task.
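To picture how a Period of Availability and a Weekly run mode combine, the sketch below lists the dates on which a weekly task would run between a start date and an end date. It only illustrates the idea: weekly_runs is a hypothetical helper, the dates are examples, and Octoparse's actual scheduler may behave differently.

```python
from datetime import date, timedelta


def weekly_runs(start, end):
    """Return every run date from start to end (inclusive), seven days apart."""
    runs = []
    current = start
    while current <= end:
        runs.append(current)
        current += timedelta(days=7)
    return runs


if __name__ == "__main__":
    # Example period only; pick your own Start date and End date.
    for run_date in weekly_runs(date(2017, 1, 15), date(2017, 2, 5)):
        print(run_date.isoformat())
```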
Author: The Octoparse Team