How to Scrape WordPress Posts
Sunday, January 15, 2017 9:02 PM
In this web scraping tutorial, we will show you how to scrape the daily posts from the WordPress website. The data fields we will extract are the post title, author name, post introduction, and published date. The target URL for this task is https://dailypost.wordpress.com/posts/.
Step 1. Create a new task with the sample URL
- Enter the sample URL into the search bar on the home screen and click Start
Step 2. Create a pagination loop to click through multiple pages
- Click the icon on the webpage and select Loop click single URL from the Tips panel
- Set AJAX timeout to 3s
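
For readers who want to see the idea behind this step in code, here is a minimal Python sketch of a "click, then wait for AJAX content" loop. It is an illustration only, not how Octoparse works internally. The names click_next and count_items are hypothetical callbacks that you would supply, and the 3-second default simply mirrors the timeout set above.

```python
import time


def wait_for_new_items(count_items, previous_count, timeout_s=3.0, poll_s=0.1):
    """Poll until more items are present than before, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if count_items() > previous_count:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def click_through_pages(click_next, count_items, timeout_s=3.0, max_clicks=50):
    """Click the pagination control until no new items load in time.

    click_next and count_items are hypothetical callbacks: the first
    triggers the next page, the second reports how many posts are visible.
    Returns the number of clicks that loaded new content.
    """
    clicks = 0
    while clicks < max_clicks:
        before = count_items()
        click_next()
        if not wait_for_new_items(count_items, before, timeout_s):
            break  # nothing new appeared within the timeout, so stop
        clicks += 1
    return clicks
```

The max_clicks cap is an added safeguard so the loop cannot run forever on a page that keeps loading content.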
Step 3. Create a loop item for the daily posts
- Click the title of the first post and choose Select All from the Tips panel
- Click Extract text from the selected links to extract the post title (Building up the extraction loop)
- Click the loop item and change its XPath to //div[@class="archive-list"]/article
- Click on the author name and click Extract text from the selected links to extract the author name (No need to choose Select All because the loop is already established)
- Repeat the last step to extract the post introduction and the published date
- Turn to the Data Preview section and rename the data fields
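
The loop item and its data fields can also be pictured in code. The sketch below uses Python's standard xml.etree.ElementTree module on a small, made-up fragment. The markup and the per-field XPaths are assumptions for illustration; only the loop XPath comes from this tutorial. ElementTree supports a limited XPath subset and requires a leading "." for relative searches, so the loop XPath is written as .//div[...] rather than //div[...].

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure may differ.
SAMPLE = """
<html><body>
<div class="archive-list">
  <article>
    <h2><a href="#">First post</a></h2>
    <span class="author">Alice</span>
    <p class="summary">Intro to the first post.</p>
    <time>January 14, 2017</time>
  </article>
  <article>
    <h2><a href="#">Second post</a></h2>
    <p class="summary">Intro to the second post.</p>
    <time>January 15, 2017</time>
  </article>
</div>
</body></html>
"""

# Loop item XPath from the tutorial, adapted to ElementTree's relative syntax.
LOOP_XPATH = './/div[@class="archive-list"]/article'

# Assumed field XPaths, evaluated relative to each loop item.
FIELD_XPATHS = {
    "title": ".//h2/a",
    "author": ".//span[@class='author']",
    "intro": ".//p[@class='summary']",
    "published": ".//time",
}


def extract_posts(markup):
    """Return one dict per article; a field is None when its node is absent."""
    root = ET.fromstring(markup.strip())
    rows = []
    for article in root.findall(LOOP_XPATH):
        row = {}
        for name, xpath in FIELD_XPATHS.items():
            node = article.find(xpath)
            has_text = node is not None and node.text
            row[name] = node.text.strip() if has_text else None
        rows.append(row)
    return rows
```

Because each field XPath is evaluated relative to its article, a post that lacks a field yields None for that field instead of picking up a value from a neighboring post.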
Step 4. Check the workflow and element XPaths
Now check the workflow by clicking each action in order, starting from the beginning of the workflow. Make sure the content can be scraped from every page.
If you notice any data missing from the Data Preview section, check the XPath for your data fields. Sometimes you need to write them manually. Check our new help portal for tutorials on XPath.
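
If you export a sample of the data, a short script can help spot which fields come back empty. The following is a minimal Python sketch, assuming the rows are available as a list of dictionaries. The function name and field names are hypothetical.

```python
def missing_field_report(rows, fields):
    """Map each field name to the row indexes where its value is empty."""
    report = {}
    for field in fields:
        missing = [i for i, row in enumerate(rows) if not row.get(field)]
        if missing:
            report[field] = missing
    return report


# Hypothetical sample rows; the second one lacks an author.
sample_rows = [
    {"title": "First post", "author": "Alice"},
    {"title": "Second post", "author": ""},
]
report = missing_field_report(sample_rows, ["title", "author"])
```

A field that is empty in every row usually points to a wrong XPath, while a field that is empty in only a few rows may simply be absent from those posts.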
Step 5. Run the task and export the data collected
You can now run the task on your local machine or in the cloud. You can even schedule it to run on Octoparse's cloud platform.
Happy Data Hunting!
Author: The Octoparse Team