Scrape Data from Import.io: Website Made by WordPress
Thursday, September 29, 2016 8:16 AM
(Download my extraction task of this tutorial HERE just in case you need it.)
WordPress is widely regarded as one of the best platforms for personal blogs and business sites. It lets you easily create your own website or blog and write about the things you want to share with others.
Import.io is one example of a site that uses WordPress for its layout. Here I will use it to show you how to scrape data from a strong competitor in order to improve your own product.
Import.io is a mature product in the data extraction area, and I want to know how they operate it. I believe their blog is a good source for that kind of information. To get up-to-date data and analyze the posts more thoroughly, I decided to extract their blog posts. Below I will show you how to do that.
Step 1. Choose “Advanced Mode” ➜ Click “Start” ➜ Complete the basic information ➜ Click “Next”.
Step 2. Enter the target URL of Import.io in the built-in browser ➜ Click the “Go” icon to open the webpage.
( URL of the example: https://www.import.io/blog/ )
Step 3. Click the pagination link “Older” ➜ Click “Expand the selection area” until “Loop click in the element” appears ➜ Choose “Loop click in the element” to turn the page.
(Note: If you want to extract information from every page of search results, you need to add a page navigation action.)
Step 4. Move your cursor over the sections with a similar layout, from which you want to extract data.
Click the first highlighted link ➜ Click “Create a list of items” (sections with a similar layout) ➜ Click “Add current item to the list”. The first highlighted link is now added to the list. ➜ Click “Continue to edit the list”.
Click the second highlighted link ➜ Click “Add current item to the list” again. Now all the links with a similar layout are in the list. ➜ Click “Finish Creating List” ➜ Click “loop” to process the list and extract the elements in each page.
Step 5. Extract the section you want ➜ Click the section ➜ Select “Extract text”.
Step 6. All the extracted content will be listed in Data Fields ➜ Click a “Field Name” to rename it.
Step 7. Drag the second “Loop Item” in front of the “Click to paginate” action in the Workflow Designer, so that the elements of every section are grabbed from multiple pages.
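To see why the order in Step 7 matters, here is a minimal Python sketch of the same idea: an outer loop over the listing pages, an inner loop over the items on each page, and the pagination step running only after the items are processed. This is not how Octoparse works internally. The PAGES dictionary, the "post" and "older" class names, and the function names are all made up for illustration, and the pages are small in-memory strings rather than the real Import.io blog.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": two listing pages linked by an "Older" link.
PAGES = {
    "/blog/": '<a class="post" href="/blog/a">Post A</a>'
              '<a class="post" href="/blog/b">Post B</a>'
              '<a class="older" href="/blog/page/2/">Older</a>',
    "/blog/page/2/": '<a class="post" href="/blog/c">Post C</a>',
}


class ListingParser(HTMLParser):
    """Collects post titles and the "Older" pagination link from one page."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.older_url = None
        self._in_post = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("class") == "post":
            self._in_post = True
        elif attrs.get("class") == "older":
            self.older_url = attrs.get("href")

    def handle_data(self, data):
        # Only text inside a "post" link counts as a title.
        if self._in_post:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_post = False


def scrape_all(start_url, pages):
    rows = []
    url = start_url
    while url is not None:            # outer loop: one pass per listing page
        parser = ListingParser()
        parser.feed(pages[url])
        parser.close()
        for title in parser.titles:   # inner loop: plays the role of "Loop Item"
            rows.append({"page": url, "title": title})
        url = parser.older_url        # pagination runs after the items are done
    return rows


rows = scrape_all("/blog/", PAGES)
```

If the pagination step ran before the inner loop, the items on the current page would be skipped, which is the situation Step 7 avoids.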
Step 8. Click “Next” ➜ Click “Next” ➜ Click “Cloud Extraction”. You can see your task's status in the task list.
(Note: Cloud Extraction is not available in the Free Edition. For more information about the different editions, you can click HERE.)
Step 9. You can also schedule Cloud Extraction to suit your needs. For example, to extract the data once a week on Monday, click “Weekly”, “Monday”, “0:00” ➜ Click “Start”. The schedule has to be set before Cloud Extraction can automatically extract the selected data at your chosen time.
Step 10. You can also click “Cloud Extraction” to run the task immediately, without waiting for the scheduled time.
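To make the “Weekly”, “Monday”, “0:00” setting concrete, the sketch below computes when such a schedule would next fire. It only illustrates the arithmetic. The function name next_weekly_run is hypothetical, and nothing here reflects how Octoparse schedules tasks on its servers.

```python
from datetime import datetime, timedelta


def next_weekly_run(now, weekday=0, hour=0, minute=0):
    """Return the first moment strictly after `now` that falls on
    `weekday` (Monday is 0) at hour:minute."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # Move forward to the requested weekday (0 to 6 days ahead).
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    # If that moment is not in the future, use the following week.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# Assumed starting point: the date of this post, a Thursday.
first_run = next_weekly_run(datetime(2016, 9, 29, 8, 16))
```

Starting from Thursday, September 29, 2016, the first run falls on the following Monday at 0:00, and each later run is seven days after the previous one.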
Step 11. The extracted data will be shown in the "Data Extracted" pane. Click the “View Data” button to view it. You can then export the results to an Excel file, a database or other formats and save the file to your computer. See the exported results below.
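If you later want to post-process the exported rows yourself, a CSV file is usually the easiest format to work with. The sketch below writes a list of rows to CSV text with Python's standard csv module. The field names and the two sample rows are placeholders, not real data from the Import.io blog, and rows_to_csv is a hypothetical helper rather than part of Octoparse.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows standing in for extracted blog posts.
sample_rows = [
    {"title": "Post A", "url": "/blog/a"},
    {"title": "Post B", "url": "/blog/b"},
]
csv_text = rows_to_csv(sample_rows, ["title", "url"])
```

The resulting text can be saved with a .csv extension and opened in Excel or loaded into a database.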
Author: The Octoparse Team