Scrape data on Instagram
Saturday, October 27, 2018
In this tutorial, we are going to scrape data from Instagram, including the post content, date, image URL, number of likes and location.
To follow along, you may want to use this URL in the tutorial:
Here are the main steps in this tutorial: [Download demo task file here]
1) "Go To Web Page" - to open the targeted web page
· Create the task with "Advanced Mode".
· Paste the URL into the "Extraction URL" box and click "Save URL" to move on.
· Change the default built-in browser
The default built-in browser of Octoparse 7 is incompatible with Instagram. To load the target page properly, we need to change the browser setting.
· Click "Setting"
If you use Octoparse 7.0.2, please save the task before modifying the settings.
· Switch the default built-in browser to Firefox 45.0.
· Click "Save" to apply the modified setting.
2) Create a pagination loop - to scrape data from multiple posts
We can use the ">" button as the "Next page" button to go to the next post. Before creating the pagination loop, we need to open the first post.
· Click the first post and click the "A" tag at the bottom of "Action Tips"
When you select an item that has a URL, the selected tag should be "A". Normally there is no need to change it, because Octoparse automatically identifies the tags of selected items. In this case, however, we need to change the tag ourselves at the bottom of "Action Tips".
· Select "Click the link"
The first post is now open. However, because Instagram loads its content with AJAX, we should set up AJAX Load for the "Click Item" action.
· Uncheck "Auto retry when no response"
· Check "Load the page with AJAX"
· Set up "AJAX Timeout"
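For readers who prefer to think in code, the idea behind an AJAX timeout can be sketched roughly as below. This is a generic polling technique for hand-written scrapers, not a description of how Octoparse works internally. The names `wait_for_ajax`, `content_changed`, `timeout_s` and `poll_interval_s` are all hypothetical and chosen only for illustration.

```python
import time
from typing import Callable


def wait_for_ajax(content_changed: Callable[[], bool],
                  timeout_s: float = 5.0,
                  poll_interval_s: float = 0.1) -> bool:
    """Poll until content_changed() is True or timeout_s seconds elapse.

    content_changed is a hypothetical callback that reports whether the
    AJAX-loaded content has appeared on the page.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if content_changed():
            return True  # content arrived before the timeout
        time.sleep(poll_interval_s)
    # One last check at the deadline, then give up.
    return content_changed()
```

A longer timeout tolerates slow connections at the cost of a slower run, which is the same trade-off you make when choosing the "AJAX Timeout" value in the task.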
Now, we can create the "Pagination".
· Click the ">" button
· Click "Loop click next page" on the "Action Tips"
Instagram uses AJAX on the ">" button, so we need to set up AJAX Load for the "Click to Paginate" action as well.
· Click "Load the page with AJAX" on the "Customize Action"
· Set up "AJAX timeout"
To learn more about dealing with AJAX in Octoparse, please refer to Deal with AJAX.
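The pagination loop built in this step follows a simple pattern: read the current post, click ">" if it exists, and repeat. The Python sketch below models that pattern only. `PostViewer` is a hypothetical stand-in for the Instagram post overlay, and nothing here calls Octoparse or Instagram. It assumes the viewer holds at least one post.

```python
from typing import Iterator, List


class PostViewer:
    """Hypothetical stand-in for the post overlay with a '>' button."""

    def __init__(self, posts: List[str]) -> None:
        self._posts = posts
        self._index = 0  # start on the first post

    def current(self) -> str:
        return self._posts[self._index]

    def has_next(self) -> bool:
        # The '>' button is missing on the last post.
        return self._index < len(self._posts) - 1

    def click_next(self) -> None:
        self._index += 1


def paginate(viewer: PostViewer) -> Iterator[str]:
    """Yield each post in order, clicking '>' until it disappears."""
    while True:
        yield viewer.current()
        if not viewer.has_next():
            break
        viewer.click_next()


if __name__ == "__main__":
    viewer = PostViewer(["post 1", "post 2", "post 3"])
    print(list(paginate(viewer)))  # ['post 1', 'post 2', 'post 3']
```

The loop must begin on the first post, otherwise earlier posts are never visited. This is why the tutorial opens the first post before creating the pagination loop.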
3) Extract data - to select the data for extraction
We are now on the second post. When creating a "Loop Item", we should always start with the first item on the first page. In this case, we should go back to the first post.
· Click "Go To Web Page" in the workflow
· Click "Click Item"
Octoparse will open the first post.
· Click the pagination loop in the workflow
By doing this, we can help Octoparse decide the execution order and generate the "Extract data" step at the appropriate position in the workflow.
Now, let's start extracting data.
· Select the data you want
· Click "extract data" on the "Action Tips"
To learn more about how to adjust the workflow, please refer to Getting to know Octoparse.
4) Customize the data field - to revise the field name (optional)
· Revise the field name
Type a new name or select one from the pre-defined options.
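If you skip this step and rename the columns after export instead, a small script can do the same job. The sketch below is only an illustration: the default names `Field1` to `Field5` are an assumption about what the export might contain, the target names are hypothetical, and the sample row holds placeholder values rather than real scraped data.

```python
import csv
import io
from typing import Dict, List

# Assumed default names mapped to descriptive ones (both hypothetical).
RENAMES: Dict[str, str] = {
    "Field1": "post_content",
    "Field2": "date",
    "Field3": "image_url",
    "Field4": "likes",
    "Field5": "location",
}


def rename_fields(rows: List[Dict[str, str]],
                  renames: Dict[str, str]) -> List[Dict[str, str]]:
    """Return new rows with keys renamed; unknown keys are kept as-is."""
    return [{renames.get(key, key): value for key, value in row.items()}
            for row in rows]


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize rows to CSV text, using the first row's keys as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder row for illustration only.
    sample = [{
        "Field1": "example caption",
        "Field2": "2018-10-27",
        "Field3": "https://example.com/image.jpg",
        "Field4": "0",
        "Field5": "Example City",
    }]
    print(to_csv(rename_fields(sample, RENAMES)))
```

Renaming inside the task is usually simpler, since every later export then carries the descriptive names.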
5) Save and start extraction - to run the task and get data
· Click "Start Extraction"
· Select "Local Extraction" to start execution.
Below is the sample output.