Scrape data on Instagram

Saturday, October 27, 2018

In this tutorial, we are going to scrape data from Instagram, including the post content, date, image URL, number of likes and location.

To follow along, you can use this example URL:

https://www.instagram.com/izkiz/

 

Here are the main steps in this tutorial: [Download demo task file here]

1) "Go To Web Page" - to open the targeted web page

2) Create a pagination loop - to scrape data from multiple posts

3) Extract data - to select the data for extraction

4) Customize the data field - to revise the field name (Optional)

5) Save and start extraction - to run the task and get data

 

 

 

 

 

1) "Go To Web Page" - to open the targeted web page

· Create the task with "Advanced Mode".

· Paste the URL into the "Extraction URL" box and click "Save URL" to move on

· Change the default built-in browser

The default built-in browser of Octoparse 7 is incompatible with Instagram. To load the target page properly, we need to change the browser setting.

· Click "Setting"

If you use Octoparse 7.0.2, save the task before modifying the settings.

· Switch the default built-in browser to Firefox 45.0.

· Click "Save" to apply the modified setting

  

 

 

 

2) Create a pagination loop - to scrape data from multiple posts

We can use the ">" button as the "Next page" button to move from one post to the next. Before creating the pagination loop, we need to open the first post.

· Click the first post and click the "A" tag at the bottom of "Action Tips"

When you select an item that has a URL, the selected tag is normally "A". Usually there is no need to change it, because Octoparse identifies the tag of the selected item automatically. In this case, however, we need to select the "A" tag manually at the bottom of "Action Tips".

· Select "Click the link"

 

We have the first post opened now. However, as Instagram loads the content with AJAX, we should set up AJAX Load for the "Click Item" action.

· Uncheck "Auto retry when no response"

· Check "Load the page with AJAX"

· Set up "AJAX Timeout"

 

Now we can create the pagination loop.

· Click the ">" button

· Click "Loop click next page" on the "Action Tips"

 

Instagram uses AJAX on the ">" button, so we need to set up AJAX Load for the "Click to Paginate" action as well.

· Click "Load the page with AJAX" on the "Customize Action"

· Set up "AJAX timeout"
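
To make the idea of an AJAX timeout more concrete, here is a minimal Python sketch of a bounded wait: after a click, keep checking for the new content, but give up once a time limit passes. This is a generic illustration and not a description of how Octoparse works internally; `wait_for_content`, `fake_post_loaded`, and the timing values are hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable


def wait_for_content(is_loaded: Callable[[], bool],
                     timeout_s: float,
                     poll_interval_s: float = 0.05) -> bool:
    """Poll until is_loaded() returns True or timeout_s elapses.

    Returns True if the content appeared in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_loaded():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Simulated page (an assumption for this sketch): the post content
# becomes available on the third check.
checks = {"count": 0}


def fake_post_loaded() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


# Content arrives well within the limit.
loaded_in_time = wait_for_content(fake_post_loaded, timeout_s=2.0)

# Content never arrives, so the wait ends when the limit is reached.
timed_out = wait_for_content(lambda: False, timeout_s=0.2)
```

A longer timeout tolerates slow loading but makes every click slower when nothing new appears, which is why the value is worth tuning per site.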

 

Tips!

To learn more about dealing with AJAX in Octoparse, please refer to "Deal with AJAX".

 

 

 

3) Extract data - to select the data for extraction

We are now on the second post. When creating a "Loop Item", we should always start with the first item on the first page. In this case, we should go back to the first post.

· Click "Go To Web Page" in the workflow

· Click "Click Item"

Octoparse will open the first post.

· Click the pagination loop in the workflow

By doing this, we can help Octoparse decide the execution order and generate the "Extract data" step at the appropriate position in the workflow.

 

Now, let's start extracting data.

· Select the data you want

· Click "Extract data" on the "Action Tips"
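
The workflow built so far (open the first post, extract data, click ">" and repeat) can be sketched in Python as a simple loop. This is only an illustration of the execution order, assuming in-memory stand-ins instead of real Instagram pages; the `Post` class, `scrape_posts`, and all field values below are hypothetical placeholders, not data from the sample profile.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Post:
    """Stand-in for one opened post; all values are placeholders."""
    content: str
    date: str
    image_url: str
    likes: int
    location: str
    next_post: Optional["Post"] = None  # what the ">" button would lead to


def scrape_posts(first_post: Optional[Post]) -> List[Dict[str, object]]:
    """Extract one row per post, following the 'next' link until it ends."""
    rows: List[Dict[str, object]] = []
    current = first_post
    while current is not None:
        # "Extract data" step: collect the selected fields of this post.
        rows.append({
            "content": current.content,
            "date": current.date,
            "image_url": current.image_url,
            "likes": current.likes,
            "location": current.location,
        })
        # "Click to Paginate" step: move to the next post, if there is one.
        current = current.next_post
    return rows


# Three placeholder posts chained together, starting from the first one.
third = Post("placeholder caption 3", "date 3", "https://example.com/3.jpg", 30, "place 3")
second = Post("placeholder caption 2", "date 2", "https://example.com/2.jpg", 20, "place 2", third)
first = Post("placeholder caption 1", "date 1", "https://example.com/1.jpg", 10, "place 1", second)

rows = scrape_posts(first)
```

Starting from the first post matters here for the same reason it does in the workflow: a loop that begins at the second post never produces a row for the first one.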

 

Tips!

To learn more about how to adjust the workflow, please refer to "Getting to know Octoparse".

 

 

 

4) Customize the data field - to revise the field name (Optional)

· Revise the field name

Type a new name or select one from the pre-defined options.
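
If you prefer to rename fields after export instead, the same result can be reached with a small mapping. The sketch below assumes generic default names such as "Field1"; these names, the `rename_fields` helper, and the sample row are hypothetical and may not match what your task actually produces.

```python
from typing import Dict


def rename_fields(row: Dict[str, object], name_map: Dict[str, str]) -> Dict[str, object]:
    """Return a copy of row with keys renamed according to name_map.

    Keys that are not listed in name_map are kept unchanged.
    """
    return {name_map.get(key, key): value for key, value in row.items()}


# Hypothetical default names mapped to descriptive ones.
name_map = {"Field1": "content", "Field2": "likes"}

# Placeholder row, not real extracted data.
raw_row = {"Field1": "placeholder caption", "Field2": 10, "date": "placeholder date"}

clean_row = rename_fields(raw_row, name_map)
```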

 

 

 

 

 

 

 

5) Save and start extraction - to run the task and get data

· Click "Start Extraction"

· Select "Local Extraction" to start execution.

 

Below is the sample output.

 
