Step-by-step tutorials for you to get started with web scraping


Extract data behind a login

Thursday, August 16, 2018

When the target data is behind authentication, you can still access it with Octoparse. Simply enter the login information (username and password) and click the "Sign In" button to log in. In this tutorial, we will show you how to extract data behind a login, as well as how to use cookies to optimize the workflow of your task.


 

1) Enter login information to sign in

2) Use cookies to optimize the workflow

 

 

 

Enter login information to sign in

  • Click on the textbox for username input on the web page


  • Select "Enter text" from Action Tips


  • Input the username into the textbox


  • Click "OK"; the username you entered is automatically populated into the username textbox on the web page
  • Follow the same steps to enter the password
  • Click the "Sign In" button on the page


  • From Action Tips, select "Click button"

Octoparse has now logged in to the website successfully!
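
For readers curious what these point-and-click steps correspond to outside Octoparse, here is a minimal Python sketch of a form login using only the standard library. It is an analogy, not a description of how Octoparse works internally. The URL and the form field names `username` and `password` are hypothetical, and a real site may require extra fields such as a CSRF token.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical login endpoint; replace with the real form action URL.
LOGIN_URL = "https://example.com/login"


def build_login_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build the POST request a browser would send when "Sign In" is clicked."""
    # Field names are assumptions; inspect the real form to find the actual ones.
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(url, data=form.encode("utf-8"), method="POST")


def make_session() -> tuple[urllib.request.OpenerDirector, CookieJar]:
    """Create an opener that stores any cookies the server sets after login."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


request = build_login_request(LOGIN_URL, "alice", "s3cret")
opener, jar = make_session()
# opener.open(request) would perform the login over the network;
# it is left out here so the sketch runs offline.
print(request.get_method(), request.full_url)
```

After a successful `opener.open(request)`, the cookie jar would hold whatever session cookie the site issued, which is the piece the cookie section of this tutorial relies on.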

 

 

 

Use cookies to optimize the workflow

1. Save cookies

Most of the time, you can optimize the workflow by saving the cookie in the task after login. This way, Octoparse will send the saved cookie to the website when the page loads, and there's a good chance the website will remember "you" and skip the login steps.

  • Log in to the website in Octoparse's built-in browser if you have not already done so.
  • Switch to Workflow Mode by toggling the Workflow switch at the top, then drag a "Go To Web Page" action into the workflow and position it right below the sign-in steps.
  • Enter the URL of the page needed for the capture into the text box for "Page URL"


  • Under "Advanced Options", click open "Cache Settings"
  • Select "Use specified Cookie"
  • Click "Load cookie from current web page"
  • Click "OK" to save the settings

 

  • Because the website should now "remember" the login and skip the login steps, remove the previously created login actions to avoid issues when the workflow is executed. Right-click each action and select "Delete".
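
The idea behind saving a cookie and sending it on the next run can be sketched in Python with the standard library's `http.cookiejar`. This is only an illustration of the concept, not Octoparse's implementation. The cookie name `sessionid`, its value, and the domain are made up, and the cookie is constructed by hand to stand in for one a site would set after login.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name: str, value: str, domain: str, expires: int) -> Cookie:
    """Build a persistent cookie; stands in for one set by a site after login."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=True, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )


def save_cookies(jar: MozillaCookieJar, path: str) -> None:
    """Write the jar to disk, similar in spirit to saving a cookie in a task."""
    jar.save(path)


def load_cookies(path: str) -> MozillaCookieJar:
    """Read cookies back so a later run can send them with its first request."""
    jar = MozillaCookieJar()
    jar.load(path)
    return jar


cookie_file = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# First run: the jar holds the cookie obtained by logging in.
logged_in = MozillaCookieJar()
logged_in.set_cookie(make_cookie("sessionid", "abc123", "example.com", int(time.time()) + 3600))
save_cookies(logged_in, cookie_file)

# Later run: no login steps, just the restored cookie.
restored = load_cookies(cookie_file)
names = [c.name for c in restored]
print(names)
```

Note that `save()` skips session cookies and already-expired cookies by default, which mirrors the caveat in the tips that follow: a saved cookie only helps while it is still valid.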

Tips!

A saved cookie is only effective until it expires

Cookies come in many different forms. Some have a specific expiration time; others expire as soon as the browser is closed. In Octoparse, a saved cookie stops working once it expires. To fix this, add the login actions back to the workflow and go through the login steps again to obtain and save an updated cookie.
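
The two kinds of expiry described above can be expressed as a small check. The sketch below is hypothetical: `SavedCookie` and `is_usable` are illustrative names, and the rule that a cookie without an expiration time dies when the browser closes models general browser behavior rather than anything specific to Octoparse.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class SavedCookie:
    name: str
    expires: Optional[float]  # Unix timestamp; None marks a session cookie


def is_usable(cookie: SavedCookie, now: float, browser_closed: bool) -> bool:
    """Return True if the saved cookie can still be expected to work."""
    if cookie.expires is None:
        # Session cookie: valid only until the browser is closed.
        return not browser_closed
    # Persistent cookie: valid strictly before its expiration time.
    return now < cookie.expires


now = time.time()
persistent = SavedCookie("remember_me", expires=now + 3600)
stale = SavedCookie("remember_me", expires=now - 1)
session = SavedCookie("sessionid", expires=None)

for cookie, closed in [(persistent, True), (stale, False), (session, True)]:
    action = "reuse cookie" if is_usable(cookie, now, closed) else "log in again and re-save"
    print(cookie.name, "->", action)
```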

Your password is well-protected

· In Octoparse, the password you enter is accessible only from your own account. When a task is exported, Octoparse automatically removes the password saved in the task.

· Any login information saved will be removed from your account permanently as soon as the task is deleted. 

 

 

2. Clear cookies

Because websites handle cookies differently, you may want to start with the login steps every time the task is executed so that the workflow behaves consistently. To do this, clear any saved cookies before the login page is loaded. This way, the target website will always "forget" you and take you to the login page, where you can enter the login information.

  • Click the "Go To Web Page" action for the login page
  • Select "Clear cache before opening the web page" under "Cache Settings"
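
Conceptually, clearing before the login page loads means the first request carries no cookies at all. A minimal Python sketch of that idea with the standard library's `CookieJar` follows. It is an analogy only: the helper names and cookie values are hypothetical, and Octoparse's "Clear cache" option may clear more than cookies.

```python
from http.cookiejar import Cookie, CookieJar


def session_cookie(name: str, value: str, domain: str) -> Cookie:
    """Build a simple session cookie for demonstration purposes."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def clear_before_login(jar: CookieJar) -> int:
    """Drop every stored cookie and return how many were removed."""
    removed = len(jar)
    jar.clear()
    return removed


jar = CookieJar()
jar.set_cookie(session_cookie("sessionid", "abc123", "example.com"))
jar.set_cookie(session_cookie("remember_me", "1", "example.com"))

removed = clear_before_login(jar)
print("cookies removed:", removed, "remaining:", len(jar))
```

With an empty jar, the site has nothing to recognize the visitor by, so the login form is shown on every run.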

           

 

Tips!

Entering captcha manually while running local extraction

· When a captcha is encountered, you can enter it manually when running the task locally. Cloud Extraction does not support captchas.

· Currently Octoparse only supports digital captcha and does not support other types, such as reCAPTCHA v2.

 

 

Related articles:

Introductory lessons  

Set up a waiting time

Local Extraction

Cloud Extraction

Building Task

 
