
Web Scraping Tutorial: Scrape Websites That Require Login

Wednesday, September 28, 2016 6:13 AM


 

Some websites show their content only after users log in with a username and password, which means our target data is behind authentication. Fortunately, it is still possible to access the data with Octoparse. We can either tell Octoparse to enter the login information (username and password) for us, or log in ourselves in browse mode and save the cookies to optimize the workflow.

Let's say we need to scrape data from Facebook.

 

 

1) Enter login information to sign in

 

  • Create a new task in Octoparse with a Facebook-related URL
  • Click on the username textbox (Email or phone number)
  • Select "Enter text" from the Tips panel
  • Type your username into the textbox and click "Confirm". The username you entered is automatically filled in to the username textbox on the web page
  • Repeat the above steps for the "Password" textbox
  • Click the "Log In" button and select "Click button" in the Tips panel
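
For readers curious what these steps amount to at the HTTP level, here is a minimal sketch in Python. Octoparse drives a built-in browser, so this is not how it works internally; it only illustrates that filling in the two textboxes and clicking the button usually results in a form POST. The URL, the field names `email` and `pass`, and the helper `build_login_request` are all assumptions made for illustration. The request is built but never sent.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request


def build_login_request(login_url, username, password,
                        user_field="email", pass_field="pass"):
    """Build (but do not send) a form POST carrying the login credentials.

    The field names are hypothetical; a real site defines its own.
    """
    body = urlencode({user_field: username, pass_field: password}).encode("utf-8")
    return Request(
        login_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder URL and credentials, for illustration only.
req = build_login_request("https://example.com/login", "me@example.com", "s3cret")
print(req.get_method(), req.full_url)
print(parse_qs(req.data.decode("utf-8")))
```

Many real sites add hidden form fields or JavaScript checks to the login form, which is one reason a browser-based tool is often the easier route.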

 

2) Use cookies to optimize the workflow

 

Most of the time, you can optimize the workflow by saving the cookies in the task after login. This way, Octoparse will send the saved cookies to the website during loading, and there's a good chance the website will remember "you" and skip the login steps. 

  • Toggle on browse mode at the top right
  • Log in to the website just as you would in a regular browser
  • After login, go to the "Options" settings of the "Go to web page" action, tick "Use Cookie" and click "Use cookie from the current page"
  • Click "Apply" to save the settings
  • The website should now "remember" the login and skip the login steps the next time the crawler runs
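
The idea behind saving cookies can be sketched in a few lines of Python. This is an illustration of the general mechanism, not Octoparse's own code: after login, the site hands back cookies in `Set-Cookie` headers, and sending the same name/value pairs back in a `Cookie` header on later requests lets the site recognize the session. The header values and the helper names `save_cookies` and `cookie_header` are made up for this example.

```python
from http.cookies import SimpleCookie


def save_cookies(set_cookie_headers):
    """Collect name/value pairs from Set-Cookie response headers."""
    saved = {}
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)
        for name, morsel in jar.items():
            saved[name] = morsel.value
    return saved


def cookie_header(saved):
    """Format saved cookies as the Cookie header sent on later requests."""
    return "; ".join(f"{name}={value}" for name, value in saved.items())


# Hypothetical headers a site might return after a successful login.
response_headers = [
    "session_id=abc123; Path=/; HttpOnly",
    "user=42; Path=/; Max-Age=3600",
]
saved = save_cookies(response_headers)
print(cookie_header(saved))
```

Attributes such as `Path` and `Max-Age` tell the browser how to store a cookie; only the name/value pairs are sent back to the site.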

 

Problem solved! Now you can move on to build your task workflow.

 

Tips!

1. A saved cookie works only until it expires

Cookies come in different forms. Some have a specific expiration time; others expire as soon as the browser is closed. In Octoparse, a saved cookie stops working once it has expired.

To resolve this, go through the login steps again in browse mode to obtain and save an updated cookie.
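
The two kinds of cookie described above can be modeled with a small Python sketch. The `SavedCookie` class and `is_usable` function are hypothetical names used only to illustrate the rule; they do not reflect how Octoparse stores cookies. A cookie with an expiration time is usable until that time passes, while a session cookie (no expiration time) is usable only until the browser is closed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SavedCookie:
    name: str
    value: str
    expires: Optional[datetime] = None  # None marks a session cookie


def is_usable(cookie, now, browser_restarted=False):
    """Return True if the saved cookie should still be accepted."""
    if cookie.expires is None:
        # Session cookies do not survive a browser restart.
        return not browser_restarted
    return now < cookie.expires


# Fixed reference time so the example is reproducible.
now = datetime(2016, 9, 28, tzinfo=timezone.utc)
persistent = SavedCookie("user", "42", expires=now + timedelta(days=30))
stale = SavedCookie("token", "xyz", expires=now - timedelta(hours=1))
session = SavedCookie("session_id", "abc123")

# If any saved cookie is no longer usable, the login steps must be repeated.
needs_login = not all(is_usable(c, now) for c in (persistent, stale, session))
print(needs_login)
```

In practice a website can also invalidate a session on its side before the stated expiration, so a check like this is only a first approximation.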

2. Your password is well-protected

  • In Octoparse, when you enter your password, it is only accessible on your own account. When a task is exported, the password saved in the task gets removed automatically.
  • Any login information saved will be removed from your account permanently as soon as the task is deleted.

3. Entering captcha manually while running local extraction

If you encounter a captcha, you can enter it manually when running the task locally. Cloud Extraction does not support solving captchas.

 

Happy Data Hunting!

Author: The Octoparse Team
