Harvard Business Review has called data scientist the sexiest job of the 21st century. In this article, with the help of Octoparse, one of the best free web scraping tools, we have gathered the resources and tools you may need to become a data scientist.
How to Become a Data Scientist
1. Learning resources: Courses, Degrees/Certificates, Books
2. Tools: Data Extractors, Data Analytics, Reporting
3. Data Science Competitions/Programs
20 Online Courses in Data Science
1. Data Science Specialization
Creator: Johns Hopkins University
This Specialization covers the concepts and tools you’ll need throughout the entire data science pipeline, from asking the right kinds of questions to making inferences and publishing results. In the final Capstone Project, you’ll apply the skills learned by building a data product using real-world data. Upon completion, students will have a portfolio demonstrating their mastery of the material.
2. Introduction to Data Science in Python
Creator: University of Michigan
This course will introduce the learner to the basics of the Python programming environment, including fundamental Python programming techniques such as lambdas, reading and manipulating CSV files, and the NumPy library.
3. Applied Plotting, Charting & Data Representation in Python
Creator: University of Michigan
This course will introduce the learner to information visualization basics, with a focus on reporting and charting using the matplotlib library.
4. Applied Machine Learning in Python
Creator: University of Michigan
This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods.
5. Applied Text Mining in Python
Creator: University of Michigan
This course will introduce the learner to text mining and text manipulation basics.
6. Applied Social Network Analysis in Python
Creator: University of Michigan
This course will introduce the learner to network analysis through tutorials using the NetworkX library.
This course should be taken after: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, and Applied Machine Learning in Python.
7. What is Data Science?
In this course, we will meet some data science practitioners and we will get an overview of what data science is today.
8. Open Source tools for Data Science
In this course, you’ll learn about Jupyter Notebooks, RStudio IDE, Apache Zeppelin, and Data Science Experience.
9. Data Science Methodology
You will learn the major steps involved in tackling a data science problem and in practicing data science, from forming a concrete business or research problem to collecting and analyzing data, building a model, and understanding the feedback after model deployment. You will also learn how data scientists think.
10. Applied Data Science
This is an action-packed specialization for data science enthusiasts who want to acquire practical skills for real-world data problems. It appeals to anyone who is interested in pursuing a career in data science and already has foundational skills (or has completed the Introduction to Applied Data Science specialization). You will learn Python, with no prior programming knowledge necessary. You will then learn about data visualization and data analysis. Through our guided lectures, labs, and projects you’ll get hands-on experience tackling interesting data problems.
11. Databases and SQL for Data Science
The purpose of this course is to introduce relational database concepts and help you learn and apply knowledge of the SQL language. It is also intended to get you started with performing SQL access in a data science environment.
12. Data Science Math Skills
This course is designed to teach learners the basic math they will need in order to be successful in almost any data science math course. It was created for learners who have basic math skills but may not have taken algebra or pre-calculus.
13. Data Science: Wrangling
This course covers several standard steps of the data wrangling process like importing data into R, tidying data, string processing, HTML parsing, working with dates and times, and text mining.
14. Data Science: Productivity Tools
15. Data Science Research Methods: Python Edition
16. How to Win a Data Science Competition: Learn from Top Kagglers
Creator: National Research University Higher School of Economics
If you want to break into competitive data science, then this course is for you! Participating in predictive modeling competitions can help you gain practical experience, improve and harness your data modeling skills in various domains such as credit, insurance, marketing, natural language processing, sales forecasting and computer vision to name a few.
17. Introduction to Computational Thinking and Data Science
Instructors: Prof. Eric Grimson; Prof. John Guttag; Dr. Ana Bell
6.0002 is the continuation of 6.0001 Introduction to Computer Science and Programming in Python and is intended for students with little or no programming experience. It aims to provide students with an understanding of the role computation can play in solving problems and to help students, regardless of their major, feel justifiably confident of their ability to write small programs that allow them to accomplish useful goals. The class uses the Python 3.5 programming language.
18. Introduction to Computer Science and Programming in Python
Instructors: Dr. Ana Bell; Prof. Eric Grimson; Prof. John Guttag
Introduction to Computer Science and Programming in Python is intended for students with little or no programming experience. It aims to provide students with an understanding of the role computation can play in solving problems and to help students, regardless of their major, feel justifiably confident of their ability to write small programs that allow them to accomplish useful goals. The class uses the Python 3.5 programming language.
19. Statistical Thinking and Data Analysis
Instructor(s): Prof. Cynthia Rudin; Allison Chang (Teaching Assistant); Dimitrios Bisias (Teaching Assistant)
This course is an introduction to statistical data analysis. Topics are chosen from applied probability, sampling, estimation, hypothesis testing, linear regression, analysis of variance, categorical data analysis, and nonparametric statistics.
20. SQL for Data Science
University of California, Davis
This course is designed to give you a primer in the fundamentals of SQL and working with data so that you can begin analyzing it for data science purposes.
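To give a rough flavor of the SQL-from-Python workflow that the SQL courses in this list introduce, here is a minimal sketch using Python's built-in sqlite3 module. The function name, table name, columns, and sample rows are all made up for illustration and are not taken from any of the courses.

```python
import sqlite3


def average_score_by_course(rows):
    """Load (course, score) rows into an in-memory SQLite table and
    return the average score per course, sorted by course name."""
    conn = sqlite3.connect(":memory:")  # temporary database, nothing is written to disk
    try:
        conn.execute("CREATE TABLE scores (course TEXT, score REAL)")
        # Parameterized inserts keep the data separate from the SQL text
        conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT course, AVG(score) FROM scores "
            "GROUP BY course ORDER BY course"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample data, for illustration only
sample_rows = [("python", 80.0), ("python", 90.0), ("sql", 70.0)]
result = average_score_by_course(sample_rows)
print(result)
```

The same pattern (connect, run a query, fetch the rows) carries over to other relational databases, although the connection details differ by database driver.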
Data Science Degrees/Certificates
1. Master of Computer Science
University of Illinois at Urbana-Champaign
The Master of Computer Science is a non-thesis degree that requires 32 credit hours of coursework. Students can complete the eight courses required for the Master of Computer Science at their own pace, in as little as one year or as many as five years. Students receive lectures through the Coursera platform, but are advised and assessed by Illinois faculty and teaching assistants on a rigorous set of assignments, projects, and exams required for university degree credit.
Tuition for the 32 credit-hour degree is $19,200.
2. Bachelor of Science in Computer Science
University of London
Tuition: £9,600-£17,000, depending upon the geographic location of students.
The degree, created by the team at Goldsmiths, University of London, is designed to give you a strong foundation in Computer Science and specialized knowledge of topics such as Data Science, Artificial Intelligence, Virtual Reality, and Web Development. Your learning will involve industry and academic case studies to help you understand your studies in terms of real-world problems.
3. Data Science
Tuition: $441.90 USD for the entire program.
You will learn fundamental R programming skills and statistical concepts such as probability, inference, and modeling, and how to apply them in practice. You will gain experience with the tidyverse, including data visualization with ggplot2 and data wrangling with dplyr; become familiar with essential tools for practicing data scientists such as Unix/Linux, git and GitHub, and RStudio; implement machine learning algorithms; and gain in-depth knowledge of fundamental data science concepts through motivating real-world case studies.
4. Microsoft Professional Program in Data Science
Tuition: $1,089 for the entire program
You will learn to: use Microsoft Excel to explore data; use Transact-SQL to query a relational database; create data models and visualize data using Excel or Power BI; apply statistical methods to data; use R or Python to explore and transform data; follow a data science methodology; create and validate machine learning models with Azure Machine Learning; write R or Python code to build machine learning models; apply data science techniques to common scenarios; and implement a machine learning solution for a given data problem.
6. Master of Computer Science
Arizona State University
You will choose 10 courses out of 20 course options in order to develop expertise in emerging in-demand technologies. Choose from areas of focus such as AI, Software Engineering, Cloud Computing, Big Data, and Cybersecurity. You’ll also create a project portfolio that you’ll use to showcase your experience to prospective employers.
Books About Data Science
1. The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists
Author: Carl Shan
In this handbook, 25 industry experts share advice that is especially helpful for beginners.
2. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
Authors: Foster Provost and Tom Fawcett
Data Science for Business introduces the fundamental principles of data science and walks you through the “data-analytic thinking” necessary for extracting useful knowledge and business value from the data you collect. This guide also helps you understand the many data-mining techniques in use today.
3. Doing Data Science: Straight Talk from the Frontline
Authors: Cathy O’Neil and Rachel Schutt
In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.
4. Data Science From Scratch With Python: Step By Step Guide
Author: Peters Morgan
If you are looking for a complete step-by-step guide to data science using Python from scratch, this book is for you. After the success of his first book, “Data Analysis from Scratch with Python”, Peters Morgan published this second book, which focuses on data science and machine learning. Practitioners consider it one of the easiest guides written in this domain.
5. Data Science For Dummies (For Dummies (Computers))
Author: Lillian Pierson
Data Science For Dummies is the perfect starting point for IT professionals and students who want a quick primer on all areas of the expansive data science space. With a focus on business cases, the book explores topics in big data, data science, and data engineering, and how these three areas are combined to produce tremendous value.
6. Introduction to Probability, Statistics, and Random Processes
Author: Hossein Pishro-Nik
This book introduces students to probability, statistics, and stochastic processes. It can be used by both students and practitioners in engineering, various sciences, finance, and other related fields. It provides a clear and intuitive approach to these topics while maintaining mathematical accuracy. You can also find courses and videos online.
7. OpenIntro Statistics
Authors: David M. Diez and Christopher D. Barr
The OpenIntro project was founded in 2009 to improve the quality and availability of education by producing exceptional books and teaching tools that are free to use and easy to modify. Their inaugural effort is OpenIntro Statistics. Corresponding courses and videos are also available online.
8. Statistical Inference
Authors: George Casella and Roger L. Berger
It is used as a textbook for first-year graduate students at many colleges.
Discusses both theoretical statistics and the practical applications of the theoretical developments. Includes a large number of exercises covering both theory and applications.
9. Applied Linear Statistical Models
Applied Linear Statistical Models is the long-established leading authoritative text and reference on statistical modeling. The fifth edition provides an increased use of computing and graphical analysis throughout, without sacrificing concepts or rigor. In general, the 5e uses larger data sets in examples and exercises, and where methods can be automated within software without loss of understanding, it is so done.
10. An Introduction to Generalized Linear Models
Authors: Annette J. Dobson and Adrian G. Barnett
It provides a cohesive framework for statistical modeling, with an emphasis on numerical and graphical methods. This new edition of a bestseller has been updated with new sections on non-linear associations, strategies for model selection, and a Postface on good statistical practice.
Data Extractors for Scientists
1. Octoparse
Data Export Format: Excel, HTML, CSV, JSON, and Databases
Octoparse is a free web data extractor with comprehensive features that supports extracting almost all kinds of data from websites. It offers two modes, Wizard Mode and Advanced Mode, so that non-programmers can quickly get used to it.
Moreover, its Cloud Extraction lets you run the scraper in the cloud and save the data in the Octoparse cloud, which gives everyone access to scraping dynamic information in real time. In addition to its SaaS offering, Octoparse provides customization services for web scraper setup and data collection.
2. Mozenda
Mozenda is a cloud web scraping service (SaaS) with useful utility features for data extraction. Mozenda Web Console is a web-based application that allows you to run your Agents (scrape projects), view and organize your results, and export or publish the extracted data to cloud storage such as Dropbox, Amazon, and Microsoft Azure. Agent Builder is a Windows application used to build your data project.
3. Scraper
Scraper is a Chrome extension with limited data extraction features but it’s helpful for doing online research and exporting data to Google Spreadsheets. This tool is intended for beginners as well as experts who can easily copy data to the clipboard or store to the spreadsheets using OAuth. Scraper is a free web crawler tool, which works right in your browser and auto-generates smaller XPaths for defining URLs to crawl. It may not offer all-inclusive crawling services, but novices also needn’t tackle messy configurations.
4. Docparser
Starting Price: $25.00/month/user
Docparser allows you to extract specific data fields from PDFs and scanned documents, convert PDF to text, PDF to JSON, PDF to XML, convert PDF tables into CSV or Excel, etc.
5. VisualScraper
VisualScraper is another great free, non-coding web scraper with a simple point-and-click interface that can be used to collect data from the web. You can get real-time data from several web pages and export the extracted data as CSV, XML, JSON, or SQL files. Besides SaaS, VisualScraper offers web scraping services such as data delivery and the creation of custom software extractors.
6. Datahut
Starting Price: $2000/month
With no coding, no servers or expensive DIY software required, Datahut is a fully managed web data extraction service, which supports delivering ready-to-use data feeds from the web to help quickly build apps and conduct business analysis.
7. WebHarvy
WebHarvy Single User License: USD 129.00/year
WebHarvy is point-and-click web scraping software designed for non-programmers. It can automatically scrape text, images, URLs, and emails from websites and save the scraped content in various formats. It also provides a built-in scheduler and proxy support, which enable anonymous crawling and help prevent the scraper from being blocked by web servers; you have the option to access target websites via proxy servers or a VPN.
8. OutWit Hub
OutWit Hub is a Firefox add-on with dozens of data extraction features to simplify your web searches. This web crawler tool can browse through pages and store the extracted information in a proper format. OutWit Hub offers a single interface for scraping tiny or huge amounts of data per need. OutWit Hub lets you scrape any web page from the browser itself and even creates automatic agents to extract data and format it per settings.
9. Data Integration
Free Version: Yes
Talend Data Fabric is an integration platform that lets customers move seamlessly between batch, streaming, and real-time processing while running on-premises, in the cloud, or with big data. It can easily connect big data sources, cloud applications, and databases with a secure cloud integration platform-as-a-service (iPaaS).
10. Dexi.io
As a browser-based web crawler, Dexi.io allows you to scrape data from any website in your browser and provides three types of robots for creating a scraping task: Extractor, Crawler, and Pipes. The freeware provides anonymous web proxy servers for your web scraping, and your extracted data will be hosted on Dexi.io’s servers for two weeks before it is archived, or you can directly export the extracted data to JSON or CSV files. It offers paid services to meet your needs for getting real-time data.
Data Analytics Tools
1. WebFOCUS
by Information Builders
Information Builders WebFOCUS is the industry’s most flexible and pervasive BI and analytics platform, able to deliver a broad range of governed analytical tools, applications, reports, and documents to any and all business stakeholders.
2. Minitab 18
Starting Price: $1,495.00/one-time/user
Minitab is the leading statistical software used for quality improvement and statistics education worldwide.
3. Stata
Stata is the solution for your data science needs. Obtain and manipulate data. Explore. Visualize. Model. Make inferences. Collect your results in reproducible reports.
by SAS Institute
A statistical analysis system provides a wide range of statistical software, ranging from traditional analysis of variance to exact methods and dynamic data visualization techniques.
5. MicroStrategy Enterprise Analytics
Comprehensive enterprise analytics and mobile platform that delivers a full range of analytical and reporting capabilities.
6. CaseWare IDEA
by CaseWare International
CaseWare IDEA® is a comprehensive, powerful and easy-to-use data analysis tool that quickly analyzes 100% of your data, guarantees data integrity and speeds your analysis, paving the way to faster, more effective audits.
7. NVivo
by QSR International
More than just a tool for organizing and managing data, NVivo helps you think differently about your research, uncover more and back it all up with rigorous evidence.
8. ATLAS.ti
by Scientific Software Development
ATLAS.ti is a sophisticated tool to help you arrange, reassemble, and manage your material in creative, yet systematic ways.
9. QueryStorm
by Stormy Range Software
Free Version: Yes
QueryStorm is a development and data processing plugin for Excel. It offers SQL and C# support in Excel, making it much easier for tech people to interact with data in spreadsheets.
10. Toucan Toco
by Toucan Toco
Gartner’s Comment: Toucan’s a great company to work with. The tool is user-friendly, easy to install, easy to deploy, and does a great job at making data digestible. The team is helpful, professional and takes you through their agile methodology which allows us to push the project out quickly to put it in the hands of our collaborators.
Reporting Tools for Scientists
1. QlikView
QlikView combines ETL, data storage, multi-dimensional analysis and the end-user interface in the same package – so deployments are lightning-fast and ongoing maintenance is simple.
2. TapReports
TapReports is a cloud-based collaboration and reporting solution that allows businesses to manage communication with their clients and generate customizable marketing reports and interactive sales reports for their clients.
3. IBM Cognos Analytics
IBM Cognos Analytics is a cohesive performance management and business intelligence solution, with budgeting, strategic planning, forecasting, and consolidation.
4. Zoho Reports
Zoho Reports is a self-service business intelligence and analytics software that allows you to create insightful dashboards and data visualizations.
5. SAP Crystal Reports
by SAP Crystal Reports
With SAP Crystal Reports, you can create powerful, richly formatted, dynamic reports from virtually any data source, delivered in dozens of formats and in up to 24 languages.
6. Solver BI360
Solver specializes in providing world-class financial reporting, budgeting, and analysis with push-button access to all data sources that drive company-wide profitability. BI360 is available for cloud and on-premise deployment, focusing on reporting, budgeting, dashboards, and data warehouse.
7. Domo
Domo is a cloud-based business management suite that integrates with multiple data sources, including spreadsheets, databases, social media and any existing cloud-based or on-premise software solution.
8. Exchange Reporter Plus
Microsoft Exchange serves as the hub for all email communications in most corporate environments that use Active Directory technology.
9. Izenda Reports
Izenda is a business intelligence (BI) platform that enables real-time data exploration and report creation.
10. Grow BI Dashboard
Grow is a Cloud-based business analytics and reporting solution suitable for small to midsize organizations. The solution allows users to create customizable dashboards for monitoring business workflows and key activities.
12 Data Science Competitions/Programs
1. Kaggle
Kaggle, now a subsidiary of Alphabet, is a platform for predictive modeling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users.
2. CrowdAI
CrowdAI is an open-source platform of the École Polytechnique Fédérale de Lausanne in Switzerland, for hosting open data challenges and gaining insight into how the problems in question were solved.
3. CrowdANALYTIX
CrowdANALYTIX is a crowdsourcing platform where a global community of data scientists builds customized AI solutions. It is also an AI-driven platform for auto-creating context-aware product attributes and meta-tags for retail product catalogs.
4. Datascience.net
Datascience.net is the first French-speaking data science platform, launched in 2013 by a pool of data specialists. It bridges the gap between organizations having complex data-centric problems, and the best data scientists willing to solve them.
5. HackerRank
6. InnoCentive
InnoCentive is an open innovation and crowdsourcing company with its worldwide headquarters in Waltham, MA. They enable organizations to put their unsolved problems and unmet needs, which are framed as ‘Challenges’, out to the crowd to address.
7. Topcoder
Topcoder is a crowdsourcing company with an open global community of designers, developers, data scientists, and competitive programmers. Topcoder sells community services to corporate, mid-size, and small-business clients, and pays community members for their work on the projects. Topcoder also organizes the annual Topcoder Open Tournament and a series of smaller regional events. (Wikipedia)
8. HackerEarth
HackerEarth is a technology startup based in Bangalore, India, that provides recruitment solutions. Its clients include Adobe, Altimetrik, Citrix Systems, InMobi, Symantec, and Wipro. It has a competitive programming platform that supports over 32 programming languages (including C, C++, Python, Java, and Ruby). (Wikipedia)
9. Analytics Vidhya
10. DrivenData