Setting Up Python for Aspiring Data Scientists

Whenever data science is mentioned, most people picture spreadsheets, charts and graphs. That’s not a wrong notion, as statistics and visualization are fundamental to the craft. However, most people don’t recognize that a data scientist’s work isn’t limited to analytics: an even bigger part of the craft is computer programming.

You see, modern data science is all about using machines to crunch massive amounts of numerical information that would overwhelm even the brightest of humans. While our brains are great at critical thinking and solution design, computers are far superior to us when it comes to performing complex and iterative mathematical operations. Math problems that take the average person minutes or hours to solve correctly take a computer a fraction of a second.

Of course, computers won’t do professional-grade data science, analytics and machine learning operations out of the box. For that, you’ll need programming know-how to get the computer to do what you need it to.

Enter Python, the programming language of choice for most data scientists. Created by Guido van Rossum and first released in 1991, it supports object-oriented programming alongside other styles, which makes it great for modeling real-world problems while staying relatively easy to learn even for coding newcomers. In this post, we’ll cover the nature of this programming language and how you can get yourself ready to learn it.

What is Python?

Python is a high-level, interpreted programming language known for its clear syntax and readability. It supports multiple programming paradigms and comes with a comprehensive standard library. Python’s simplicity allows beginners to pick it up quickly, yet its vast array of libraries and frameworks makes it robust enough for complex applications.

Due to its versatility and simplicity, the language has far-reaching applications in data science, machine learning, software development, web development, and game development, just to name a few. It’s used by some of the world’s best-known brands such as Dropbox, Instagram, Firaxis and many more.

Why is Python the Preferred Data Science Language?

Python isn’t the only programming language used in data science. Languages like R and SQL are completely viable alternatives. However, today’s data scientists predominantly use Python as their first language due to the following:

  • Ease of Learning and Use. Python’s syntax is clean and intuitive, closely resembling natural language. This lowers the barrier to entry for individuals new to programming, allowing them to focus on solving data science problems rather than on complex programming syntax. For seasoned programmers, Python’s simplicity translates to faster development times and less room for errors.
  • Comprehensive Libraries for Data Science. Python’s ecosystem is rich with libraries specifically designed for data science. NumPy provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on them. Pandas offers data structures and operations for manipulating numerical tables and time series, making data cleaning, analysis, and preprocessing more efficient. Scikit-learn provides simple and efficient tools for data mining and predictive data analysis.
  • Visualization Libraries. Visualization is a key part of data science, and Python provides many libraries for this purpose. Matplotlib is the most widely used Python library for 2D graphics. It can generate plots, histograms, power spectra, bar charts, error charts, and scatterplots, to name just a few. Other libraries like Seaborn build on Matplotlib and enable the creation of more attractive and informative statistical graphics.
  • Community and Support. Python benefits from a strong, collaborative community that contributes to its vast selection of libraries and frameworks. This community fosters an environment of support through forums, social media, and numerous conferences worldwide. For data scientists, this means having access to the latest tools, methodologies, and best practices, as well as troubleshooting help and advice from peers and experts in the field.
  • Flexibility and Scalability. Python is used not only for small-scale projects but also in large, complex systems. It allows data scientists to start with a simple prototype and scale up to handle larger data sets or more complex analysis without needing to switch to another language.
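
To give a feel for the readable syntax described above, here is a minimal sketch that uses only Python’s standard library. The temperature readings are made-up illustration data, not taken from any real dataset.

```python
from statistics import mean, median

# Hypothetical daily temperature readings in Celsius (illustration data only).
readings = [21.5, 23.0, 22.0, 25.5, 24.0]

# A list comprehension reads almost like plain English:
# "keep every reading that is above the average".
average = mean(readings)
above_average = [r for r in readings if r > average]

print(f"Average: {average:.1f}")
print(f"Median: {median(readings):.1f}")
print(f"Readings above average: {above_average}")
```

Libraries such as NumPy and pandas follow the same style while adding faster and richer tools for larger datasets.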

Anaconda Explained

Anaconda is a distribution of Python and R specifically aimed at data science and machine learning. It can be downloaded and installed like a regular desktop application on Windows, macOS and Linux, though it’s not an end-user app in the traditional sense. It’s more of a launcher for the Python language and its multitude of packages.

Pros of using Anaconda

The advantages of using Anaconda for package management and deployment in data science are multifaceted:

  • Simplified Package Management. One of the key benefits of using Anaconda is the ease with which you can manage packages. Conda, the package manager that comes with Anaconda, helps handle library dependencies efficiently. This means that when you install a data science package that requires other specific versions of libraries, Conda automatically resolves and installs the correct versions, saving you from the painstaking task of matching dependencies manually.
  • Pre-built Packages for Data Science. Anaconda’s repository hosts over 1,500 pre-built data science packages. This extensive suite includes not just the popular ones like NumPy, pandas, and scikit-learn, but also more specialized tools that data scientists might need. With these pre-built packages, you can bypass the often complex compilation process required for many scientific packages, especially on Windows.
  • Environment Management. Conda also serves as an environment manager, which allows you to create isolated environments for different projects. Each environment can have its own set of packages and package versions, which means you can work on multiple projects with differing requirements simultaneously without conflicts. This is particularly useful in data science, where one project may be running on Python 2.7 with an older version of NumPy, while another requires Python 3.8 with the latest version of pandas.
  • Ease of Deployment. Anaconda simplifies the deployment of data science applications. Whether you’re deploying to a production server or sharing with a colleague, Conda environments can be exported and replicated. This makes it far more likely that your code will run as expected on another machine, which goes a long way toward eliminating the “it works on my machine” problem.
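
The environment workflow described above can be sketched as follows. Python is used here only to assemble the Conda commands as lists of arguments; nothing is executed. The helper function names, the environment name `legacy-project`, the file name `environment.yml` and the pinned versions are all hypothetical and chosen for illustration.

```python
import shlex


def create_env_command(env_name, python_version, packages):
    """Build a `conda create` command for an isolated environment."""
    return ["conda", "create", "--name", env_name,
            f"python={python_version}", *packages]


def export_env_command(env_name):
    """Build a `conda env export` command; its output is usually saved to a YAML file."""
    return ["conda", "env", "export", "--name", env_name]


def recreate_env_command(yaml_path):
    """Build a `conda env create` command that rebuilds an environment from a YAML file."""
    return ["conda", "env", "create", "--file", yaml_path]


# Hypothetical project name and versions, for illustration only.
commands = [
    create_env_command("legacy-project", "3.8", ["numpy", "pandas"]),
    export_env_command("legacy-project"),
    recreate_env_command("environment.yml"),
]

for command in commands:
    # shlex.join shows each command as you would type it in a terminal.
    print(shlex.join(command))
```

The printed lines can be typed into a terminal (or the Anaconda Prompt on Windows) once Anaconda is installed. Building the commands as lists keeps each argument separate, which is also the form `subprocess.run` expects if you later decide to run them from a script.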

Cons of Using Anaconda

There aren’t many drawbacks to using Anaconda, but some users have run into the following challenges:

  • Limited Performance on Lower-End Computers. Anaconda’s ability to handle machine learning and big data operations, while one of its strengths, can also be a drawback when used on lower-end computers. These operations often require substantial computational power for processing large datasets and performing complex calculations, tasks that can be resource-intensive.
  • Bulky Installation. Anaconda’s comprehensive nature means its default installation includes a large number of data science libraries and tools, many of which may be unnecessary for users with specific needs or those just starting out. This can lead to a substantial footprint on the disk, making it less ideal for environments with limited storage space.
  • Complexity for New Users. The abundance of tools and packages that come with Anaconda can be daunting to beginners who are not yet familiar with Python’s ecosystem. Learning to navigate Conda environments and understanding which packages are necessary for a given project can add to the learning curve.

Downloading Anaconda

Downloading Anaconda is a straightforward process, but it’s important to select the right version for your needs and operating system. Here’s a more detailed guide:

  1. Visit the Official Anaconda Website. Start by navigating to the official Anaconda website. This site is the primary source for downloading the Anaconda distribution and ensures that you are getting the legitimate, most recent version.
  2. Choose the Right Version. On the Anaconda download page, you’ll find versions of Anaconda for Windows, macOS, and Linux. It’s crucial to choose the version that corresponds to your operating system.
  3. Select Python Version. Anaconda typically offers installers for different Python versions. You might see options for Python 3.x and sometimes for Python 2.x (though the latter is increasingly rare and not recommended, since Python 2 reached end of life in January 2020). Most users should opt for the latest Python 3.x version to ensure they have the most recent features and security updates.
  4. Consider Your System Architecture. Download the installer that matches your system’s architecture: 32-bit or 64-bit. Most modern computers are 64-bit, and this is the recommended version as it can handle larger amounts of memory more efficiently. If you’re unsure about your system type, you can check this in your system’s settings or properties.
  5. Download the Installer. Click on the appropriate installer to begin the download. The file size can be quite large (several hundred megabytes), so the download might take some time depending on your internet connection.
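
If you already have any Python interpreter on the machine, the short sketch below reports the operating system and architecture details that steps 2 and 4 ask about. Keep in mind that the bit count describes the interpreter that runs the script, which can be 32-bit even on a 64-bit operating system, so treat it as a hint and confirm in your system settings.

```python
import platform
import struct

# Size of a pointer in bits for the running interpreter: 32 or 64.
interpreter_bits = struct.calcsize("P") * 8

print(f"Operating system: {platform.system()}")
print(f"Machine type:     {platform.machine()}")
print(f"Interpreter:      {interpreter_bits}-bit Python {platform.python_version()}")
```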

Installing Anaconda on Windows (PC)

Installing Anaconda on a Windows PC is a very straightforward process. Simply follow these steps and you should be good to go in no time:

  1. Locate the Installer. After downloading the Anaconda installer for Windows, locate the downloaded file. It should be a .exe file, typically in your Downloads folder.
  2. Run the Installer. Double-click the installer to launch it. You might need administrative privileges to complete the installation.
  3. Follow the Setup Wizard. The Anaconda installer will open a setup wizard. Follow the instructions, read the license agreement, and agree to proceed.
  4. Choose Install Location. Select where you want to install Anaconda. The default location is usually fine, but you can change it if needed.
  5. Complete the Installation. Proceed with the installation. Once completed, the installer may offer optional extras such as Microsoft Visual Studio Code, depending on the installer version.
  6. Finish and Test Installation. After the installation completes, you can open the Anaconda Navigator from the Start Menu to manage packages and environments graphically.

Installing Anaconda on macOS (Mac)

As some of you may already know, installing applications on Mac is similar to how you would do it on a Windows machine, but with some minor twists:

  1. Locate the Installer. Find the Anaconda installer you downloaded for macOS. This will be a .pkg file in your Downloads folder.
  2. Run the Installer. Double-click the .pkg file to start the installation process. This will open an installation wizard.
  3. Follow the Installation Wizard. The wizard will guide you through the installation process. Click “Continue” to read through the information about Anaconda and agree to the license.
  4. Select the Install Location. You will be prompted to choose an install location for Anaconda. The default location is usually in your home directory, which is recommended for ease of use.
  5. Complete the Installation. Click “Install” to complete the installation. You may need to enter your administrator password to proceed.
  6. Finish and Test Installation. Once the installation is complete, you can find the Anaconda Navigator in your Applications folder. You can launch it to manage packages and environments using a graphical interface.
  7. Launching Jupyter Notebook. Jupyter Notebook can be launched from the Anaconda Navigator or by typing jupyter notebook in your command line or terminal.
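
On either Windows or macOS, one way to test the installation is to run a short script with the newly installed Python, for example from a Jupyter Notebook cell. The sketch below is only an illustration: the package list is an assumption based on the libraries mentioned in this post (`sklearn` is the import name of scikit-learn), and the helper `is_installed` is a hypothetical name.

```python
import importlib.util
import shutil
import sys

# Packages this article mentions as part of the data science toolkit.
EXPECTED_PACKAGES = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]


def is_installed(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


print(f"Python {sys.version.split()[0]} at {sys.executable}")

status = {name: is_installed(name) for name in EXPECTED_PACKAGES}
for name, found in status.items():
    print(f"{name:<12} {'found' if found else 'missing'}")

# shutil.which returns the path of the `jupyter` command, or None if it is not on PATH.
print(f"jupyter command: {shutil.which('jupyter') or 'not found on PATH'}")
```

If the interpreter path printed on the first line does not point into your Anaconda folder, the script is probably running under a different Python installation.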

Google Colab Explained

Google Colab is a cloud-based platform that facilitates Python scripting and offers free GPU and TPU access, making it ideal for machine learning and data science tasks without the need for local setup. While it excels in accessibility and collaboration, offering a range of pre-installed libraries and integration with Google Drive, it does require a stable internet connection and poses potential concerns with data privacy. Additionally, users face session time limits, which can be restrictive for longer-running processes, making it less suitable for certain extensive computational tasks.

Pros of Using Google Colab

Google Colab offers several significant advantages, particularly in the contexts of ease of use, computational power, and collaboration:

  • Zero Configuration Setup. One of the most appealing aspects of Google Colab is that it requires no setup. Users don’t have to go through the often complex and time-consuming process of configuring a Python environment. This is especially beneficial for beginners or those who want to quickly test out or prototype their Python scripts without worrying about their local environment setup.
  • Access to Powerful Computing Resources. Colab provides free access, subject to availability and usage limits, to high-end hardware like GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units). These resources are crucial for tasks like training machine learning models or processing large datasets, which can be computationally intensive and time-consuming on standard CPUs. This access democratizes the ability to perform advanced computations, enabling users to work on complex machine learning projects without needing expensive personal hardware.
  • Facilitates Collaboration and Sharing. Colab is designed for collaboration. Similar to Google Docs, it allows multiple users to work on the same notebook simultaneously, making it a highly efficient tool for team projects and educational purposes. Sharing work is as simple as sharing a link, and collaborators can add comments and suggestions in real-time.
  • Integration with Other Google Services. Colab seamlessly integrates with Google Drive, making storing, sharing, and accessing notebooks and data convenient. This integration simplifies the data management process, as users can easily import data from and export data to their Drive.
  • Pre-installed Libraries and Tools. Colab comes with a wide range of pre-installed libraries commonly used in data science and machine learning, reducing the need to manage dependencies. This feature allows users to jump straight into their data analysis without the preliminary hurdle of setting up their coding environment.

Overall, Google Colab is great for users who want to work on Python projects with other people. Having common access to Jupyter Notebook files online eliminates the need to pass files around each time someone updates them. Colab also makes sense for users who want to run computing-intensive processes in Python but aren’t confident about their local computer’s hardware capabilities.

Cons of Using Google Colab

The cons of using Google Colab are primarily linked to its nature as a cloud-based platform and the implications that come with it:

  • Requirement of Internet Connectivity: Being entirely cloud-based, Google Colab necessitates a stable and continuous internet connection for use. This dependency can be a significant limitation for users in areas with unreliable internet access or for those who need to work offline. It restricts the ability to work on-the-go, especially in situations where internet access is limited or non-existent.
  • Data Privacy and Security Concerns: Since Colab integrates closely with Google Drive for storing notebooks and data, there are potential concerns regarding data privacy and security. Users working with sensitive or proprietary data might be hesitant to upload such information to a cloud server. The reliance on Google’s infrastructure raises questions about data ownership and access, which are crucial factors for businesses or researchers handling confidential data.
  • Dependence on Google’s Policies and Infrastructure: Using Colab means adhering to Google’s terms of service and any changes they might implement in the future. This includes potential modifications to privacy policies, usage limits, or the availability of resources. Users are essentially entrusting a major corporation with their work, which may not always align with their personal or organizational preferences or ethics.
  • Limited Customization and Control: Compared to a local development environment, Colab offers limited options for customization. Users have less control over the environment, such as the versions of pre-installed libraries or the underlying operating system. This can lead to compatibility issues or challenges in replicating specific setups required for certain projects.

These drawbacks highlight that while Google Colab is a powerful tool for data science and machine learning, especially for those without access to high-end computational resources, it may not suit every user or scenario, particularly where internet connectivity, data privacy, and the need for a highly customizable environment are key considerations.

How to Get on Google Colab

Accessing Google Colab is a hassle-free process that involves just a few simple steps:

  1. Navigate to the Google Colab Website: Go to the Google Colab website by typing colab.research.google.com in your browser’s address bar. This will direct you to the Colab homepage.
  2. Log In with a Google Account: To use Google Colab, you need a Google account. If you’re not already logged in, the website will prompt you to log in with your Google credentials. If you don’t have a Google account, you will need to create one. This is the same account you would use for other Google services like Gmail or Google Drive.
  3. Start a New Notebook: Once logged in, you’ll be presented with the Colab welcome screen. From here, you can start a new notebook by clicking on ‘New Notebook’. This will open a new tab with a fresh notebook, ready for you to write and execute Python code.
  4. Interface Overview: The Google Colab interface is similar to Jupyter Notebooks. It consists of a series of cells, which can be either code cells or text cells. Code cells are for writing and running Python code, while text cells (using Markdown) are for adding notes, documentation, or explanations.
  5. Access Existing Notebooks: If you have existing notebooks on Google Drive, you can open them directly in Colab. You can also upload notebooks from your computer or connect to GitHub to access notebooks stored in a repository.
  6. Save and Share: Your notebooks are automatically saved to your Google Drive in a folder named ‘Colab Notebooks’. You can share your Colab notebooks with others just like you would with a Google Doc.
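
Once a notebook is open, a first code cell along these lines can confirm that you are running on Colab and connect your Google Drive. This is a sketch under assumptions: the check relies on Colab loading the `google.colab` module at startup, which is a widely used convention rather than something this post can guarantee, and `running_in_colab` is a hypothetical helper name. The mount point `/content/drive` is the path commonly used in Colab examples.

```python
import sys


def running_in_colab():
    """Return True when the notebook runs on Colab, which preloads `google.colab`."""
    return "google.colab" in sys.modules


if running_in_colab():
    # Only available on Colab; prompts for permission to access your Drive.
    from google.colab import drive

    drive.mount("/content/drive")
    print("Drive mounted at /content/drive")
else:
    print("Not running in Colab; skipping the Drive mount.")
```

Guarding the import this way lets the same notebook run unchanged in a local Jupyter installation, where the `google.colab` module does not exist.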

Overall, the Colab experience should feel very familiar if you’ve been using Google Drive or Workspace for a while now.

Next Steps

And there you have it. Getting a Python environment up and running is a straightforward process whether you install Anaconda on your computer or use Colab in your browser. The next step is to start exploring the interface of your preferred platform so you can get familiar with it. In the next posts, we’ll cover basic operations that will become your bread and butter when working with data science projects.

About Glen Dimaandal

Glen Dimaandal
Glen Dimaandal is a data scientist from the Philippines. He has a post-graduate degree in Data Science and Business Analytics from the McCombs School of Business at the University of Texas at Austin. He has nearly 20 years of experience in the field, having worked with major brands from the US, UK, Australia and the Asia-Pacific. Glen is also the CEO of SearchWorks.PH, an SEO agency in the Philippines.