
DataFrame With CSV File


Reading a CSV file and performing data manipulation is a common task in data analysis and machine learning. The following steps will help you get started:

servers_info.csv

server_name location os_version hd ram date
server1 New York 2016 100 16 2019-07-01
server2 London 2012 150 8 2019-07-02
server3 Paris 2010 120 32 2019-07-03
server4 Miami 2019 100 16 2019-07-04
server5 Liverpool 2016 300 6 2019-07-05
server6 London 2016 100 16 2019-07-01
server7 Amsterdum 2012 150 8 2019-07-02
server8 Munich 2010 120 32 2019-07-03
server9 Berlin 2019 100 16 2019-07-04
server10 New York 2016 300 6 2019-07-05

 

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. A CSV file is a type of text file used to store tabular data, where each row represents a record and each column represents a field.

In Python, the Pandas library provides a convenient way to read CSV files into DataFrames. Here's how you can create a DataFrame with a CSV file:

import pandas as pd
# Specify the file path
file_path = r'C:\servers_info.csv'
# Read the CSV file and create a data frame
df = pd.read_csv(file_path)
# Display the data frame
print(df)

Output:

  server_name   location  os_version   hd  ram        date
0     server1   New York        2016  100   16  2019-07-01
1     server2     London        2012  150    8  2019-07-02
2     server3      Paris        2010  120   32  2019-07-03
3     server4      Miami        2019  100   16  2019-07-04
4     server5  Liverpool        2016  300    6  2019-07-05
5     server6     London        2016  100   16  2019-07-01
6     server7  Amsterdum        2012  150    8  2019-07-02
7     server8     Munich        2010  120   32  2019-07-03
8     server9     Berlin        2019  100   16  2019-07-04
9    server10   New York        2016  300    6  2019-07-05

 

print(df.head())

  server_name   location  os_version   hd  ram        date
0     server1   New York        2016  100   16  2019-07-01
1     server2     London        2012  150    8  2019-07-02
2     server3      Paris        2010  120   32  2019-07-03
3     server4      Miami        2019  100   16  2019-07-04
4     server5  Liverpool        2016  300    6  2019-07-05

.head() is a pandas method used to view the first few rows of a DataFrame or Series. By default it displays the first 5 rows, but you can pass an integer argument to show a different number of rows. It is useful for quickly inspecting a DataFrame or Series and getting a sense of its structure and content.
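A quick sketch of passing that integer argument; the DataFrame here is a small illustrative stand-in for servers_info.csv so the snippet runs on its own:

```python
import pandas as pd

# Tiny stand-in for the servers_info data
df = pd.DataFrame({
    "server_name": ["server1", "server2", "server3", "server4"],
    "ram": [16, 8, 32, 16],
})

# Pass an integer to control how many rows are shown (default is 5)
first_two = df.head(2)
print(first_two)
```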

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   server_name  10 non-null     object
 1   location     10 non-null     object
 2   os_version   10 non-null     int64
 3   hd           10 non-null     int64
 4   ram          10 non-null     int64
 5   date         10 non-null     object
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes
 

.info() is a method in the pandas library in Python used to obtain a summary of a DataFrame or Series. It provides information about the data type of each column, the number of non-null values, and the memory usage of the DataFrame. This method is useful for understanding the structure and content of a dataset, as well as identifying missing or incorrect data. It can also be used to optimize memory usage by identifying opportunities to convert data types or remove unnecessary columns.

print(df.shape)

(10, 6)

.shape is an attribute in the pandas library in Python used to return the dimensions (number of rows and columns) of a DataFrame or Series as a tuple. For a DataFrame, the first element of the tuple represents the number of rows, while the second element represents the number of columns. For a Series, the tuple contains only one element, representing the length of the Series. This attribute is useful for quickly checking the size of a dataset and for performing operations that require knowledge of the dataset's dimensions, such as reshaping, concatenation, or indexing.

print(df.describe())
        os_version          hd        ram
count    10.000000   10.000000  10.000000
mean   2014.600000  154.000000  15.600000
std       3.373096   79.330532   9.651713
min    2010.000000  100.000000   6.000000
25%    2012.000000  100.000000   8.000000
50%    2016.000000  120.000000  16.000000
75%    2016.000000  150.000000  16.000000
max    2019.000000  300.000000  32.000000

.describe() is a method in the pandas library in Python used to generate descriptive statistics of a DataFrame or Series. It provides information about the central tendency, dispersion, and shape of the dataset's distribution, including count, mean, standard deviation, minimum, maximum, and quartiles. This method is useful for gaining insights into the data and identifying any outliers or anomalies. It can also be used to compare different datasets or subsets of data to understand how they differ in terms of their distribution and summary statistics.

print(df.values)
[['server1' 'New York' 2016 100 16 '2019-07-01']
 ['server2' 'London' 2012 150 8 '2019-07-02']
 ['server3' 'Paris' 2010 120 32 '2019-07-03']
 ['server4' 'Miami' 2019 100 16 '2019-07-04']
 ['server5' 'Liverpool' 2016 300 6 '2019-07-05']
 ['server6' 'London' 2016 100 16 '2019-07-01']
 ['server7' 'Amsterdum' 2012 150 8 '2019-07-02']
 ['server8' 'Munich' 2010 120 32 '2019-07-03']
 ['server9' 'Berlin' 2019 100 16 '2019-07-04']
 ['server10' 'New York' 2016 300 6 '2019-07-05']]

.values is an attribute in the pandas library in Python used to return a NumPy array of the values in a DataFrame or Series. It returns a 2D NumPy array for a DataFrame and a 1D NumPy array for a Series. This attribute is useful for converting a pandas DataFrame or Series to a NumPy array for compatibility with other libraries and functions that accept NumPy arrays. It can also be used to access the underlying data directly, without going through the pandas API. However, it is generally recommended to use pandas methods for data manipulation and analysis, as they provide a more convenient and efficient interface for working with structured data.

print(df.columns)
Index(['server_name', 'location', 'os_version', 'hd', 'ram', 'date'], dtype='object')

.columns is an attribute in the pandas library in Python used to return an Index object containing the column names of a DataFrame. It is useful for quickly inspecting the column names of a dataset and for selecting specific columns for further analysis. It can also be used to rename columns by assigning a new list of column names to the attribute. For example, if you have a DataFrame called df, you can rename its columns by assigning a list of new column names to df.columns.
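A short sketch of both renaming styles; the two-column frame and the new names are hypothetical:

```python
import pandas as pd

# Hypothetical two-column frame to demonstrate renaming
df = pd.DataFrame({"server_name": ["server1"], "hd": [100]})

# Rename every column at once by assigning a new list to .columns ...
df.columns = ["name", "disk_gb"]

# ... or rename selected columns with .rename()
df = df.rename(columns={"disk_gb": "hd_gb"})
print(df.columns.tolist())
```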

print(df.index)

RangeIndex(start=0, stop=10, step=1)

.index is an attribute in the pandas library in Python used to return the index labels of a DataFrame or Series. For a DataFrame, the index labels represent the row labels, while for a Series, the index labels represent the labels for each element in the Series. This attribute is useful for accessing and manipulating the index labels directly, such as selecting specific rows or reordering the rows based on the index labels. It can also be used to set a new index by assigning a new index object to the attribute, for example, by assigning a list of new index labels to df.index.
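A common way to get a meaningful index is to promote an existing column with .set_index(), after which rows can be looked up by label; the inline frame is an illustrative subset of the server data:

```python
import pandas as pd

df = pd.DataFrame({
    "server_name": ["server1", "server2"],
    "ram": [16, 8],
})

# Promote a column to the row index, then look rows up by label
df = df.set_index("server_name")
print(df.loc["server2", "ram"])
```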

print(df.sort_values("server_name"))

Using .sort_values() with .head():

print(df.sort_values("server_name").head())
  server_name  location  os_version   hd  ram        date
0     server1  New York        2016  100   16  2019-07-01
9    server10  New York        2016  300    6  2019-07-05
1     server2    London        2012  150    8  2019-07-02
2     server3     Paris        2010  120   32  2019-07-03
3     server4     Miami        2019  100   16  2019-07-04

.sort_values() is a method in the pandas library in Python used to sort a DataFrame or Series by one or more columns. It can be used to sort the data in ascending or descending order, and can handle missing values in various ways. By default, it sorts the DataFrame or Series in ascending order based on the values in the specified column(s). It can also sort based on multiple columns by passing a list of column names to the method. This method is useful for quickly reordering and prioritizing the data in a dataset based on specific criteria. It can also be used to identify patterns and relationships in the data by sorting on different columns and examining the resulting patterns.

Sorting in descending order

print(df.sort_values("server_name", ascending=False).head())
  server_name   location  os_version   hd  ram        date
8     server9     Berlin        2019  100   16  2019-07-04
7     server8     Munich        2010  120   32  2019-07-03
6     server7  Amsterdum        2012  150    8  2019-07-02
5     server6     London        2016  100   16  2019-07-01
4     server5  Liverpool        2016  300    6  2019-07-05

.sort_values(ascending=False) sorts a DataFrame or Series in descending order based on one or more columns. It works the same way as .sort_values(), except that the ascending=False parameter reverses the default ascending order. As before, you can sort on multiple columns by passing a list of column names. This is useful for quickly prioritizing the data based on specific criteria, for example to surface the largest values first, and for spotting patterns by sorting on different columns and examining the results.

Sorting by multiple variables

print(df.sort_values(["server_name", "location"]).head())

server_name  location  os_version   hd  ram        date
0     server1  New York        2016  100   16  2019-07-01
9    server10  New York        2016  300    6  2019-07-05
1     server2    London        2012  150    8  2019-07-02
2     server3     Paris        2010  120   32  2019-07-03
3     server4     Miami        2019  100   16  2019-07-04

.sort_values() with multiple variables is a method in the pandas library in Python used to sort a DataFrame or Series based on multiple columns. It can sort based on two or more columns by passing a list of column names to the method. The first column name in the list is the primary sort column, followed by the secondary, tertiary, and so on. If two or more rows have the same value in the primary sort column, then the method will sort those rows based on the secondary column, and so on. This method is useful for prioritizing and sorting the data in a dataset based on multiple criteria. It can be used to identify patterns and relationships in the data by sorting on different combinations of columns and examining the resulting patterns.

print(df.sort_values(["server_name", "location"], ascending=[True, False]).head())
  server_name  location  os_version   hd  ram        date
0     server1  New York        2016  100   16  2019-07-01
9    server10  New York        2016  300    6  2019-07-05
1     server2    London        2012  150    8  2019-07-02
2     server3     Paris        2010  120   32  2019-07-03
3     server4     Miami        2019  100   16  2019-07-04

Subsetting columns

print(df["server_name"])
0     server1
1     server2
2     server3
3     server4
4     server5

......

Subsetting columns is a technique in the pandas library in Python used to select and extract specific columns from a DataFrame or Series. It involves specifying the column name(s) or index location(s) of the desired column(s) within the square brackets of the DataFrame or Series object. For example, if you have a DataFrame called df with columns named "name", "age", and "gender", you can extract only the "name" and "age" columns by using the code df[['name', 'age']]. Alternatively, you can use the index location of the columns, such as df.iloc[:, 0:2], which would extract the first two columns of the DataFrame. This technique is useful for focusing on specific columns of interest and reducing the amount of data that needs to be processed, especially when working with large datasets.

Subsetting multiple columns

print(df[["server_name","location"]])
Subsetting rows

print(df["ram"] > 8)

0     True
1    False
2     True
3     True
4    False
5     True

The command df["ram"] > 8 is used to create a boolean mask in a pandas DataFrame called df based on a condition, where the condition is checking if the values in the "ram" column are greater than 8.

Here's a brief description of the command:

  • df["ram"] refers to the "ram" column in the DataFrame df.
  • > 8 is a comparison operator that checks if the values in the "ram" column are greater than 8. This will result in a boolean mask with True values where the condition is met and False values where the condition is not met.

The resulting output of the command will be a series of boolean values, with True values corresponding to rows where the value in the "ram" column is greater than 8, and False values where the value is less than or equal to 8. This boolean mask can be used to subset or filter the original DataFrame to only include the rows where the condition is met, such as df[df["ram"] > 8].
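Putting that mask to work looks like this; the inline frame is a three-row stand-in for the server data:

```python
import pandas as pd

df = pd.DataFrame({
    "server_name": ["server1", "server2", "server3"],
    "ram": [16, 8, 32],
})

# The mask keeps only the rows where the condition holds
high_ram = df[df["ram"] > 8]
print(high_ram)
```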

Subsetting based on text data

 

Subsetting based on text data is a technique used in pandas to extract specific rows from a DataFrame based on the content of a text column. This involves using boolean indexing to create a boolean mask that checks if a specific text pattern or substring exists in a column of text data.

For example, if you have a DataFrame with a column of text data called "Name", you can extract only the rows where the "Name" column contains the word "John" by using the code df[df["Name"].str.contains("John")]. The str.contains() method searches for the specified text pattern within the "Name" column and returns a boolean mask indicating which rows match the condition. The resulting boolean mask is then used to subset the original DataFrame to only include the rows where the condition is met.

This technique is useful when working with text data and can be used to filter and extract specific subsets of data based on patterns or keywords within the text. Other string methods, such as str.startswith(), str.endswith(), and str.match(), can also be used for more specific string operations.
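A brief sketch of two of these string methods against the location column; the inline frame is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "server_name": ["server1", "server2", "server5", "server3"],
    "location": ["New York", "London", "Liverpool", "Paris"],
})

# Keep rows whose location starts with "L", or contains "New"
starts_with_l = df[df["location"].str.startswith("L")]
contains_new = df[df["location"].str.contains("New")]
print(starts_with_l)
print(contains_new)
```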

print(df[df["location"] == "New York"])

  server_name  location  os_version   hd  ram        date
0     server1  New York        2016  100   16  2019-07-01
9    server10  New York        2016  300    6  2019-07-05

The command df[df["location"] == "New York"] is used to subset a pandas DataFrame called df based on a condition, where the condition is checking if the values in the "location" column are equal to "New York".

Here's a brief description of the command:

  • df["location"] refers to the "location" column in the DataFrame df.
  • == "New York" is a comparison operator that checks if the values in the "location" column are equal to "New York". This will result in a boolean mask with True values where the condition is met and False values where the condition is not met.

The boolean mask is then used to subset the original DataFrame by using it inside square brackets, resulting in a new DataFrame that only includes the rows where the "location" column is equal to "New York". This command can be used to filter and focus on specific subsets of data within the original DataFrame.

Subsetting based on dates

Subsetting based on dates is a technique used in pandas to extract specific rows from a DataFrame based on date values in a column. This involves using boolean indexing to create a boolean mask that checks if a date falls within a certain range or satisfies a specific condition.

For example, if you have a DataFrame with a column of date data called "Date", you can extract only the rows where the "Date" column falls within a specific date range by using the code df[(df["Date"] >= start_date) & (df["Date"] <= end_date)]. Here, start_date and end_date are variables representing the start and end dates of the desired date range. The & operator is used to combine the two boolean conditions, and the resulting boolean mask is then used to subset the original DataFrame to only include the rows where the condition is met.

This technique is useful when working with time series data and can be used to filter and extract specific subsets of data based on date ranges or other date-related conditions. Other date/time functions, such as pd.to_datetime(), can also be used to convert strings or other data types to date values for use in date-based operations.
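Combining pd.to_datetime() with a range filter, sketched on an inline subset of the server data (the start_date/end_date values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "server_name": ["server1", "server2", "server4"],
    "date": ["2019-07-01", "2019-07-02", "2019-07-04"],
})

# Convert the text column to real datetimes, then filter on a range
df["date"] = pd.to_datetime(df["date"])
start_date, end_date = "2019-07-01", "2019-07-03"
in_range = df[(df["date"] >= start_date) & (df["date"] <= end_date)]
print(in_range)
```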

print(df[df["date"] < "2019-07-04"])

  server_name   location  os_version   hd  ram        date
0     server1   New York        2016  100   16  2019-07-01
1     server2     London        2012  150    8  2019-07-02
2     server3      Paris        2010  120   32  2019-07-03
5     server6     London        2016  100   16  2019-07-01
6     server7  Amsterdum        2012  150    8  2019-07-02
7     server8     Munich        2010  120   32  2019-07-03

Subsetting based on multiple conditions

Subsetting based on multiple conditions is a technique used in pandas to extract specific rows from a DataFrame that satisfy multiple criteria. This involves using boolean indexing to create a boolean mask that checks if a row satisfies two or more conditions.

For example, if you have a DataFrame with columns called "Gender" and "Age", you can extract only the rows where "Gender" is "Female" and "Age" is greater than or equal to 30 by using the code df[(df["Gender"] == "Female") & (df["Age"] >= 30)]. Here, the & operator is used to combine the two boolean conditions, and the resulting boolean mask is then used to subset the original DataFrame to only include the rows where both conditions are met.

This technique can be used to filter and extract specific subsets of data based on multiple criteria, which can be especially useful in large datasets where only a small subset of the data is of interest. Other logical operators, such as | for "or" conditions and ~ for negation, can also be used to create more complex boolean expressions for subsetting data.
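The | and ~ operators mentioned above can be sketched on the server data (inline illustrative subset):

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["New York", "London", "Paris"],
    "ram": [16, 8, 32],
})

# | keeps rows matching either condition; ~ negates a mask
ny_or_big = df[(df["location"] == "New York") | (df["ram"] >= 32)]
not_london = df[~(df["location"] == "London")]
print(ny_or_big)
print(not_london)
```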

is_name = df["server_name"] == "server1"

is_loc = df["location"] == "New York"

print(df[is_name & is_loc])

or

print(df[(df["server_name"] == "server1") & (df["location"] == "New York")])
  server_name  location  os_version   hd  ram        date
0     server1  New York        2016  100   16  2019-07-01

Subsetting using .isin()

Subsetting using .isin() is a technique used in pandas to extract specific rows from a DataFrame that match a set of values in a particular column. This involves using the .isin() method, which checks whether each element in a column is contained in a set of specified values and returns a boolean mask.

For example, if you have a DataFrame with a column of categorical data called "Fruit", you can extract only the rows where the "Fruit" column contains the values "Apple" or "Banana" by using the code df[df["Fruit"].isin(["Apple", "Banana"])]. The .isin() method checks whether each element in the "Fruit" column is contained in the set of specified values, which is ["Apple", "Banana"] in this case, and returns a boolean mask indicating which rows match the condition. The resulting boolean mask is then used to subset the original DataFrame to only include the rows where the condition is met.

This technique is useful when working with categorical data and can be used to filter and extract specific subsets of data based on a specific set of values in a column. Other similar methods, such as .str.contains() for text data and .between() for numerical data, can also be used to extract data based on more specific conditions.

loc = (df["location"].isin(["New York", "London"]))

print(df[loc])

  server_name  location  os_version   hd  ram        date
0     server1  New York        2016  100   16  2019-07-01
1     server2    London        2012  150    8  2019-07-02
5     server6    London        2016  100   16  2019-07-01
9    server10  New York        2016  300    6  2019-07-05
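The .str.contains() and .between() methods mentioned earlier follow the same boolean-mask pattern; here is a sketch, again assuming a small subset of the same server data:

```python
import pandas as pd

# Hypothetical subset of the server data used above
df = pd.DataFrame({
    "server_name": ["server1", "server2", "server10"],
    "location": ["New York", "London", "New York"],
    "ram": [16, 8, 6],
})

# Text condition: server names containing the substring "1"
has_one = df[df["server_name"].str.contains("1")]

# Numeric condition: RAM between 8 and 16 GB (inclusive on both ends)
mid_ram = df[df["ram"].between(8, 16)]

print(has_one["server_name"].tolist())  # ['server1', 'server10']
print(mid_ram["server_name"].tolist())  # ['server1', 'server2']
```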

New columns

Adding a column is as simple as assigning to a new column label; the expression on the right is evaluated element-wise for every row. Here, each server's "ram" value is divided by 100 and stored in a new "avg_new" column.

df["avg_new"] = df["ram"]/100

print(df)
  server_name   location  os_version   hd  ram        date  avg_new
0     server1   New York        2016  100   16  2019-07-01     0.16
1     server2     London        2012  150    8  2019-07-02     0.08
2     server3      Paris        2010  120   32  2019-07-03     0.32
3     server4      Miami        2019  100   16  2019-07-04     0.16
4     server5  Liverpool        2016  300    6  2019-07-05     0.06
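New columns can also be derived from comparisons or from several existing columns at once. A sketch, using illustrative column names on the same kind of data:

```python
import pandas as pd

# Hypothetical subset of the server data used above
df = pd.DataFrame({
    "server_name": ["server1", "server2", "server3"],
    "os_version": [2016, 2012, 2010],
    "hd": [100, 150, 120],
    "ram": [16, 8, 32],
})

# Boolean column derived from a comparison (column name is illustrative)
df["modern_os"] = df["os_version"] >= 2016

# Column combining two existing columns element-wise
df["ram_per_hd"] = df["ram"] / df["hd"]

print(df["modern_os"].tolist())  # [True, False, False]
```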

Multiple manipulations

"Multiple manipulations" refers to combining several data manipulation techniques in pandas to extract, transform, and analyze specific subsets of a larger dataset.

This often involves chaining multiple methods and functions together in a specific order to achieve the desired output. For example, you can use techniques such as filtering, grouping, aggregating, and sorting to extract specific subsets of data based on multiple conditions, and then transform or analyze the data using mathematical or statistical functions.

Multiple manipulations can be particularly useful when working with large datasets and complex data structures, as it allows you to extract and analyze specific subsets of data based on multiple criteria, and then transform or summarize the data to gain insights and make informed decisions. Pandas offers a wide range of built-in functions and methods to perform various data manipulations, and the ability to chain them together makes it a powerful tool for data analysis and manipulation.

i_ram = df[df["ram"] < 16]

i_loc = i_ram.sort_values("location", ascending=False)

print(i_loc[["server_name","location","ram"]])
  server_name   location  ram
9    server10   New York    6
1     server2     London    8
4     server5  Liverpool    6
6     server7  Amsterdum    8

The given code first creates a subset of the original DataFrame df called i_ram by using boolean indexing to extract only the rows where the "ram" column has a value less than 16.

The next line of code then creates a new DataFrame called i_loc by sorting the rows of i_ram in descending order based on the values in the "location" column using the .sort_values() method. Here, the ascending=False parameter is passed to sort the rows in descending order of the "location" column.

Overall, this code filters the original DataFrame based on a specific condition and then sorts the resulting subset based on a specific column to create a new DataFrame with only the desired rows and order.
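The same filter-then-sort-then-select steps can also be written as a single chained expression, a common pandas idiom. A sketch, assuming a small DataFrame with the same rows that appear in the output above:

```python
import pandas as pd

# Hypothetical subset matching the rows printed above
# ("Amsterdum" is spelled as in the source data)
df = pd.DataFrame({
    "server_name": ["server2", "server5", "server7", "server10"],
    "location": ["London", "Liverpool", "Amsterdum", "New York"],
    "ram": [8, 6, 8, 6],
})

# Filter, sort, and select columns in one chain
result = (
    df[df["ram"] < 16]
    .sort_values("location", ascending=False)
    [["server_name", "location", "ram"]]
)

print(result)
```

Chaining avoids naming intermediate DataFrames like i_ram and i_loc, at the cost of making each step slightly harder to inspect while debugging.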
