As you go further with Pandas, you will be working with DataFrames very often, so it is crucial to understand what a DataFrame is. You have probably seen one already, unless you skipped the tutorial on importing pandas earlier in this series. Let’s understand what a DataFrame is now.
What is a DataFrame?
A DataFrame is a 2-dimensional, mutable, tabular data structure that is very widely used in pandas. Simply speaking, you can think of it as something like an Excel table. Its columns can hold different data types. Now, let’s create a DataFrame in a Python program. Here is a sample program that builds one from a dictionary.
import pandas as pd
placement_data = {
    "Branch": ["CSE", "EEE", "IT", "Mechanical"],
    "Number of students placed": [25, 16, 34, 32]
}
placement_df = pd.DataFrame(placement_data)
print(placement_df)
If you run this, the output is a DataFrame that looks like an Excel table, with rows and columns. The numbers on the left (0, 1, 2, 3) are the default row labels. We will do much more with DataFrames later on. Doing too much right now could be overwhelming, so we will build up the operations step by step.
Let’s look at some of the arguments that DataFrame() accepts. Here are some of them –
- data – The data itself. It may be an ndarray, Series, dictionary, list, another DataFrame, and more.
- index – The row labels.
- columns – The column labels.
- dtype – A single data type to force on the data. If it is omitted, pandas infers a type for each column.
These are the arguments you will most often see when a DataFrame is created. As the data argument, you may pass –
- List
- Series
- Dictionary
- DataFrame
- NumPy array
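To make the list above concrete, here is a small sketch that builds a DataFrame from several of these input types. The variable names, the year labels, and the numbers are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary of lists: the keys become the column labels.
from_dict = pd.DataFrame({"Branch": ["CSE", "IT"], "Placed": [25, 34]})

# From a list of lists: each inner list becomes one row.
from_list = pd.DataFrame([["CSE", 25], ["IT", 34]], columns=["Branch", "Placed"])

# From a Series: the name of the Series becomes the column label.
from_series = pd.DataFrame(pd.Series([25, 34], name="Placed"))

# From a NumPy array, with explicit row labels, column labels, and dtype.
from_array = pd.DataFrame(
    np.array([[25, 16], [34, 32]]),
    index=["2023", "2024"],      # hypothetical row labels
    columns=["CSE", "EEE"],
    dtype="float64",             # force every value to float
)

print(from_dict)
print(from_list)
print(from_series)
print(from_array)
print(from_array.dtypes)
```

Notice that the dictionary and the list of lists end up describing the same table; they are just two ways of writing it down.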
Now, let’s create another DataFrame –
import pandas as pd
person_data = [["John", 40000], ["Jack", 30000], ["Jill", 25000], ["Jane", 50000], ["Jim", 50000]]
person_df = pd.DataFrame(person_data, columns=["Name", "Salary"])
print(person_df)
As you can see, this time we passed a list that contains more lists, and each inner list becomes one row. This way, we were able to make a DataFrame with 2 columns. The columns argument supplies the column labels, so the column headings are Name and Salary respectively. Try removing the columns argument, and see the output (you will spot the difference).
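If you want to check your answer to that exercise, the following sketch compares the column labels with and without the columns argument. It uses a shortened version of the list, and the variable names unlabeled_df and labeled_df are only illustrative.

```python
import pandas as pd

person_data = [["John", 40000], ["Jack", 30000]]

# Without the columns argument, pandas numbers the columns 0, 1, ...
unlabeled_df = pd.DataFrame(person_data)
print(unlabeled_df.columns.tolist())  # [0, 1]

# With the columns argument, our own labels are used instead.
labeled_df = pd.DataFrame(person_data, columns=["Name", "Salary"])
print(labeled_df.columns.tolist())  # ['Name', 'Salary']
```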
What if we need to add another column to the DataFrame? Well, that is also very simple, so now let’s have a quick look at that –
Adding a column to a DataFrame
Continuing the previous program where we created the person_df, let’s try to add a new column with the name “Department”, which should contain the department names of the respective employees. Here is the complete program –
import pandas as pd
person_data = [["John", 40000], ["Jack", 30000], ["Jill", 25000], ["Jane", 50000], ["Jim", 50000]]
person_df = pd.DataFrame(person_data, columns=["Name", "Salary"])
# Let's add a new column -
person_df["Department"] = ["IT", "Sales", "Services", "IT", "HR"]
print(person_df)
If you print the DataFrame now, you will find the new Department column added to it. Keep in mind that the list you assign must have exactly one value for each row. You can also explore how to create DataFrames from other kinds of data.
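Assigning a list is not the only way to add a column. The sketch below shows a few variations on a shortened person_df. The column names “Country” and “Revised salary” and the values in them are hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

person_df = pd.DataFrame(
    [["John", 40000], ["Jack", 30000], ["Jill", 25000]],
    columns=["Name", "Salary"],
)

# A single value is repeated for every row.
person_df["Country"] = "India"

# A new column can be computed from an existing one.
person_df["Revised salary"] = person_df["Salary"] + 5000

# A list with the wrong number of values is rejected with a ValueError.
try:
    person_df["Department"] = ["IT", "Sales"]  # 2 values for 3 rows
except ValueError as error:
    print("Could not add column:", error)

print(person_df)
```

The failed assignment leaves the DataFrame as it was, so the printed result has the Country and Revised salary columns but no Department column.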
Accessing the rows and columns of the DataFrame
Once we have a DataFrame, we often need to access its rows and columns, so let’s see how to do it. You will explore this in more depth later on, but right now, let’s have a quick look at how you can access them. We are going to make use of the loc property of the DataFrame. Here is a sample program –
import pandas as pd
person_data = [["John", 40000], ["Jack", 30000], ["Jill", 25000], ["Jane", 50000], ["Jim", 50000]]
person_df = pd.DataFrame(person_data, columns=["Name", "Salary"])
print(person_df.loc[0])
Output –
Name       John
Salary    40000
Name: 0, dtype: object
The output may seem a little confusing at first. The left side shows the column names “Name” and “Salary”, and the right side shows their values for the first row, which has the label 0. The last line tells us that the result is named 0 (the row label) and has the dtype object, because it mixes text and numbers.
You may have noticed something new here: “loc”. We will study loc in detail later on, but let’s have a quick look at what it means here. It does label-based indexing. So, in the above example, we are getting the row with the label 0. Let’s try it with some different index labels.
import pandas as pd
person_data = [["John", 40000], ["Jack", 30000], ["Jill", 25000], ["Jane", 50000], ["Jim", 50000]]
person_df = pd.DataFrame(person_data, index=['row1', 'row2', 'row3', 'row4', 'row5'], columns=["Name", "Salary"])
print(person_df.loc['row1'])
The output for this code is similar to the one above. This time, we supplied our own row labels through the index argument, and then passed one of those labels to loc to get the data of row1.
import pandas as pd
person_data = [["John", 40000], ["Jack", 30000], ["Jill", 25000], ["Jane", 50000], ["Jim", 50000]]
person_df = pd.DataFrame(person_data, index=['row1', 'row2', 'row3', 'row4', 'row5'], columns=["Name", "Salary"])
print(person_df.loc['row1':'row3'])
The above code gets us the data from row1, row2, and row3. This is label-based indexing, and unlike a normal Python slice, a label slice with loc includes the end label. Let’s not go deeper into loc for now, because that’s a whole other concept and deserves separate attention.
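The examples so far pick out rows. As a brief sketch of the column side, the program below selects a column by its label and then a single value by row label and column label together. It uses a shortened person_df, and the variable names salaries, jack_salary, and first_two are only illustrative.

```python
import pandas as pd

person_data = [["John", 40000], ["Jack", 30000], ["Jill", 25000]]
person_df = pd.DataFrame(
    person_data, index=["row1", "row2", "row3"], columns=["Name", "Salary"]
)

# A single column, selected by its label, comes back as a Series.
salaries = person_df["Salary"]
print(salaries)

# loc also accepts a row label and a column label together.
jack_salary = person_df.loc["row2", "Salary"]
print(jack_salary)  # 30000

# A label slice includes both end labels, so this returns 2 rows.
first_two = person_df.loc["row1":"row2"]
print(len(first_two))  # 2
```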
Understanding Named Indexes
In the previous example, we could see that the index labels were given as strings (like ‘row1’, ‘row2’, etc.). So, with the index argument of DataFrame(), we can name our row labels. Here is another example of the same –
import pandas as pd
data = {
    'Math': [85, 90, 78],
    'Science': [80, 88, 85],
    'English': [75, 82, 79]
}
# Let's create a DataFrame with named indexes
df = pd.DataFrame(data, index=['Arun', 'Tarun', 'Varun'])
print(df)
If you execute the above program, you should see that the row labels are now Arun, Tarun, and Varun. If you remove the index argument, you will find that the labels are 0, 1, and 2 (try it now). Named indexes are useful at times, because they make the data easier to read and easier to reference.
For example, if you need Varun’s marks, asking for the label Varun makes more sense than asking for 2 (which would have been the default). You can also easily access the data through named indexes. Here is an example of the same.
import pandas as pd
data = {
    'Math': [85, 90, 78],
    'Science': [80, 88, 85],
    'English': [75, 82, 79]
}
# Let's create a DataFrame with named indexes
df = pd.DataFrame(data, index=['Arun', 'Tarun', 'Varun'])
print(df.loc['Varun'])
If you run this, the output shows Varun’s marks in each subject.
DataFrames are very important in Pandas, and most of the time you will be working with them. Please go through the concept again and experiment with the examples until you are familiar with it, so that you can comfortably work with DataFrames ahead.
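Named indexes can be combined with column labels as well. The following is a minimal sketch on the same marks data, showing a single mark, several students at once, and the list of row labels.

```python
import pandas as pd

data = {
    "Math": [85, 90, 78],
    "Science": [80, 88, 85],
    "English": [75, 82, 79],
}
df = pd.DataFrame(data, index=["Arun", "Tarun", "Varun"])

# One student's mark in one subject.
print(df.loc["Varun", "Math"])  # 78

# Several students at once: pass a list of labels.
print(df.loc[["Arun", "Varun"]])

# The row labels themselves are available through df.index.
print(df.index.tolist())  # ['Arun', 'Tarun', 'Varun']
```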