I use Python to split a huge database into multiple files. This is useful if you want to distribute different sets of data to various users. For this tutorial, I will be working with Excel files.
You will need:
Image: Contacts Masterlist
- For this tutorial, we'll work with this mock contacts list with 10k records.
Image: Excel Template that we will be using for this tutorial.
- Leave Column A blank and hide it. Python will write the index numbers here later on.
- The succeeding column headers should be the same with the masterlist, unless if you can modify/remove columns in pandas.
- Return all included values for your splits: Click here
- Write your filtered data in Excel: Click here
- Loop #1 and #2: Click here
- See the Results and the whole script!
Step 1: Create a method that returns the values you'll use as filter keys.
For this example, I want to split my data by Sales Reps.
Using pandas, I created a dataframe from my masterlist.
FILE_MASTERLIST = '/Users/eyanamallari/Projects/py_distribute_data/Input/CONTACTS_ALL_STATIC.xlsx' df = pd.read_excel(FILE_MASTERLIST,"CONTACTS_FINAL")
Here's the basic method for getting all unique Sales Reps from the dataframe:
def getSalesRep(): df['Sales Representative'].fillna('Unknown', inplace = True) print('Getting all Sales Representatives') df_roster = df['Sales Representative'].unique() print(df_roster) d = df_roster.tolist() d_final = d print(d_final) print(len(d_final)) return d_final
You may opt to include or exclude certain sales reps by adding these into your code:
EXCLUDE_LIST =  INCLUDE_LIST = ['Calista Rosales', 'Bianca Cardenas', 'Colette Black'] def getSalesRep(): ... # --- EXCLUDE REPS ---- #df_filtered = df[~df['Sales Representative'].isin(EXCLUDE_LIST)] #df_roster = df_filtered['Sales Representative'].unique() # --- INCLUDE REPS ---- df_filtered = df[df['Sales Representative'].isin(INCLUDE_LIST)] df_roster = df_filtered['Sales Representative'].unique() ...
Step 2: Create a method that writes filtered data to Excel.
We will be using pandas' Excelwriter for this step. We will also fill out the parameters
output_folder later on the next step.
def writeExcelFileByRep(owner_value, output_folder): owner = str(owner_value) # Filter by owner df_abridged = df[df['Sales Representative']==owner] rows_target = dataframe_to_rows(df_abridged) # ------------ # Write to Excel # ------------ FILE_PATH = output_folder print(FILE_PATH) book = load_workbook(FILE_PATH) writer = ExcelWriter(FILE_PATH, engine='openpyxl') writer.book = book for sheet in book.worksheets: if sheet.title == 'Contacts': # To clear out the sample records in the template for row in sheet['A1:H4']: for cell in row: cell.value = None # Replenish for r_idx, row in enumerate(rows_target, 1): for c_idx, value in enumerate(row, 1): sheet.cell(row=r_idx, column=c_idx, value=value) writer.save() writer.close() print("Done writing Excel file!")
Another option! Your computer hardware may not be able to keep up with your Python script's lightning speed.
To minimize interruptions, add the following snippet to force Python to just keep trying if any RAM-related errors occur.
def writeExcelFileByRep(owner_value, output_folder): ... constant_tries = 2000 # you can change this number tries = 2000 assert tries > 0 error = None result = None while tries: try: writer.save() writer.close() except IOError as e: error = e tries -= 1 print('Attempt #', (constant_tries-tries)+1) except ValueError as e: error = e tries -= 1 print('Attempt #', (constant_tries-tries)+1) else: break if not tries: print('Attempt #', (constant_tries-tries)+1) raise error print('Attempt #', (constant_tries-tries)+1) print("Done writing Excel file!")
Step 3: Create a method that loops the whole script by sales rep.
I personally call this the looper. The looper works by creating the files for all the returned list in Step 1.
def loopRosterCreateFiles(reps): print("Creating files!") print(reps) count = 1 for rep in reps: print('------') print(count, rep) count = count + 1 output_rep = "Contacts List - " + str(rep) output_folder = OUTPUT_DIRECTORY+ output_rep # Creates Folder for each rep if not os.path.exists(output_folder): os.makedirs(output_folder) print("Created folder for", rep) # Set File Names rep_excel_file = QUARTER+ " Contact List - "+str(rep)+ " " + VERSION +".xlsx".strip() rep_excel_file_no_ext = QUARTER+ " Contact List - "+str(rep)+ " " + VERSION.strip() rep_excel_path = output_folder+"/"+rep_excel_file # Copies template shutil.copy(FILE_TEMPLATE,rep_excel_path) # Writes Filtered Data to Excel writeExcelFileByRep(rep,rep_excel_path)
Now for the main method:
def main(): loopRosterCreateFiles(getSalesRep()) if __name__ == '__main__': main()
Here's how your terminal will look like when you run the script:
The resulting Excel files should look like this:
We were able to create an Excel file containing all records of this sales representative. Notice that the file name is customized as well.
Here's the whole script if you just wanna play with the code