I use Python to split a huge database into multiple files. This is useful if you want to distribute different sets of data to various users. For this tutorial, I will be working with Excel files.

You will need:

  1. Masterlist or Database you want to split: Download here
  2. Excel Template: Download here

Image: Contacts Masterlist

  • For this tutorial, we'll work with this mock contacts list with 10k records.

Image: Excel Template that we will be using for this tutorial.

  • Leave Column A blank and hide it. Python will write the index numbers here later on.
  • The succeeding column headers should be the same with the masterlist, unless if you can modify/remove columns in pandas.

Steps:

  1. Return all included values for your splits: Click here
  2. Write your filtered data in Excel: Click here
  3. Loop #1 and #2: Click here
  4. See the Results and the whole script!

Step 1: Create a method that returns the values you'll use as filter keys.

For this example, I want to split my data by Sales Reps.

Using pandas, I created a dataframe from my masterlist.

FILE_MASTERLIST = '/Users/eyanamallari/Projects/py_distribute_data/Input/CONTACTS_ALL_STATIC.xlsx'

df = pd.read_excel(FILE_MASTERLIST,"CONTACTS_FINAL")  

Here's the basic method for getting all unique Sales Reps from the dataframe:

def getSalesRep():  
    df['Sales Representative'].fillna('Unknown', inplace = True)
    print('Getting all Sales Representatives')

    df_roster = df['Sales Representative'].unique()

    print(df_roster)

    d = df_roster.tolist()

    d_final = d

    print(d_final)
    print(len(d_final))
    return d_final

You may opt to include or exclude certain sales reps by adding these into your code:

EXCLUDE_LIST = []  
INCLUDE_LIST = ['Calista Rosales', 'Bianca Cardenas', 'Colette Black']

def getSalesRep():  
    ...

    # --- EXCLUDE REPS ----
    #df_filtered = df[~df['Sales Representative'].isin(EXCLUDE_LIST)]
    #df_roster = df_filtered['Sales Representative'].unique()


    # --- INCLUDE REPS ----
    df_filtered = df[df['Sales Representative'].isin(INCLUDE_LIST)]
    df_roster = df_filtered['Sales Representative'].unique()

    ...



Step 2: Create a method that writes filtered data to Excel.

We will be using pandas' Excelwriter for this step. We will also fill out the parameters owner_value and output_folder later on the next step.

def writeExcelFileByRep(owner_value, output_folder):  
    owner = str(owner_value)

    # Filter by owner
    df_abridged = df[df['Sales Representative']==owner]
    rows_target = dataframe_to_rows(df_abridged)

    # ------------
    # Write to Excel
    # ------------

    FILE_PATH = output_folder
    print(FILE_PATH)

    book = load_workbook(FILE_PATH)
    writer = ExcelWriter(FILE_PATH, engine='openpyxl')

    writer.book = book

    for sheet in book.worksheets:
        if sheet.title == 'Contacts':

           # To clear out the sample records in the template
            for row in sheet['A1:H4']:
               for cell in row:
                   cell.value = None

            # Replenish
            for r_idx, row in enumerate(rows_target, 1):
                for c_idx, value in enumerate(row, 1):
                    sheet.cell(row=r_idx, column=c_idx, value=value)

   writer.save()
   writer.close()
   print("Done writing Excel file!")

Another option! Your computer hardware may not be able to keep up with your Python script's lightning speed.

To minimize interruptions, add the following snippet to force Python to just keep trying if any RAM-related errors occur.

def writeExcelFileByRep(owner_value, output_folder):  
    ...

    constant_tries = 2000 # you can change this number
    tries = 2000

    assert tries > 0
    error = None
    result = None

    while tries:
        try:
            writer.save()
            writer.close()
        except IOError as e:
            error = e
            tries -= 1
            print('Attempt #', (constant_tries-tries)+1)
        except ValueError as e:
            error = e
            tries -= 1
            print('Attempt #', (constant_tries-tries)+1)
        else:
            break
    if not tries:
        print('Attempt #', (constant_tries-tries)+1)
        raise error

    print('Attempt #', (constant_tries-tries)+1)

    print("Done writing Excel file!")



Step 3: Create a method that loops the whole script by sales rep.

I personally call this the looper. The looper works by creating the files for all the returned list in Step 1.

def loopRosterCreateFiles(reps):

    print("Creating files!")
    print(reps)

    count = 1

    for rep in reps:
        print('------')

        print(count, rep)

        count = count + 1

        output_rep = "Contacts List - " + str(rep)
        output_folder = OUTPUT_DIRECTORY+ output_rep

        # Creates Folder for each rep
        if not os.path.exists(output_folder):
            os.makedirs(output_folder)
        print("Created folder for", rep)

        # Set File Names
        rep_excel_file = QUARTER+ " Contact List - "+str(rep)+ " " + VERSION +".xlsx".strip()
        rep_excel_file_no_ext = QUARTER+ " Contact List - "+str(rep)+ " " + VERSION.strip()
        rep_excel_path = output_folder+"/"+rep_excel_file

        # Copies template
        shutil.copy(FILE_TEMPLATE,rep_excel_path)

        # Writes Filtered Data to Excel
        writeExcelFileByRep(rep,rep_excel_path)

Now for the main method:

def main():

    loopRosterCreateFiles(getSalesRep())

if __name__ == '__main__':  
    main()


Results

Here's how your terminal will look like when you run the script:

The resulting Excel files should look like this:

We were able to create an Excel file containing all records of this sales representative. Notice that the file name is customized as well.

Here's the whole script if you just wanna play with the code